Raft Consensus Algorithm
Explanation
Raft decomposes consensus into three sub-problems: leader election, log replication, and safety. At any time there is at most one leader, and all writes go through it. The leader appends entries to its log and replicates them to followers; an entry is committed once a majority of nodes have acknowledged it. If the leader fails, followers notice the missed heartbeats and start an election; randomised election timeouts stagger when followers become candidates, which prevents repeated split votes. A candidate can only win an election if its log is at least as up-to-date as the logs of a majority of voters, which guarantees committed entries are never lost. Raft is used in etcd (the Kubernetes configuration store), CockroachDB, Consul, and many other distributed systems. Understanding Raft explains why these systems require a majority quorum (2 of 3, 3 of 5) to operate.
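Two quantities from the explanation above can be sketched in a few lines of Python: the majority quorum size for an n-node cluster, and the randomised election timeout (the 150–300 ms range is the example range given in the Raft paper; the function names here are illustrative, not from any real implementation):

```python
import random

def quorum(n: int) -> int:
    """Majority size for an n-node cluster: more than half the nodes."""
    return n // 2 + 1

def election_timeout_ms(low: int = 150, high: int = 300) -> int:
    """Randomised election timeout, as in the Raft paper's example range.

    Each follower picks its own random timeout, so one follower usually
    times out first and wins the election before others become candidates.
    """
    return random.randint(low, high)

for n in (1, 2, 3, 4, 5):
    # Fault tolerance = nodes that can fail while a majority remains.
    print(f"nodes={n} majority={quorum(n)} tolerates={n - quorum(n)}")
```

Note how the table this prints shows that 4 nodes tolerate only 1 failure, the same as 3 nodes: adding a single node to an odd-sized cluster raises the quorum without raising the number of survivable failures.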
Common Mistakes
- Running Raft clusters with an even number of nodes — an even-sized cluster tolerates no more failures than one with a node fewer (4 nodes tolerate 1 failure, the same as 3), while requiring a larger quorum.
- Not monitoring leader elections — frequent re-elections indicate network instability or overloaded nodes; alert on election metrics.
- Placing all Raft nodes in the same availability zone — defeats the purpose; a rack failure or AZ outage takes out the entire cluster.
- Confusing Raft with Paxos — Raft is a specific algorithm designed for understandability; Paxos is a family of protocols that are theoretically equivalent but harder to implement correctly.
Code Examples
# ❌ Deploying etcd with 2 nodes — no fault tolerance
# 2 nodes need both to agree (majority of 2 = 2)
# One failure = cluster down
# This is worse than a single node in some failure modes
# Also: deploying 4 nodes — same fault tolerance as 3
# 4 nodes need 3 to agree; 3 nodes also need 2 to agree
# The 4th node adds cost with no additional fault tolerance
# ✅ etcd cluster sizing — always odd numbers
# 3 nodes: tolerates 1 failure (majority = 2)
# 5 nodes: tolerates 2 failures (majority = 3)
# 7 nodes: tolerates 3 failures (majority = 4) — rarely needed, adds latency
# In Kubernetes: control plane with 3 etcd nodes
# kubeadm init --control-plane-endpoint=lb:6443
# Join 2 more control plane nodes for HA
# Check Raft health in etcd
# etcdctl endpoint health --cluster
# etcdctl endpoint status --cluster --write-out=table
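The "at least as up-to-date" rule that decides which candidate can win an election (Section 5.4.1 of the Raft paper) can be sketched as follows. This is a hedged illustration of the voting check, not code from etcd or any real implementation:

```python
def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      voter_last_term: int, voter_last_index: int) -> bool:
    """A voter grants its vote only if the candidate's log is at least
    as up-to-date as its own: the log whose last entry has the higher
    term wins; if the last terms are equal, the longer log wins.
    """
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

# A candidate missing committed entries cannot gather a majority of votes:
print(log_is_up_to_date(2, 5, 3, 4))  # False: voter's last term is higher
print(log_is_up_to_date(3, 4, 3, 5))  # False: same term, voter's log is longer
print(log_is_up_to_date(3, 5, 3, 5))  # True: at least as up-to-date
```

Because an entry is only committed once a majority holds it, any candidate lacking a committed entry will be rejected by at least one voter in every possible majority, which is how Raft guarantees committed entries survive leader changes.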