Raft Consensus Algorithm
Explanation
Raft decomposes consensus into three sub-problems: leader election, log replication, and safety. At any time there is at most one leader, and all writes go through it. The leader appends entries to its log and replicates them to followers; an entry is committed once a majority of nodes have acknowledged it. If the leader fails, followers notice the missed heartbeats and start an election; randomised election timeouts stagger when followers become candidates, which prevents repeated split votes. A candidate can only win an election if its log is at least as up-to-date as the logs of a majority of voters, which guarantees committed entries are never lost. Raft is used in etcd (the Kubernetes configuration store), CockroachDB, Consul, and many other distributed systems. Understanding Raft explains why these systems require a majority quorum (2 of 3, 3 of 5) to operate.
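Two quantities from the explanation above can be sketched in a few lines of Python: the majority quorum size for an n-node cluster, and the randomised election timeout (the 150–300 ms range is the example range given in the Raft paper; the function names here are illustrative, not from any real implementation):

```python
import random

def quorum(n: int) -> int:
    """Majority size for an n-node cluster: more than half the nodes."""
    return n // 2 + 1

def election_timeout_ms(low: int = 150, high: int = 300) -> int:
    """Randomised election timeout, as in the Raft paper's example range.

    Each follower picks its own random timeout, so one follower usually
    times out first and wins the election before others become candidates.
    """
    return random.randint(low, high)

for n in (1, 2, 3, 4, 5):
    # Fault tolerance = nodes that can fail while a majority remains.
    print(f"nodes={n} majority={quorum(n)} tolerates={n - quorum(n)}")
```

Note how the table this prints shows that 4 nodes tolerate only 1 failure, the same as 3 nodes: adding a single node to an odd-sized cluster raises the quorum without raising the number of survivable failures.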
Common Mistakes
- Running Raft clusters with an even number of nodes — an even-sized cluster tolerates no more failures than one with a node fewer (4 nodes tolerate 1 failure, the same as 3), while requiring a larger quorum.
- Not monitoring leader elections — frequent re-elections indicate network instability or overloaded nodes; alert on election metrics.
- Placing all Raft nodes in the same availability zone — defeats the purpose; a rack failure or AZ outage takes out the entire cluster.
- Confusing Raft with Paxos — Raft is a specific algorithm designed for understandability; Paxos is a family of protocols that are theoretically equivalent but harder to implement correctly.
Code Examples
# ❌ Deploying etcd with 2 nodes — no fault tolerance
# 2 nodes need both to agree (majority of 2 = 2)
# One failure = cluster down
# This is worse than a single node in some failure modes
# Also: deploying 4 nodes — same fault tolerance as 3
# 4 nodes need 3 to agree; 3 nodes also need 2 to agree
# The 4th node adds cost with no additional fault tolerance
# ✅ etcd cluster sizing — always odd numbers
# 3 nodes: tolerates 1 failure (majority = 2)
# 5 nodes: tolerates 2 failures (majority = 3)
# 7 nodes: tolerates 3 failures (majority = 4) — rarely needed, adds latency
# In Kubernetes: control plane with 3 etcd nodes
# kubeadm init --control-plane-endpoint=lb:6443
# Join 2 more control plane nodes for HA
# Check Raft health in etcd
# etcdctl endpoint health --cluster
# etcdctl endpoint status --cluster --write-out=table
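The "at least as up-to-date" rule that decides which candidate can win an election (Section 5.4.1 of the Raft paper) can be sketched as follows. This is a hedged illustration of the voting check, not code from etcd or any real implementation:

```python
def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      voter_last_term: int, voter_last_index: int) -> bool:
    """A voter grants its vote only if the candidate's log is at least
    as up-to-date as its own: the log whose last entry has the higher
    term wins; if the last terms are equal, the longer log wins.
    """
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

# A candidate missing committed entries cannot gather a majority of votes:
print(log_is_up_to_date(2, 5, 3, 4))  # False: voter's last term is higher
print(log_is_up_to_date(3, 4, 3, 5))  # False: same term, voter's log is longer
print(log_is_up_to_date(3, 5, 3, 5))  # True: at least as up-to-date
```

Because an entry is only committed once a majority holds it, any candidate lacking a committed entry will be rejected by at least one voter in every possible majority, which is how Raft guarantees committed entries survive leader changes.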