Raft Consensus Algorithm
debt(d7/e7/b9/t7)
Closest to 'only careful code review or runtime testing' (d7). Misconfigurations like even-node clusters or co-located AZs aren't caught by linters; they surface during chaos testing, partition events, or operational review. No standard SAST detects Raft topology mistakes.
Closest to 'cross-cutting refactor across the codebase' (e7). Fixing a Raft deployment mistake (e.g. relocating nodes across AZs, resizing the cluster, migrating from even to odd membership) requires coordinated cluster reconfiguration, data migration, and downtime planning — well beyond a one-line patch.
Closest to 'defines the system's shape' (b9). Raft (via etcd) is the backbone of Kubernetes and similar control planes; the CP tradeoff, quorum sizing, and AZ topology shape every operational decision around the system. Rewrite-or-live-with-it.
Closest to 'serious trap' (t7). The misconception explicitly states devs assume Raft guarantees availability, but it's CP not AP — quorum loss halts writes. This contradicts the default 'distributed = highly available' intuition, and the even-vs-odd node count is counterintuitive (4 nodes is worse than 3).
Also Known As
TL;DR
Explanation
Raft decomposes consensus into three sub-problems: leader election, log replication, and safety. Time is divided into terms, and each term has at most one leader; all writes go through that leader. The leader appends entries to its log and replicates them to followers. An entry is 'committed' once a majority of nodes (counting the leader) have stored it. If followers stop receiving heartbeats from the leader, their election timeout expires and they start an election; because each node picks its timeout at random, split votes are unlikely and resolve quickly when they do occur. A candidate wins only if a majority of nodes vote for it, and a node refuses to vote for a candidate whose log is less up-to-date than its own. Since every committed entry is stored on a majority, any elected leader already holds all committed entries, so they are never lost. Raft is used in etcd (the Kubernetes configuration store), CockroachDB, Consul, and many other distributed systems. Understanding Raft explains why these systems require a majority quorum (2 of 3, 3 of 5) to operate.
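The commit rule and the voting restriction described above can be sketched in a few lines of Python. This is an illustrative simplification, not code from etcd or any other Raft implementation: the names `quorum`, `commit_index`, `LogPosition`, and `candidate_is_up_to_date` are hypothetical, and the sketch leaves out parts of the real protocol, such as term checks on incoming requests, the one-vote-per-term rule, and the requirement that a leader only commits entries from its own term by counting replicas.

```python
from dataclasses import dataclass


def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def commit_index(match_index: list[int]) -> int:
    """Highest log index that is stored on a majority of nodes.

    match_index holds one value per node (leader included): the highest
    log index known to be replicated on that node.
    """
    ranked = sorted(match_index, reverse=True)
    # The quorum-th largest value is present on at least `quorum` nodes.
    return ranked[quorum(len(ranked)) - 1]


@dataclass(frozen=True)
class LogPosition:
    """Term and index of the last entry in a node's log."""
    last_term: int
    last_index: int


def candidate_is_up_to_date(candidate: LogPosition, voter: LogPosition) -> bool:
    """Voter-side check: is the candidate's log at least as up-to-date?

    The log whose last entry has the later term wins; if the terms are
    equal, the longer log wins.
    """
    if candidate.last_term != voter.last_term:
        return candidate.last_term > voter.last_term
    return candidate.last_index >= voter.last_index
```

With three nodes whose highest replicated indexes are 5, 4, and 2, `commit_index([5, 4, 2])` returns 4, because index 4 is the highest entry held by two of the three nodes. A voter whose log ends in term 3 will reject a candidate whose log ends in term 2, no matter how long the candidate's log is.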
Common Misconception
Why It Matters
Common Mistakes
- Running Raft clusters with even numbers of nodes — an even-sized cluster tolerates no more failures than the odd-sized cluster with one fewer node (4 nodes tolerate 1 failure, the same as 3).
- Not monitoring leader elections — frequent re-elections indicate network instability or overloaded nodes; alert on election metrics.
- Placing all Raft nodes in the same availability zone — defeats the purpose; a rack failure or AZ outage takes out the entire cluster.
- Confusing Raft with Paxos — Raft is a specific algorithm designed for understandability; Paxos is a family of protocols that are theoretically equivalent but harder to implement correctly.
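The sizing and placement mistakes above come down to simple arithmetic, which the following sketch makes explicit. The function names and the node-to-zone mapping are hypothetical and chosen only for illustration; the zone check assumes that an outage removes every node in the affected zone at once.

```python
from collections import Counter


def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a quorum still remains."""
    return cluster_size - quorum(cluster_size)


def survives_any_single_zone_outage(zone_of_node: dict[str, str]) -> bool:
    """True if losing any one zone still leaves a quorum of nodes.

    zone_of_node maps a node name to the availability zone it runs in.
    It is enough to check the zone that hosts the most nodes.
    """
    total = len(zone_of_node)
    largest_zone = max(Counter(zone_of_node.values()).values())
    return total - largest_zone >= quorum(total)


if __name__ == "__main__":
    for size in (2, 3, 4, 5):
        print(size, "nodes: quorum", quorum(size),
              "tolerates", tolerated_failures(size), "failure(s)")
```

Note that spreading three nodes over only two zones does not pass the check: the zone holding two nodes takes the quorum with it when it fails. Three zones with one node each is the smallest layout that survives a single zone outage.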
Code Examples
// ❌ Deploying etcd with 2 nodes — no fault tolerance
// 2 nodes need both to agree (majority of 2 = 2)
// One failure = cluster down
// This is worse than a single node for availability: there are now two machines whose failure halts writes
// Also: deploying 4 nodes — same fault tolerance as 3
// 4 nodes need 3 to agree; 3 nodes need 2 to agree; either way only 1 failure is tolerated
// 4th node adds cost with no additional fault tolerance
# ✅ etcd cluster sizing — always odd numbers
# 3 nodes: tolerates 1 failure (majority = 2)
# 5 nodes: tolerates 2 failures (majority = 3)
# 7 nodes: tolerates 3 failures — rarely needed, adds latency
# In Kubernetes: control plane with 3 etcd nodes
# kubeadm init --control-plane-endpoint=lb:6443
# Join 2 more control plane nodes for HA
# Check Raft health in etcd
# etcdctl endpoint health --cluster
# etcdctl endpoint status --cluster --write-out=table
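Beyond one-off health checks, election frequency can be tracked over time. etcd exposes a Prometheus counter named `etcd_server_leader_changes_seen_total`; the sketch below assumes its samples have already been scraped into a list of `(unix_time, counter_value)` pairs sorted by time. The helper names, the window, and the threshold are hypothetical placeholders to be tuned per cluster, not recommended values.

```python
def leader_changes_in_window(samples: list[tuple[float, float]],
                             window_seconds: float) -> float:
    """Increase of the leader-change counter over the trailing window.

    samples: (unix_time, counter_value) pairs, sorted by time.
    """
    if not samples:
        return 0
    latest_time, latest_value = samples[-1]
    baseline = latest_value
    # Use the earliest sample that falls inside the window as the baseline.
    for sample_time, value in samples:
        if sample_time >= latest_time - window_seconds:
            baseline = value
            break
    # A negative difference means the counter reset (for example after a
    # restart); this crude sketch reports 0 in that case.
    return max(0, latest_value - baseline)


def should_alert(samples: list[tuple[float, float]],
                 window_seconds: float = 900,
                 max_changes: int = 3) -> bool:
    """Flag the cluster when leadership changed too often in the window."""
    return leader_changes_in_window(samples, window_seconds) >= max_changes
```

In practice this kind of rule usually lives in the monitoring system itself rather than in application code; the Python version only shows the logic.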