Raft Consensus with a Minority of Nodes
Recorded: May 27, 2026, 1:21 p.m.
| Original | Summarized |
Raft Consensus with a Minority of Nodes

tl;dr: This post describes a (wacky) modification to the Raft consensus protocol such that progress can be made even if fewer than a majority of nodes are actively participating, given some constraints on exactly which minority of nodes are active. The math behind this comes from the same place as the card game Spot It! (Dobble).

Raft Consensus Basics

Raft is a consensus protocol for managing a replicated log across a cluster of nodes. Its key goals are: (1) maintain a consistent replicated log of state transitions, (2) tolerate node failures, and (3) ensure a single leader coordinates all changes while multiple followers replicate. Raft is designed to be understandable (it decomposes consensus into leader election, log replication, and safety) and is widely used in systems like etcd, CockroachDB, and TiKV.

In steady state, the leader receives client requests and appends them to its log. It then sends AppendEntries RPCs to all followers. Once a majority of nodes (including the leader itself) have appended the entry, the leader considers it "committed" and applies it to the state machine. For example, in a 5-node cluster, the leader needs acknowledgments from at least 2 followers (3 total including itself) before committing. This provides fault tolerance for up to two node failures, or for a network partition in which a majority of nodes can still communicate with each other.

If the leader crashes, a new one is elected. Any node can become a candidate, start an election for a new "term," and request votes. A candidate wins if it receives votes from a majority of nodes. This guarantees that at most one leader exists per term. Once elected, the new leader synchronizes followers' logs and resumes normal operation.

The key correctness insight is this: any two majorities of nodes must overlap in at least one node.
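The overlap claim is easy to check by brute force for the 5-node example. The following is a minimal sketch, not part of Raft itself; the helper name `all_pairs_intersect` is made up for illustration.

```python
from itertools import combinations


def all_pairs_intersect(sets):
    """Return True if every pair of sets shares at least one element."""
    return all(a & b for a, b in combinations(sets, 2))


nodes = range(1, 6)  # a 5-node cluster, nodes numbered 1..5

# Every minimal majority (3 of 5 nodes). Larger majorities are supersets
# of these, so they intersect as well.
majorities = [frozenset(c) for c in combinations(nodes, 3)]

# For contrast: sets of 2 nodes, which are not majorities.
minorities = [frozenset(c) for c in combinations(nodes, 2)]

print(len(majorities), all_pairs_intersect(majorities))  # 10 True
print(all_pairs_intersect(minorities))                   # False
```

The second line of output is the reason an arbitrary minority cannot serve as a quorum: {1, 2} and {3, 4} never see each other.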
So between any two consecutive global state changes (whether two commits, two leader elections, or one of each), at least one node participated in both. This single overlapping node carries forward the knowledge of what was previously committed, preventing conflicts and ensuring consistency. In a 5-node cluster, any two sets of 3 nodes must share at least one member. This overlap is what makes Raft safe.

Spot It!

Figure: Spot It! (also known as Dobble), the author's favorite family card game.

Spot It! (known as Dobble outside North America) is a card game whose rules are straightforward: flip a card from the deck to the center, and race to find the one symbol your card has in common with the center card. Call it out, discard your card, and repeat. It's fast, fun, and requires no reading or arithmetic. It's simple enough for a 5-year-old to learn quickly, yet the game design is surprisingly complex.

Here's the remarkable property that makes the game work: the deck has 55 cards, each with 8 distinct symbols drawn from a pool of 57 unique symbols, and any two cards share exactly one symbol in common. This isn't trivial to engineer. Try designing even 10 cards with this property and you'll find it surprisingly difficult. The game's designers didn't just get lucky: they used a beautiful piece of mathematics, finite projective planes.

Finite Projective Planes

A finite projective plane of order $n$ is a combinatorial structure consisting of points and lines with three key properties: (1) any two distinct points lie on exactly one common line, (2) any two distinct lines intersect in exactly one common point, and (3) every line contains exactly $n + 1$ points and every point lies on exactly $n + 1$ lines. The total number of points equals the total number of lines, and both equal $n^2 + n + 1$.

Figure: The Fano plane, the smallest finite projective plane (order 2). It has 7 points and 7 lines (including the inscribed circle), with 3 points per line and 3 lines per point.
Any two lines intersect in exactly one point.

The smallest example is the Fano plane (order $n = 2$): 7 points, 7 lines, with 3 points on each line and 3 lines through each point. In the diagram above, the seven "lines" are the three sides of the triangle, the three altitudes, and the inscribed circle, each passing through exactly 3 of the 7 points. You can verify that any two of these lines share exactly one point.

Spot It! uses a finite projective plane of order $n = 7$. This gives $7^2 + 7 + 1 = 57$ points (symbols) and 57 lines (cards), with $7 + 1 = 8$ points per line (symbols per card). (The retail deck ships with 55 of these 57 possible cards.) The intersection property guarantees any two cards share exactly one symbol, which is exactly what the game needs.

Finite projective planes are known to exist when the order $n$ is a prime power. Here are some small examples, with the point and line counts following from $n^2 + n + 1$ and $n + 1$:

Order $n = 2$: 7 points and lines, 3 points per line (Fano plane)
Order $n = 3$: 13 points and lines, 4 points per line
Order $n = 4$: 21 points and lines, 5 points per line
Order $n = 5$: 31 points and lines, 6 points per line
Order $n = 7$: 57 points and lines, 8 points per line (Spot It!)

(Order 6 is notably absent: it was proven not to exist. Order 10 was shown not to exist by an exhaustive computer search in 1989. Whether finite projective planes exist for any non-prime-power order remains an open question in combinatorics.)

💡 Key insight: In Raft, the reason we need a majority is the overlap property: any two majorities share at least one node. But majorities aren't the only set systems with guaranteed pairwise intersection. Finite projective planes give us another: any two lines intersect in exactly one point. So if we assign nodes to points and designate each "line" as a valid voting bloc, any two blocs are guaranteed to share a common node, which is the same safety property Raft relies on.

For a 57-node system using the order-7 projective plane, we'd have 57 designated blocs of 8 nodes each. Consensus requires just 8 nodes to agree, far fewer than the 29 needed for a traditional majority. The trade-off, of course, is that not every subset of 8 nodes forms a valid bloc. We'll explore this trade-off later.

Raft with Finite Projective Planes

Here's the general construction.
Given a cluster of $N$ nodes, find the smallest prime power $p$ such that $p^2 + p + 1 \geq N$. Construct the finite projective plane of order $p$, which gives us $p^2 + p + 1$ points (we use $N$ of them as our nodes) and $p^2 + p + 1$ lines. Each line contains $p + 1$ points. We call these lines blocs: the valid quorum sets for our modified protocol.

The modifications to Raft are straightforward:

Log replication: A leader's AppendEntries is considered committed once an entire bloc of nodes (potentially including the leader itself) has appended the entry. In the best case, this requires only $p$ followers to respond (the bloc minus the leader itself).

Leader election: A candidate wins an election once it holds votes from every node of at least one bloc (its own vote included), instead of votes from a majority.

Why is this correct? Consider any two global state changes (commits or elections). Each involved some set of participating nodes that contains at least one complete voting bloc. Call these sets $S_1$ and $S_2$, containing blocs $B_1$ and $B_2$ respectively. By the projective plane intersection property, $B_1 \cap B_2 \neq \emptyset$: there exists at least one node in both blocs. Since $B_1 \subseteq S_1$ and $B_2 \subseteq S_2$, we have $S_1 \cap S_2 \neq \emptyset$. Therefore, at least one node participated in both state changes, preserving Raft's consistency guarantee.

Demo: 7-node cluster with the Fano plane

Let's make this concrete with the smallest non-trivial example: 7 nodes arranged according to the Fano plane (order 2). We have 7 nodes and 7 blocs of 3 nodes each:

$B_1 = \{1, 2, 3\}$
$B_2 = \{1, 4, 5\}$
$B_3 = \{1, 6, 7\}$
$B_4 = \{2, 4, 6\}$
$B_5 = \{2, 5, 7\}$
$B_6 = \{3, 4, 7\}$
$B_7 = \{3, 5, 6\}$

You can verify: any two blocs share exactly one node. (E.g., $B_1 \cap B_4 = \{2\}$, $B_2 \cap B_7 = \{5\}$, etc.)

Scenario 1: Steady state, all nodes active

Status: all 7 nodes are active.

The responding set {1,2,3,4,5,6,7} contains every bloc. The leader commits as soon as any bloc is complete. For instance, once nodes 2 and 3 respond, bloc $B_1 = \{1,2,3\}$ is satisfied.

Scenario 2: Only 3 nodes active, and they form a bloc

Status: nodes 1, 2, and 3 are active; nodes 4, 5, 6, and 7 are down.

The responding set is {1, 2, 3} = bloc $B_1$.
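The bloc rule reduces to a subset test. Below is a minimal sketch of that check, assuming the seven blocs are the standard Fano lines on nodes 1 to 7, a labeling that agrees with the blocs the text names explicitly ($B_1$, $B_4$, $B_5$). The names `FANO_BLOCS`, `satisfied_blocs`, and `has_quorum` are hypothetical, not part of any Raft implementation.

```python
from itertools import combinations

# Assumed bloc table: the standard Fano lines, listed as B_1 .. B_7.
FANO_BLOCS = [
    frozenset(b)
    for b in [(1, 2, 3), (1, 4, 5), (1, 6, 7), (2, 4, 6),
              (2, 5, 7), (3, 4, 7), (3, 5, 6)]
]

# Safety precondition: every pair of blocs shares exactly one node.
assert all(len(a & b) == 1 for a, b in combinations(FANO_BLOCS, 2))


def satisfied_blocs(responders):
    """Return the blocs that are fully contained in the responding set."""
    responders = set(responders)
    return [b for b in FANO_BLOCS if b <= responders]


def has_quorum(responders):
    """A commit or election succeeds once at least one bloc is complete."""
    return bool(satisfied_blocs(responders))


print(has_quorum(range(1, 8)))    # True:  all nodes respond
print(has_quorum({1, 2, 3}))      # True:  exactly bloc B_1
print(has_quorum({2, 3}))         # False: no complete bloc
print(has_quorum({1, 2, 4, 7}))   # False: a majority, but no bloc
```

In a real leader, the same test would run each time an AppendEntries acknowledgment or a vote arrives, with `responders` being the set of nodes heard from so far (including the leader or candidate itself).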
Even though 4 out of 7 nodes are down (a majority has failed!), the protocol can still make progress because the active nodes happen to form a valid bloc.

Scenario 3: Only 2 nodes active, no bloc possible

Status: nodes 2 and 3 are active; the other five are down.

Node 2 gets a vote from node 3, so its vote set is {2, 3}. The blocs containing node 2 are $B_1 = \{1,2,3\}$, $B_4 = \{2,4,6\}$, $B_5 = \{2,5,7\}$. None of these is a subset of {2, 3}, so no bloc is satisfied and the election cannot succeed.

Scenario 4: Successful leader election, with a provable overlap

The vote set is {2, 4, 6} = bloc $B_4$. Election succeeds! Because $B_4 \cap B_1 = \{2\}$, node 2 also took part in the commit through $B_1$ in Scenario 2, so the new leader's bloc is guaranteed to know about it.

Scenario 5: A majority is active, but no bloc is present

Status: nodes 1, 2, 4, and 7 are active; nodes 3, 5, and 6 are down.

The active set is {1, 2, 4, 7}. Let's check every bloc:

$B_1 = \{1, 2, 3\}$: node 3 is down ✗
$B_2 = \{1, 4, 5\}$: node 5 is down ✗
$B_3 = \{1, 6, 7\}$: node 6 is down ✗
$B_4 = \{2, 4, 6\}$: node 6 is down ✗
$B_5 = \{2, 5, 7\}$: node 5 is down ✗
$B_6 = \{3, 4, 7\}$: node 3 is down ✗
$B_7 = \{3, 5, 6\}$: nodes 3, 5, and 6 are down ✗

No bloc is fully contained in the active set. Even though a majority of nodes is available, the system cannot make progress.

Scenario 6: Full recovery

Status: all 7 nodes are active again.

The leader sends AppendEntries to all. As soon as two followers that complete a bloc with the leader respond, entries commit. Recovered nodes that missed entries get their logs brought up to date via Raft's normal log replication mechanism. The system is fully operational again: every possible bloc is satisfiable.

Trade-off

The fundamental trade-off is clear from Scenario 5: unlike classical Raft, our modified protocol is not guaranteed to make progress whenever a majority of nodes is active. We need the active set to contain at least one complete bloc. So the natural question is: given a random subset of $k$ active nodes out of $N$ total, what's the probability that it contains at least one of our blocs?

Let's work this out. We have a projective plane of order $p$ with $N = p^2 + p + 1$ nodes and $N$ blocs of size $p + 1$ each.
If $k$ nodes are active (chosen uniformly at random), the probability that a specific bloc is entirely contained in the active set is:

$$P(\text{bloc } B_i \subseteq \text{active set}) = \frac{\binom{N - (p+1)}{k - (p+1)}}{\binom{N}{k}}$$

(Given that all $p + 1$ nodes of $B_i$ are active, we choose the remaining $k - (p+1)$ active nodes from the $N - (p+1)$ nodes not in bloc $B_i$.)

By a union bound (which overcounts because a single active set can contain several blocs), the probability that at least one bloc is present is at most:

$$P(\text{any bloc present}) \leq N \cdot \frac{\binom{N - (p+1)}{k - (p+1)}}{\binom{N}{k}}$$

For an exact answer, we'd use inclusion-exclusion, but the union bound gives a useful upper estimate. Let's compute some examples.

Example: Order 2 (Fano plane, $N = 7$, bloc size 3)

With the Fano plane, $N = 7$ and each bloc has 3 nodes. We can compute exact probabilities by brute force: there are $\binom{7}{k}$ equally likely subsets of $k$ active nodes, and we count how many contain at least one of the 7 blocs.

Active nodes ($k$) = 3: probability $7/35 = 20\%$; classical majority: No
Active nodes ($k$) = 4: probability $28/35 = 80\%$; classical majority: Yes
Active nodes ($k$) $\geq 5$: probability $100\%$; classical majority: Yes

$k = 3$: $\binom{7}{3} = 35$ possible sets. Exactly 7 of these are blocs (each bloc is one of the 35 triples). So $P = 7/35 = 20\%$.

$k = 4$: $\binom{7}{4} = 35$ possible sets. Each bloc extends to a 4-node set in 4 ways, and no 4-node set contains two blocs (two blocs together cover 5 nodes), so $7 \cdot 4 = 28$ sets contain a bloc and $P = 28/35 = 80\%$.

So with the Fano plane: you can sometimes make progress with just 3 active nodes (20% chance if the active set is random), with 4 active nodes you succeed 80% of the time, and with 5+ you're always fine. Classic Raft always works at $k = 4$, but our scheme has a 20% failure rate there. That is the cost of needing only 3 nodes in the best case.

Example: Order 7 ($N = 57$, bloc size 8)

With 57 nodes, 57 blocs of size 8, and a classical majority quorum of 29, the landscape looks quite different.
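The Fano figures and the general union bound can be reproduced in a few lines. This is a sketch under two assumptions: the bloc table is the standard Fano labeling on nodes 1 to 7, and the function names are invented for illustration. The bound is capped at 1 because the raw union bound can exceed it.

```python
from itertools import combinations
from math import comb

# Assumed bloc table: the standard Fano lines on nodes 1..7.
FANO_BLOCS = [
    frozenset(b)
    for b in [(1, 2, 3), (1, 4, 5), (1, 6, 7), (2, 4, 6),
              (2, 5, 7), (3, 4, 7), (3, 5, 6)]
]


def exact_fano_probability(k):
    """Brute force: fraction of k-node active sets containing a bloc."""
    subsets = [set(s) for s in combinations(range(1, 8), k)]
    hits = sum(1 for s in subsets if any(b <= s for b in FANO_BLOCS))
    return hits / len(subsets)


def union_bound(order, k):
    """Upper bound N * C(N-(p+1), k-(p+1)) / C(N, k), capped at 1."""
    n = order * order + order + 1   # number of nodes and of blocs
    bloc = order + 1                # nodes per bloc
    if k < bloc:
        return 0.0                  # too few nodes to hold any bloc
    return min(1.0, n * comb(n - bloc, k - bloc) / comb(n, k))


for k in range(3, 8):
    print(k, exact_fano_probability(k), union_bound(2, k))

# Order 7 (57 nodes): a single bloc's worth of nodes, then a majority.
print(union_bound(7, 8))
print(union_bound(7, 29))
```

For the Fano plane the bound matches the exact value at $k = 3$ and $k = 4$, since no set that small contains two blocs; from $k = 5$ on it saturates at 1.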
Exact computation via brute force is infeasible ($\binom{57}{20} \approx 10^{15}$), but the union bound gives a reasonable estimate for moderate $k$:

$$P(\text{any bloc present}) \leq 57 \cdot \frac{\binom{49}{k - 8}}{\binom{57}{k}}$$

Active nodes ($k$) = 8: probability $\approx 0.0000035\%$; classical majority: No

The numbers are sobering: even at the classical majority threshold of $k = 29$, a random subset has at most about a 15% chance of containing a valid bloc. Our scheme trades away the guarantee of progress at majority for the possibility of progress with far fewer nodes, but that possibility is slim unless the active set is much larger than a single bloc.

Final Thoughts: Erdős–Ko–Rado theorem

Stepping back, what we're really trying to do is solve an optimization problem: given $N$ nodes, find a family of subsets (our "blocs") such that (1) any two subsets in the family intersect (the safety requirement), (2) the size of each subset is minimized (so fewer nodes need to be active for a quorum), and (3) the number of subsets in the family is maximized (so a random set of active nodes is more likely to contain one).

Finite projective planes give one beautiful construction, but they're not the only option. The Erdős–Ko–Rado theorem (1961) provides fundamental bounds on exactly this kind of structure. It tells us: given $N$ points, what is the maximum number of subsets of size $r$ such that any two subsets share at least one element? The answer is $\binom{N-1}{r-1}$ (when $N \geq 2r$), achieved by fixing one element and taking all $r$-subsets containing it.

This gives us a framework for understanding the trade-off space. If we want blocs of size $r$ from $N$ nodes, the maximum number of pairwise-intersecting blocs is $\binom{N-1}{r-1}$. Projective planes are special because they achieve a particularly elegant balance: they give $N$ blocs of size $\sqrt{N}$ (roughly), all pairwise intersecting in exactly one point.
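To get a feel for how far a projective plane sits below the EKR ceiling, the bound can be evaluated for the two planes used in this post. A small sketch, with `ekr_max_blocs` as a made-up helper name:

```python
from math import comb


def ekr_max_blocs(n, r):
    """Largest family of pairwise-intersecting r-subsets of n points."""
    if n < 2 * r:
        # Two r-subsets cannot be disjoint, so every r-subset qualifies.
        return comb(n, r)
    return comb(n - 1, r - 1)  # Erdos-Ko-Rado bound, valid for n >= 2r


for order in (2, 7):
    n = order * order + order + 1   # nodes, and also blocs in the plane
    r = order + 1                   # bloc size
    print(f"N={n}, r={r}: plane has {n} blocs, "
          f"EKR allows up to {ekr_max_blocs(n, r)}")
# N=7, r=3: plane has 7 blocs, EKR allows up to 15
# N=57, r=8: plane has 57 blocs, EKR allows up to 231917400
```

Note that the family achieving the bound routes every bloc through one fixed node, so losing that node blocks all progress. Bloc count alone is therefore not the quantity to maximize.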
The EKR theorem tells us we could potentially have more blocs of the same size if we relaxed other structural constraints — but the projective plane's rigid structure makes it easy to construct and reason about. The deeper question is: can we find intersecting families that beat projective planes on the metric we care about most — the probability that a random $k$-subset of active nodes contains at least one bloc? This is an open design space, and the EKR theorem provides the ceiling on how many blocs we can have for a given size. Exploring constructions that approach this ceiling while remaining practical to implement could be a direction worth pursuing. There is also a completely different angle: forget symmetry, and optimize for real-world failure patterns instead. In practice, failures are not random — they tend to be correlated within failure domains such as racks, availability zones, or regions. If we're willing to design our bloc family around the specific failure topology we care about, we don't need a projective plane at all. For example, suppose a cluster spans three availability zones. We could simply define three blocs, one rooted in each AZ, such that each bloc has at least one node from each of the other two AZs. Any two such blocs share at least one node (since they each reach into the other's home zone), satisfying the intersection property — and any two-AZ failure leaves the third bloc intact. This isn't as mathematically elegant, and it requires thinking carefully about your deployment rather than turning a combinatorial crank, but it will likely be more effective in practice than betting on a random active set containing a Fano-plane triple. Last updated on May 24, 2026 |
The author explores a modification to the Raft consensus protocol designed to allow progress even when fewer than a majority of nodes are actively participating, contingent upon the active nodes forming specific, mathematically structured sets. Raft, fundamentally, ensures consistency through a replicated log where a leader commits entries only after receiving acknowledgments from a majority of nodes, which provides fault tolerance for node failures. This safety mechanism is rooted in the property that any two majorities must overlap, ensuring that consecutive state changes share at least one participating node. The text introduces the concept of finite projective planes, drawing an analogy from the card game Spot It!, where the structure is governed by mathematical principles ensuring pairwise intersection. A finite projective plane defines a combinatorial structure of points and lines where any two lines intersect in exactly one point. This property provides an alternative basis for ensuring safety in distributed systems. By mapping nodes to points and defining potential quorum sets, the principle of projective plane intersection guarantees that any two such sets (blocs) share at least one node, thus preserving the consistency guarantee relied upon by Raft. The proposed modification involves redefining consensus triggers based on these geometric blocs. For log replication, an entry is committed once an entire bloc of nodes has appended the entry, potentially requiring only the response of the bloc minus the leader. For leader election, a candidate wins by securing votes from all nodes within a single bloc. The correctness of this modification stems from the intersection property: if two state changes involve blocs $B_1$ and $B_2$, they must share a node, ensuring knowledge of prior commits is carried forward, mirroring Raft's inherent safety mechanism. 
An analysis using a seven-node cluster arranged according to the Fano plane demonstrated the protocol's behavior across various scenarios. While the system succeeded easily when all nodes were active, progress was possible even when only a specific configuration of three nodes formed a valid bloc, successfully committing an entry. Conversely, the system can become stalled if the active nodes do not form any valid bloc, even if a classical majority is present. A significant trade-off is identified: unlike classical Raft, this scheme sacrifices the guarantee of progress upon achieving a simple majority, requiring the active set to contain at least one valid bloc. Probability calculations further elucidate this trade-off. For a system based on the Fano plane (seven nodes), while a classical majority threshold of four nodes ensures progress in Raft, the modified protocol only guarantees progress if the set of active nodes contains a valid bloc, which occurs only 80 percent of the time when four nodes are active, and 100 percent when five are active. For larger systems, such as a fifty-seven node setup based on the order-seven projective plane, the probability that a randomly chosen majority contains a bloc is substantially lower, indicating that the guarantee of progress under the modified scheme is less certain than the classical majority guarantee. The text concludes by framing the problem in terms of an optimization challenge: maximizing the number of pairwise-intersecting subsets (blocs) while minimizing their size to improve fault tolerance, and increasing their total count to enhance the probability that a random active set contains one. 
The Erdős–Ko–Rado theorem establishes a bound on the maximum number of such subsets, suggesting that while projective planes offer an elegant construction, exploring other family structures or real-world failure topology-based groupings may offer more practical optimization for deployment, prioritizing real-world failure patterns over purely combinatorial elegance. |