Do not worry. I will end this series. It’s just…taking a while.
You’ve been watching me implement the Raft distributed consensus algorithm from this paper for almost nine months (with a five month intermission—I’m not that slow, y’all). I try to maintain a refreshing banter, but with plenty of code samples so you can see exactly how it works.
We’re ready to address the final implementation chunk: configuration changes.
What if we have to change out the servers?
At this point, the Raft paper expects that our current implementation works from a fixed cluster configuration. That’s not quite true: each of my servers, upon startup, puts its name and port in a
server_registry file. Then, when a server needs to broadcast a message, it scrapes this file for all the unique names and broadcasts to those ports. So I can add as many servers as I want, whenever I want, and they’ll get added to the Raft configuration.
That doesn’t mean that this is a great solution. Why:
- The file gets longer indefinitely. Servers write their name and port to this registry file every time they start, so there are repeat entries as servers go down and come back up. Over time, servers have to scrape a longer file to get the port set, so broadcasts take longer.
- n + 1 server replacements will break the cluster. Nothing gets removed from the registry, and servers treat all the unique names in the registry as voting servers. So if I perform, over time, six replacements on a five-server cluster, I will have 11 servers named in the registry. At that point, it will no longer be possible for the leader to commit entries or for candidates to get elected leader because those things require responses from a majority of the cluster—and six out of the 11 servers will never respond.
Raft prescribes a two-phase approach for configuration changes.
In this approach, there are two sets of servers: the old one and the new one. First, just the old one is up. Then, both the old and the new one is up. Then the old one is brought down. In the middle, both the old and the new configurations need to agree on stuff, but only some stuff, and they have to get separate majorities between them to make things happen.
There’s gotta be something I’m missing, because that sounds complicated and finicky, particularly for a system whose whole design is meant to be resilient to servers randomly going down whenever. If Raft does what it says on the tin, I should be able to make a “configuration change” by mercilessly tanking 40% of my servers. The catch: I need to tell the remaining servers to stop looking for responses from the ones that I tanked.
So I simplified the configuration change process for my purposes. Here’s the difference between the canonical approach and mine: I don’t configure the cluster by defining the entire old or new configuration in a single decree. Instead, I register and de-register servers one by one. The leader replicates logs and commits when a majority of the currently registered servers respond.
To make this work, I need to change the way servers get registered. As of this commit, I no longer write a server’s information to the registry on startup:
I also stop using the
server_nodes method, which reads the
server_registry and produces the list of servers in the cluster from that registry:
Instead, servers populate their list of active servers from the log. Commands to register and deregister specific servers get interspersed with write commands to the key value store in the log:
So our log might look something like this:
This means that, when we start servers with empty logs, we need to send commands to register them, that get placed in the logs and then executed, before the servers will start communicating with each other. This will work, but I get real nervous when I have to type under time pressure, because if I can’t register a server’s compatriots before its election timeout, it won’t have the information it needs to request votes or get elected leader. So I start the cluster with a prefabricated log where the first few commands go ahead and register my starting servers.
Once servers could change their understanding of the cluster configuration based on log commands, they needed to accept requests that add such commands to the log while the servers are already running. This allows me to change the configuration on a server cluster without having to restart it.
In this commit, the
server begins accepting commands from a client to register and deregister servers:
key_value_store starts executing those commands to change the state of the dictionary that stores cluster information, much in the same way it executes log commands to change the state of the key value store of data itself.
At this point, there’s an issue: ports. If a server comes up with an out-of-date log and does not yet know which server names to associate with which ports, it does not have the port info it needs to respond to a server that contacts it with only a name as the return address.
This is the latest iteration of a recurring theme in Raft: we can never rely on the state of an individual follower server as the source of truth. Instead, all of the necessary context about the system must be communicated in the servers’ messages to one another. That includes the port. So in this commit I add an explicit port declaration to the return address that each server already sends with each request.
There’s another issue: at this point, I’m executing the change to the server’s cluster dictionary in the
write_to_state_machine method. That’s incorrect. Here’s why:
A server always uses the latest configuration in its log, regardless of whether the entry is committed.Part 6: Cluster Membership Changes, The Raft Paper
We need to do this to avoid getting stuck in a chicken-egg situation where the new server configuration must commit its own registration changes before it can commit anything.
We fix it in this commit by moving that logic to
So that’s our basic cluster reconfiguration implementation. I can issue commands from the client such as “register tom 10001” or “deregester tom 10001” to add and remove a server named tom, available at port 10001, from the list of servers that receives
AppendEntries calls, or to whom the leader is looking for a majority response to commit an entry.
That leaves three remaining issues, though.
The Raft paper steps through these one by one.
The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries.
In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed.Part 6: Cluster Membership Changes, The Raft Paper
I implemented this with an attribute on the server called “voting.” The command that we use to start servers can explicitly set this to False so that we can start servers in a non-voting state when we are adding them to an existing cluster. I implement that change in this commit.
The attribute gets set to True when the new server’s log has caught up to the others—that is, upon the server’s first successful response to an
AppendEntries call from the leader:
As of this commit, servers only respond to a
RequestVotes call if they are caught up, voting servers…
…and when a server gets caught up and gains the ability to vote, it broadcasts a message announcing that fact…
…so that the leader can add a line to the log to indicate that the server earned voting privileges…
…and it can add that server to the dictionary of servers that it counts when looking for a majority of voting servers to respond to a
Now, we might have a log that looks something like this when the server
dick at port 10002 gets added to an existing cluster in a non-voting state:
So now we know that a server is caught up before it starts voting. On to the next problem:
The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to the follower state) once it has committed the Cnew log entry.This means that there will be a period of time (while it is committing Cnew) when the leader is managing a cluster that does not include itself.
It replicates log entries but does not count itself in majorities.Part 6: Cluster Membership Changes, The Raft Paper
In this case, we would have a server where
self.leader is True but
self.voting is False. In theory, it shouldn’t ever matter what an active leader’s
self.voting status is, because an active leader prevents the start of a new election.
Once again, it’s odd to me that we go through this rigamarole with leadership transfer in a system that is literally designed with the express purpose of quickly rectifying the situation if a leader just disappears. I can bring down the leader, wait for a new election to finish, and deregister the leader.
But, I don’t even have to do that. I can deregister the leader and then bring it down because the leader only sends
AppendEntries requests; it does not need to receive them. It’s true that the cluster leader may not be a part of the new configuration, but with the current implementation, deregistering a leader does not cause any issues.
We have one last situation to address:
The third issue is that removed servers (those not in Cnew) can disrupt the cluster. Those servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to the follower state…
To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote.Part 6: Cluster Membership Changes, The Raft Paper
On my system, the election timeouts range from 10 to 18 seconds. So to resolve this issue, servers should not grant their vote if they receive a
RequestVote call within 10 seconds of having heard from a leader. That happens in this commit.
First I give the server an attribute called
election_allowed that gets set to True on a 10 second timer…
…and set to False (with the 10 second timer restarted) each time it receives an
It then denies its vote if a candidate reaches out with a
RequestVote call within that 10 second period.
With this system, I can issue commands to register and deregister servers, then bring them up and down, and keep the cluster going while changing out the individual pieces. In using and testing this system, here’s what I do:
- When I start the cluster, I manually register the initial set of servers by typing them into the log right before startup like so:
This way, these servers can start, and participate in, one another’s elections to elect a leader. I could also do this by bringing them up and then issuing commands from a client to register them, but I don’t trust my typing speed, so I do this instead.
- When adding a server to the cluster, I issue a command from a client to the leader to register that server. Then, I start the server in the non-voting state. Once that server has its log up to date, it automatically switches itself to a voting server.
- When removing a server from a cluster, I issue a command from a client to the leader to deregister the server. Then I bring the server down. This worked fine whether I was deregistering a follower or the leader itself. Theoretically if I brought down the leader before its deregistration log was replicated, other servers would still look for it. As trite as this is going to sound, I avoided that problem by not doing that (to be honest I was having a hard time engineering a situation that brought the server down fast enough to cause that problem). The deregistration command doesn’t even need to be committed yet because a) servers go off the latest cluster configuration dictated in their log regardless of whether the entries are committed and b) only a server with an up-to-date log can become the leader.
A Further Note on Code Design
To handle configuration changes, I needed to pull some functions into the server object that I had previously attempted to put in their own file. In this commit, for example, I move the methods into the server that add a return address to, or strip a return address from, a server request/response.
In other words, I guessed wrong about where to put seams in my implementation. But I find myself wondering whether my mistake lay, not in misplacing seams, but in attempting to introduce seams in the first place. Raft kind of resists seams. Its implementation revolves around a few orthogonal execution paths that together manipulate a shared set of state variables. Most changes loan themselves, not to adding a new collaborator, but to adding functionality to an existing collaborator.
Let’s look at an example from this portion of the implementation.
In this post, I describe adding server registration and deregistration that writes to, and gets executed from, each server’s log. The object responsible for managing the log is called
KeyValueStore. That made sense when the log exclusively stored information about the key value pairs that the cluster exists to replicate.
Now the cluster must replicate a different kind of information in the exact same way—in fact, in the exact same file. I thought about building a separate concern for this, but that concern would need to replicate much of the functionality of
KeyValueStore. Perhaps an opportunity for an abstract class? Maybe, though Python doesn’t encourage it (I wouldn’t call this kludge that leverages
Even if I did that, then I have two concerns editing the same file. The file, in fact, that the whole Raft system exerts itself to make consistent across multiple servers. Giving two concerns write access to this file opens up the possibility for race conditions that lead to a non-deterministic log result given the same set of inputs in similar time domains. Instead of adjudicating this, it made sense to me that the log-to-state and state-to-log translations, together, represented a single responsibility, to be covered by a single concern. Instead of making a separate class to
KeyValueStore, I added the registration and deregistration functionality to that class and renamed it to
LogManager in this commit.
This decision gets me further away from the takeaways of books like Refactoring to Patterns or Domain Driven Design, which lay out the imperative paradigm of breaking up our code into smaller objects. Philosophy of Software Design proposes another approach: one in which we strive for a deep set of functionality covered by a single API. Who should programmers listen to?
I don’t think the answer comes down to Gang of Four vs. Ousterhout (though I’d be remiss not to mention that the Ousterhout of that book is the same Ousterhout who co-authored the Raft paper). Instead, I think the answer comes down to context.
Here’s what I mean by that:
I think a lot of software engineering instruction gets presented as universal, when in fact it’s context-specific.
Either the instructor doesn’t realize that it’s context-specific because they’ve only worked in one context, or they don’t communicate that it’s context-specific even if they know it is.
A common example piece of programming advice: “legibility is more important than performance.” Most programmers who say this don’t add the context part: “Legibility is more important than performance…when building end-user applications, usually.” Because if you’re building a compiler, this is not true. If your iOS app loops over a collection of 5 things in a relatively inefficient way, it’s probably fine. If your programming language parses every line at the speed of cold molasses because your compiler uses tiered conditionals instead of an ugly-but-faster case statement, it is not fine.
But back to Raft, and what to do about the objects. I think that there is context here to take into account. Before any refactoring happened—before any of my code got written—this system’s design impacts the way its implementation gets factored, and that’s worth a look. In fact, I think it’s worth a really close look. So I’m leaving it to the next post because this one is long and I know you’re tired. Come back fresh next time, and bring an espresso: we have stuff to discuss.
If you liked this piece, you might also like:
The Philosophy of Software Design series
Techtivism! Because if you’re enough of a maverick to enjoy watching me rant about Raft for 15 posts, you’re enough of a maverick to care about your how to wield your role as a technologist.
The series on Bob Nystrom’s…book? Open education project? Anyway, it’s called Crafting Interpreters, and by the time you read the existing pieces I’ll be updating it again with new posts.
I had Crafting Interpreters, Raft, and debugging series all going at once, and then I needed to design a graduate course and adapt to pandemic life. Eventually, I accepted the fact that only one or two of these things could happen at a time. So now that the course and the debugging series are done and Raft is winding down, I can pick something else up again.