In December, I took a course in which we attempted to implement the Raft distributed consensus algorithm from this paper. Parts 1-5 of this series share insights from the course. From then on, I’m guiding you through my continued work implementing Raft “for fun” (I know. I don’t understand me, either).
In the previous post, we wrapped up a log replication mechanism recognizable as the one the paper calls for. A follower server can tell when its log is out of date and send that information to the leader, who can then backtrack to a common log entry and catch up the follower server from there.
So our system self-heals. But who is the healer?
So far the implementation relies on me, its overlord, to arbitrarily designate a server as the leader when I start it up. If I bring the leader down*, there is no more leader. No one will replicate the logs with the other servers. No one will even accept commands from a client. What a sad situation that would be.
The fix for this in Raft: leader election. That is, the remaining follower servers will choose a leader amongst themselves.
I want the implementation to be able to pass, at a minimum, the following set of manual tests:
- Fire up a leader and two follower servers, with the leader having a log with multiple commands and the followers having empty logs.
- Once the leader has started issuing heartbeats to the followers, shut it down.
- Ensure that the two remaining followers choose a leader between themselves.
- Start up a client and issue some read commands to that leader. Ensure that its state matches the totality of its current logs.
- Issue some write commands to that leader. Ensure that it replicates those commands on the follower.
- Restart the former leader server, this time as a follower. Ensure that the new leader catches up the restarted server on the commands issued while it was down.
- Shut down the current leader server.
- Ensure that the remaining servers again choose a leader.
- Write some commands to this leader, and ensure they are replicated.
- Again restart the downed server and ensure the new leader brings it back up to date.
How does the Raft paper prescribe that we implement this?
5.2 Leader election
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
This can be a tough thing to picture from text alone, so I took a gif from RaftScope. If you want to gain a better understanding of Raft but you don’t feel like undertaking to write it yourself (honestly, what a ridiculous pastime; who would do such a thing?), RaftScope shows you a diagram of a server cluster and a handy term table. It also lets you start/stop servers, change election timeouts, and even change the playback speed, so you can see for yourself what would happen in whatever situation you manage to contrive!
Anyway, here’s what it looks like when five servers all start as followers with slightly different election timeouts, shown here as the grey arcs surrounding each server:
A step-by-step explanation of the gif above:
- The first server to time out (S1) starts an election (it turns blue at about 0.120s).
- S1 votes for itself (the solid dot in the five dots that appear when it turns blue).
- S1 sends out RequestVote RPCs (the solid green dots). These are all the first RequestVote RPC that any of the servers have received this term, so they all respond with their votes (the green dots with crosses in them).
- S1 then becomes the leader (the solid black border that appears around it in lieu of a timeout arc, since leaders don’t have timeouts for hearing from the leader!)
- S1 sends out its first heartbeat (the solid orange dots at about 0.230s).
- When each of the other servers receives the heartbeat, their grey arcs increase to demonstrate that the election timeout resets each time a follower receives a valid AppendEntries RPC. This will prevent them from starting new elections.
Time to make it work!
First, I needed to implement election timeout, which I do in this commit. There are three steps here: give each server a timeout, start a timer that counts down from that timeout, and start an election when that timer hits zero. I set the timeout and start the timer when the server gets instantiated, and I call a method `start_election` when that timer gets to zero.
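Here’s a minimal sketch of that setup. The names (`Server`, `start_election`) and the use of `threading.Timer` are my stand-ins for illustration, not necessarily what the linked commit does:

```python
import random
import threading

# A minimal sketch of the election timeout idea; the real implementation's
# names and structure may differ.
class Server:
    def __init__(self):
        # Randomized timeout (10 to 18 seconds here, matching the slowed-down
        # values discussed below) so servers rarely time out in lockstep.
        self.election_timeout = random.uniform(10, 18)
        self.timer = threading.Timer(self.election_timeout, self.start_election)
        self.timer.start()

    def start_election(self):
        # Placeholder for now; the rest of the post fleshes this out.
        print("I'm starting an election!")
```

Note that the timer thread fires `start_election` exactly once when it expires, which is why, without a reset, the server would only ever start one election.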
As you can see, the `start_election` method doesn’t start an election yet: it just impersonates Jason Mendoza.
Right now, it’s going to do that just once, when the original timeout happens, and never again, and there’s nothing any other server can do to stop it.
So I reset the timer each time a follower server receives an `append_entries` call, since such a call means there’s a server acting as the leader somewhere.
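In sketch form, with the same hypothetical names as before, the reset might look like this (a `threading.Timer` can’t be restarted, so the reset builds a fresh one):

```python
import random
import threading

# Hypothetical sketch; the real implementation's names may differ.
class Follower:
    def __init__(self):
        self.election_timeout = random.uniform(10, 18)
        self.timer = threading.Timer(self.election_timeout, self.start_election)
        self.timer.start()

    def start_election(self):
        print("I'm starting an election!")

    def append_entries(self, entries):
        # A valid call from a leader means we should not start an election:
        # cancel the countdown and begin a fresh one.
        self.reset_election_timer()
        # ...the log replication work from the previous posts goes here...

    def reset_election_timer(self):
        self.timer.cancel()
        self.timer = threading.Timer(self.election_timeout, self.start_election)
        self.timer.start()
```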
Great, so we have our election timeouts set up. Now it’s time to start an election.
So what does it mean to start an election?
Once again, I’ll defer to the paper:
To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens:
(a) it wins the election,
(b) another server establishes itself as leader, or
(c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.
We’ll get to the three things that happen later. First, let’s get `start_election` doing something more useful (albeit, perhaps, less entertaining).
We increment the term (line 45), put the server in the candidate state (line 46), have the server vote for itself (line 48), and send out a new request to all the other servers asking for their vote (lines 49-51).
In the server initializer, we’ll instantiate a dictionary to store votes. We will use this dictionary to determine when a majority of servers have voted for this server as leader, much in the way we store responses to `append_entries` to determine when a majority have replicated the logs.
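A sketch of those pieces together. Everything here (the `Candidate` class, `all_server_names`, returning request tuples instead of actually sending RPCs) is a hypothetical stand-in for the real code in the commit:

```python
# Hypothetical stand-in for the real server class; RPC sends are modeled
# as returned tuples so the sketch stays self-contained.
class Candidate:
    def __init__(self, name, all_server_names):
        self.name = name
        self.all_server_names = all_server_names
        self.current_term = 0
        self.state = "follower"
        # One entry per server; flips to True when that server votes for us.
        self.votes = {n: False for n in all_server_names}

    def start_election(self):
        self.current_term += 1        # increment the term
        self.state = "candidate"      # transition to candidate state
        self.votes[self.name] = True  # vote for ourselves
        # Ask everyone else for their vote.
        return [("can_I_count_on_your_vote_in_term", self.current_term, other)
                for other in self.all_server_names if other != self.name]
```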
I also changed the timeout, which previously ranged from 5 to 30 seconds, to range between 10 and 18 seconds. That way I knew I’d have more than 5 seconds to start up all of my servers before one timed out, but I also wouldn’t have to stare at my screen for up to 30 seconds waiting for a timeout to happen.
In a “real” implementation, these timeouts would be hundreds of milliseconds, not several seconds. I have things happening very slowly in this implementation so I don’t have to scour copious logs to find an error that I just saw fly by. The point here is to implement and learn, not to actually build a production-ready system for multi-master replication across SVN repositories for Google Code (though apparently they used Paxos for that).
So I made up this new call, `can_I_count_on_your_vote_in_term`. Servers need to be able to respond to that. They also need to be able to interpret a response to that. So we add two new items to our response conditional:
If a server receives a request for a vote (line 267), it looks at the term for which the other server is requesting a vote (line 269). First, it determines whether the requesting server is out of date by ensuring that the term it proposes isn’t one that has already passed.
It also makes sure that the server requesting the vote has a log at least as up to date as the voting server’s log. As per the paper:
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
So, we check that the latest term in the candidate’s log is at least as high as the latest term in the voting server’s log (line 270), and then that the candidate’s log is at least as long as the voting server’s log (line 271). (Since the indices increment for every line in the log, we can use the index as an indicator of the size of the log).
Finally, the voting server determines whether it has voted yet in this term (line 272). Each server only votes for one server per term: the first one to contact it, or itself, if it’s the one starting the election. If it hasn’t voted yet, it votes for the server that requested the vote and sets a boolean to remember that it has voted this term (line 275). Anyone else who contacts it for this term will now get a “Sorry, already voted” response (line 273).
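Pulling those checks into one place, the vote response might look like the sketch below. The dictionary keys (`current_term`, `last_log_term`, `last_log_index`, `already_voted`) are hypothetical names for the state described above:

```python
def log_is_at_least_as_up_to_date(candidate_last_log_term, candidate_last_log_index,
                                  voter_last_log_term, voter_last_log_index):
    # The paper's comparison: a later last-entry term wins outright; a tie
    # on term falls back to log length, which the index stands in for.
    if candidate_last_log_term != voter_last_log_term:
        return candidate_last_log_term > voter_last_log_term
    return candidate_last_log_index >= voter_last_log_index

def respond_to_vote_request(voter, candidate_term,
                            candidate_last_log_term, candidate_last_log_index):
    # voter is a dict standing in for the voting server's state.
    if candidate_term <= voter["current_term"]:
        return "Sorry, that term has already passed"
    if not log_is_at_least_as_up_to_date(candidate_last_log_term,
                                         candidate_last_log_index,
                                         voter["last_log_term"],
                                         voter["last_log_index"]):
        return "Sorry, your log is out of date"
    if voter["already_voted"]:
        return "Sorry, already voted"
    voter["already_voted"] = True
    return "You can count on my vote"
```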
Later, I remove `already_voted` from the server altogether by incrementing each server’s term at the time that it votes for another server. That’s what Raft prescribes, and it results in any other servers requesting votes for this term getting denied, because they’re no longer requesting votes for a term greater than the one the voting server is currently in. You can see that change in this commit. The juicy addition:
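The effect of that change, sketched with the same hypothetical dictionary shape: once voting advances the voter’s term, the term check alone rejects any later request for the same term, and `already_voted` has nothing left to do.

```python
def respond_without_already_voted(voter, candidate_term):
    # Hypothetical sketch: voter is a dict standing in for the server.
    if candidate_term <= voter["current_term"]:
        return "Sorry, that term has already passed"
    # Voting advances the voter's term, which is what makes the check
    # above reject any second candidate asking about this same term.
    voter["current_term"] = candidate_term
    return "You can count on my vote"
```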
To make it less onerous to include the candidate term, log term, and log index in the request vote call, I extracted the stringification and hydration of that call into its own class similar to
When a server that requested votes receives those votes (line 279), it updates its dictionary where it keeps track of who voted for it (line 280). This implementation looks similar to the one we used for determining when to commit entries to the state machine:
Every vote triggers a check for whether a majority of servers have voted for this one (line 133), which we get by comparing the number of keys that point to ‘True’ values with the number of keys that point to ‘False’ values. Is it super efficient? No. Am I super concerned about the efficiency of loop-filtering over a dictionary with 5 key-value pairs? Also no.
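That check, sketched as a standalone function over the hypothetical votes dictionary:

```python
def has_majority_of_votes(votes):
    # votes maps each server name to True (voted for us) or False.
    # Loop-filtering a five-entry dictionary is not efficient, and that's fine.
    yes = len([voted for voted in votes.values() if voted])
    no = len([voted for voted in votes.values() if not voted])
    return yes > no
```

Comparing the True count to the False count (rather than to half the cluster size) works here because every server in the cluster has an entry in the dictionary.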
The server wins! What do we do next (besides, of course, issue an appropriately courteous and triumphant public statement)?
Here’s what we do on a newly-elected server:
- Update the state machine to reflect the logs (line 135). We haven’t been doing that continuously because followers don’t access their state machines (as discussed over here), but now that this server is a leader, that state machine needs to be ready to respond to client requests.
- Set this server to be the leader (line 136). Now, anywhere that behavior is conditional on leadership, it will know what it is.
- Send out an `append_entries` heartbeat to ~~assert dominance~~ let the other servers know that they have a leader they can count on (line 138). This will prevent them from starting new elections.
- Reset the voting dictionary so that everything is false (lines 140-142). That way, if this server somehow loses leadership and then starts a new election, its dictionary will be ready.
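The four steps above, sketched as one function over a hypothetical server dictionary. The key names, the tuple heartbeats, and the assumption that the state machine is a simple key-value store are my inventions for illustration, not the commit’s:

```python
def win_election(server):
    # 1. Replay the log into the state machine so this server can answer
    #    client reads (followers normally skip this step).
    for key, value in server["log"]:
        server["state_machine"][key] = value
    # 2. Record that this server is now the leader.
    server["leader"] = True
    # 3. Build a heartbeat (an empty append_entries) for every other server.
    heartbeats = [(other, "append_entries", []) for other in server["others"]]
    # 4. Reset the votes dictionary, ready for any future election.
    server["votes"] = {name: False for name in server["votes"]}
    return heartbeats
```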
At this point, we can get to step 5 in our manual tests above. But when we restart our former leader, things go off the rails. That’s because that server doesn’t have the updated term, and the other servers aren’t checking for an updated term. If that server tried to start an election, it would get “Sorry, already voted” from everyone. Meanwhile, if a server with an outdated term purported to be leader (perhaps upon reconnecting after a partition), it could successfully send `append_entries` calls, because the followers don’t check the term before naively accepting those commands.
We’re about to fix those things, but this post is getting long, so we’ll put that in the next post.
If you liked this piece, you might also like:
The teaching tag, if you’re learning something from this series and want to see more about how I teach
How to Jump-Start a New Programming Language, or maybe, even, gain a more concrete mental model of the one you already use!
How does git detect renames?—This piece is about how git detects renames, but it’s also about how to approach questions and novel problems in programming in general.