Raft 11: Leader Elections, Part 2

Reading Time: 7 minutes

In December, I took a course in which we attempted to implement the Raft distributed consensus algorithm from this paper. Parts 1-5 of this series share insights from the course. From now on, I’m guiding you through my continued work implementing Raft “for fun” (I know. I don’t understand me, either).

Here’s where you can see all the posts so far.

In the previous post, we implemented leader elections. When we left off, though, we weren't catching servers up on the current term. Followers also weren't checking whether a purported leader knew the current term before accepting its append_entries commands. Let's fix that.

In Raft, state belongs in the messages.

A system that is resilient to servers going down cannot rely on the state of those servers as a source of truth. So servers need to include all relevant context in their messages to each other.1 In this case, the current term is relevant context. So the append_entries calls need to include it, as per this part of the Raft paper:

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

We add the term to the append_entries commands in this commit.

Conveniently, we extracted the AppendEntries call to its own object that helpfully converts all the different components to and from the string format for us, so adding the current term to the call isn’t too onerous:

[Screenshot: adding the current term to the AppendEntries call object]
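The commit has the real diff; as a rough sketch of the idea, an AppendEntries call object that carries the leader's term might look something like this in Python. The class name, field names, and wire format here are my own stand-ins for illustration, not the post's actual implementation:

    class AppendEntriesCall:
        def __init__(self, term, entries, previous_index, previous_term):
            self.term = term                      # the leader's current term
            self.entries = entries                # list of log entry strings
            self.previous_index = previous_index  # index of the entry before the new ones
            self.previous_term = previous_term    # term of that preceding entry

        def to_message(self):
            # Flatten everything into one string so it can travel over the wire.
            entries = "|".join(self.entries)
            return f"append_entries {self.term} {self.previous_index} {self.previous_term} [{entries}]"

        @classmethod
        def from_message(cls, message):
            # Inverse of to_message: rebuild the call from the wire format.
            _, term, previous_index, previous_term, raw = message.split(" ", 4)
            entries = [entry for entry in raw.strip("[]").split("|") if entry]
            return cls(int(term), entries, int(previous_index), int(previous_term))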

We then check this latest term when a server receives an append_entries call:

[Screenshot: checking the incoming term when a server receives an append_entries call]
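In rough sketch form (the attribute and method names here are my guesses on a hypothetical Server class, not necessarily the post's), the check rejects the call if its term is behind ours, steps down if it's ahead, and only then does the usual log handling:

    # Assumed to be a method on a hypothetical Server class that already has
    # current_term, become_follower(), and accept_entries() defined.
    def respond_to_append_entries(self, call):
        if call.term < self.current_term:
            # The sender is behind the times: reject, and report our term so it can step down.
            return f"append_entries_response False {self.current_term}"
        else:
            if call.term > self.current_term:
                # The sender knows about a later term: adopt it and revert to follower,
                # which also stops any heartbeats we were sending as a stale leader.
                self.current_term = call.term
                self.become_follower()
            # Everything we did before: reset the election timer, check the logs,
            # append any new entries, and respond.
            return self.accept_entries(call)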

Everything we previously did upon receiving an append_entries call—canceling the election timer, checking that the logs are up to date, appending any entries provided by the call—now happens only in an else block below that term check (line 184 in the screenshot).

I also added a heartbeat timer here. This is because only the leader should send heartbeats. If a purported leader or candidate server receives an append_entries call from a server that turns out to have a later term than it does, it reverts to the follower state. So if it was previously sending heartbeats, it now needs to stop.

I start that up when the server becomes the leader:

[Screenshot: starting the heartbeat timer when a server becomes leader]
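Sketched with assumed names (the real code is in the screenshot and commit): when a server wins an election, it starts a repeating timer that sends empty append_entries calls as heartbeats.

    import threading

    HEARTBEAT_INTERVAL = 2  # seconds; an illustrative value, not the post's actual number

    # Methods on a hypothetical Server class.
    def become_leader(self):
        self.leader = True
        self.election_timer.cancel()  # a leader no longer waits to hear from one
        self.send_heartbeats()

    def send_heartbeats(self):
        for address in self.other_server_addresses:
            self.send(address, self.empty_append_entries_message())
        # Keep a handle on the timer so we can cancel it if we step down.
        self.heartbeat_timer = threading.Timer(HEARTBEAT_INTERVAL, self.send_heartbeats)
        self.heartbeat_timer.start()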

and I cancel it in the event of an up-to-date append_entries call:

[Screenshot: cancelling the heartbeat timer on an up-to-date append_entries call]
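And the corresponding cancellation, sketched with the same assumed names:

    # Method on the same hypothetical Server class.
    def become_follower(self):
        self.leader = False
        if self.heartbeat_timer is not None:
            # If we had been sending heartbeats as a (possibly stale) leader, stop.
            self.heartbeat_timer.cancel()
            self.heartbeat_timer = None
        self.reset_election_timer()  # go back to listening for a leader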

At this point, we can get through our ten-step manual test list…most of the time. For reference, here’s that list:

  1. Fire up a leader and two follower servers, with the leader having a log with multiple commands and the followers having empty logs.
  2. Once the leader has started issuing heartbeats to the followers, shut it down.
  3. Ensure that the two remaining followers choose a leader between themselves.
  4. Start up a client and issue some read commands to that leader. Ensure that its state matches the totality of its current logs.
  5. Issue some write commands to that leader. Ensure that it replicates those commands on the follower.
  6. Restart the former leader server, this time as a follower. Ensure that the new leader catches up the restarted server on the commands issued while it was down.
  7. Shut down the current leader server.
  8. Ensure that the remaining servers again choose a leader.
  9. Write some commands to this leader, and ensure they are replicated.
  10. Again restart the downed server and ensure the new leader brings it back up to date.

We’ve almost got the pieces in place. There’s just one more thing.

What if two servers call an election simultaneously, each garner half of the votes, and no one is elected? They’ll ultimately hit their election timeouts and start new elections. But if their timeouts are the same, this could happen indefinitely.

Sounds like an edge case, right? Well, it's an easy one to hit when you're running a three-server cluster and you bring down the leader. You then have two follower servers, both of whom have timeouts that are some integer between 8 and 18 (inclusive). That's 11 possible values, so there's a 1 in 11 chance that they have the same timeout, which translates to a better than 99% chance that this will happen at some point within about 50 trials. It happened to me within 10, because I'm just lucky that way 🙃.
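Here's the back-of-the-envelope arithmetic behind those numbers (my own quick check, not code from the project):

    # Two followers each keep a fixed timeout drawn from 8 through 18 inclusive:
    # 11 possible values, so a 1-in-11 chance per cluster setup that they tie.
    p_tie = 1 / 11

    def chance_of_at_least_one_tie(trials):
        # Probability the tie shows up at least once across independent trials.
        return 1 - (1 - p_tie) ** trials

    print(chance_of_at_least_one_tie(10))  # ≈ 0.61
    print(chance_of_at_least_one_tie(50))  # ≈ 0.99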

Since both servers vote for themselves first, it’s not even a race to see who can get their vote requests to another follower the fastest: they’ll keep voting for themselves, declining to vote for the other server, timing out, and starting over.

So what’s the solution? Reset the election timeouts at the start of every election:

Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. 

We implement that in this commit:

[Screenshot: resetting the randomized election timeout at the start of each election]
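A sketch of the fix, once more with assumed names: each time a server starts an election, it picks a fresh random timeout rather than reusing the one it started life with.

    import random
    import threading

    # Methods on the same hypothetical Server class.
    def start_election(self):
        self.current_term += 1
        self.votes_received = {self.address}  # vote for ourselves first
        self.reset_election_timer()           # new randomized timeout for this election
        for address in self.other_server_addresses:
            self.send(address, self.request_vote_message())

    def reset_election_timer(self):
        if self.election_timer is not None:
            self.election_timer.cancel()
        timeout = random.randint(8, 18)       # the range mentioned above
        self.election_timer = threading.Timer(timeout, self.start_election)
        self.election_timer.start()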

I also removed the “candidate” attribute here because, it turns out, I wasn’t using it for anything. Instead, servers exist in either the leader or the follower state, and they communicate their candidacy by sending out RequestVote RPCs as followers.

This concludes the implementation of all the big pieces. We have a couple of small changes remaining, which I’ll show you in the next piece, and finally we’ll wrap up this whole thing 🙂.

1. There’s a lesson here, by the way, for working in teams. Imagine, for a moment, that the Raft servers are all team members, and the log is their collaborative project. Servers (team members) might join the cluster (get hired), leave the cluster (configuration changes), or temporarily become unavailable (illness, vacation, competing responsibilities) at any time. Under these circumstances, we know that Raft needs to store context, not with any one server, but rather in the shared project itself (the log) or in the servers’ messages to one another (the RPCs). So why are we perfectly content on teams to have one person “own” something that they don’t communicate, or communicate inadequately? Furthermore, on software teams, teammates usually aren’t evaluated on whether they have communicated—indeed, communication is actively de-prioritized, and developers learn that they receive accolades for cranking out features, not for making their work discoverable to anyone else. I have written some pieces about innovative context-sharing for software teams, but ultimately, will anyone use these techniques before they’re professionally incentivized and rewarded for doing so?

If you liked this piece, you might also like:

The debugging posts (a toolkit to help you respond to problems in software)

The Listening Series (Prepare to question much of what you know about how to be good at your job.)

Skills for working on distributed teams (including communication skills that will make your job easier)
