Distributed consensus is one of the hardest problems in distributed computing, and many algorithms have been designed to solve it.
In this article, we are going to talk about the Raft distributed consensus algorithm, one of the most popular modern consensus algorithms.
What is Distributed Consensus?
Before we begin the discussion, let’s try to understand what distributed consensus means. When we design distributed systems, we build software applications that are distributed across multiple nodes. These runtimes, running on different nodes, are often required to collaboratively make decisions and share data.
So, in distributed consensus, a collection of nodes works as a coherent group that can survive the failure of some of its members.
Raft is a modern, reliable and relatively less complicated distributed consensus algorithm that is frequently used in modern software solutions such as Consul, etcd, RabbitMQ and so on.
Before the inception of Raft, Paxos (designed by Leslie Lamport) was the most popular algorithm designed for distributed consensus, and many consensus algorithms were either based on Paxos or inspired by it. However, Paxos is known to be a complex algorithm that is hard to implement while meeting its performance and correctness requirements.
One of the main motivations behind Raft is to design a distributed consensus algorithm that enhances understandability without compromising performance and correctness. The Raft authors (Diego Ongaro and John Ousterhout) argue that Raft is superior to the Multi-Paxos protocol because it enhances understandability while maintaining performance and correctness.
Before diving deep, let’s look at some of the fundamentals of the Raft consensus algorithm.
Raft is based on a leader-driven consensus model, where a distinguished leader is elected and is fully responsible for managing the cluster. The leader manages a replicated log across all the nodes that form the Raft cluster.
To understand the basic terminology of the Raft algorithm, let’s look at a consensus scenario between five nodes and see how Raft implements it. As shown in Figure 1, a leader (S1) for the cluster is elected during startup and serves all the commands/requests coming from the clients. All nodes of the Raft cluster maintain a distributed log (replicated log) to store and commit commands (or log entries) issued by the clients. The leader accepts log entries coming from the clients and replicates them across all the followers (S2, S3, S4, S5) that are part of the Raft cluster.
In a Raft cluster, a minimum number of nodes is required to provide the expected consensus guarantees. This is known as the quorum. The minimum number of votes required to perform an operation in a Raft cluster is ⌊N/2⌋ + 1, where N is the total number of members in the group. Therefore, in our five-node example, at least 3 nodes must be available to retain the consensus guarantees.
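The quorum size follows directly from the cluster size. A minimal sketch (the function name is my own, not from any Raft implementation):

```python
def quorum_size(n_members: int) -> int:
    """Minimum number of votes needed for a majority in an n-member cluster."""
    return n_members // 2 + 1

# A five-node cluster has a quorum of 3, so it tolerates two failures.
print(quorum_size(5))  # → 3
# Note that a four-node cluster also needs 3 votes, so it only tolerates one failure.
print(quorum_size(4))  # → 3
```

This is why Raft clusters are usually deployed with an odd number of nodes: an even-sized cluster pays for an extra node without gaining any extra failure tolerance.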
If a quorum of nodes is unavailable for any reason, the cluster becomes unavailable and no new logs can be committed.
Raft addresses the distributed consensus problem by decomposing it into three main sub-problems: leader election, log replication, and safety.
Leader election is the process of electing a new leader for the Raft cluster. When a new Raft cluster starts, or when the current leader becomes unavailable, a new leader is elected through an election among all the member nodes of the cluster. At a given instant, each node of the Raft cluster is in one of the following states (shown in Figure 2): Follower, Candidate, or Leader.
The leader is fully responsible for managing the Raft cluster and handles all incoming client commands. The followers are passive: they merely respond to requests from the leader or candidates. When there is no leader in the Raft cluster, any follower can become a candidate, which requests votes from the rest of the members in the cluster.
In the Raft algorithm, time is divided into arbitrary intervals known as terms. As shown in Figure 3, each term begins with an election, and the elected leader serves until the end of the term unless the leader node fails.
In certain cases, two candidates may receive an equal number of votes during an election (a split vote scenario); a new election then needs to be started immediately afterward. Terms are represented as numbers that increase monotonically over time, and term values are exchanged whenever nodes communicate. All the nodes in a Raft cluster communicate using remote procedure calls (RPCs).
Starting an election
Raft uses a heartbeat-based RPC mechanism to detect when to start a new election. During normal operation, the leader periodically sends heartbeat messages to all the available followers (Figure 4). A node starts up in the follower state and remains in it as long as it receives periodic heartbeats from the current leader.
These heartbeats are based on the same message type that the leader uses to send new log entries to the follower. So, the leader uses AppendEntries RPC with no log entries as the heartbeat message.
If a follower does not receive a heartbeat message within a certain period, an election timeout fires and triggers a new election to choose a leader.
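The timeout-detection logic can be sketched as follows (the class and method names are my own; a real implementation would run this check on a timer thread or event loop):

```python
import time

class Follower:
    def __init__(self, election_timeout: float):
        self.election_timeout = election_timeout  # seconds
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self) -> None:
        # Called whenever an (empty) AppendEntries RPC arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def should_start_election(self, now: float) -> bool:
        # No heartbeat within the timeout window → become a candidate.
        return now - self.last_heartbeat >= self.election_timeout

f = Follower(election_timeout=0.15)
print(f.should_start_election(f.last_heartbeat + 0.05))  # → False (heartbeat still fresh)
print(f.should_start_election(f.last_heartbeat + 0.20))  # → True (timeout elapsed)
```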
When the follower reaches its election timeout, it starts the election procedure by:
- Incrementing its current term.
- Voting for itself and sending a RequestVote RPC to all the other nodes in the cluster.
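The two steps above can be sketched as a toy model (no real RPC layer; the outgoing RequestVote messages are represented as plain tuples):

```python
class Candidate:
    def __init__(self, node_id: str, current_term: int):
        self.node_id = node_id
        self.current_term = current_term
        self.votes = set()

    def start_election(self, peers):
        self.current_term += 1        # step 1: increment the current term
        self.votes = {self.node_id}   # step 2: vote for itself...
        # ...and send RequestVote RPCs (term, candidate id) to every peer.
        return [("RequestVote", self.current_term, self.node_id, peer)
                for peer in peers]

c = Candidate("S1", current_term=1)
msgs = c.start_election(["S2", "S3", "S4", "S5"])
print(c.current_term)  # → 2
print(len(msgs))       # → 4
```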
As shown in Figure 5, in our example the follower S1 reaches the election timeout first and becomes a candidate. It then requests the other nodes to vote for it as the leader.
Based on the responses that the candidate receives from the other nodes in the cluster, the election can have three outcomes.
1. Candidate S1 wins the election if a majority of the nodes grant their votes in response to the RequestVote request.
2. While S1 is waiting, it may receive an AppendEntries RPC from another node claiming to be the leader. If the term in that AppendEntries RPC is at least as large as S1’s current term, then S1 gives up its candidacy and accepts the other node as the legitimate leader.
3. Split vote scenario: when multiple followers become candidates at the same time, it is possible that no candidate obtains a majority. This is known as a split vote. In this case, each candidate times out and a new election is triggered.
To minimize split vote situations, Raft uses a randomized election timeout mechanism, allocating a randomized timeout value to each node.
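Randomized timeouts are typically drawn uniformly from a fixed interval (the Raft paper suggests 150–300 ms; the function below is an illustrative sketch):

```python
import random

def random_election_timeout(low_ms: float = 150, high_ms: float = 300) -> float:
    # Each node picks its own timeout uniformly at random. In most elections
    # a single node times out first, wins before the others' timeouts fire,
    # and suppresses them with its heartbeats.
    return random.uniform(low_ms, high_ms)

t = random_election_timeout()
print(150 <= t <= 300)  # → True
```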
Once a node becomes the leader, it can receive commands/log entries from the clients. The log entries are sent using the AppendEntries RPC (Figure 6).
Upon receipt of a command from a client, the leader assigns a term and an index to the command. The leader then tries to replicate the command across a majority of the nodes in the cluster. If the replication is successful, the command is committed to the cluster and a response is sent back to the client.
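Assigning a term and an index is just a matter of stamping the entry as it is appended. A minimal sketch (class and method names are my own):

```python
class Leader:
    def __init__(self, term: int):
        self.term = term
        self.log = []  # list of (term, command) entries

    def receive_command(self, command) -> int:
        # Stamp the command with the leader's current term and append it.
        self.log.append((self.term, command))
        # Raft log indices conventionally start at 1.
        return len(self.log)

leader = Leader(term=3)
idx = leader.receive_command(("x", 8))
print(idx)                  # → 1
print(leader.log[idx - 1])  # → (3, ('x', 8))
```

After this local append, the leader would fan the entry out to followers via AppendEntries RPCs, as described next.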
Figure 7 illustrates a sample distributed log of a Raft cluster with five nodes.
- Each row of the diagram represents the log of a given node. The log is indexed to uniquely identify each command that the leader received and replicated across the rest of the nodes.
- Each command is represented using the syntax (x ← value) along with the associated term of the leader (here, each term is represented with a different color for clarity).
- The term numbers are used to identify inconsistencies between logs.
Let’s see how log replication works when a new command (y ← 9) is sent from a client to the leader. As shown in Figure 8, at the time the new command is received, all nodes are in a consistent state: they have the same log entries and a commit index of 2.
- The leader appends the command to the log and broadcasts AppendEntries RPC with that command.
- Each node appends the entries to its local log and replies with success.
- When a majority of the follower nodes have successfully appended the log entry locally, the leader commits the command and sends a success response back to the client.
- When the leader commits the log entry, it also updates the commit index (3 in this example) and the next AppendEntries broadcast message replicates the updated commit index to all the follower nodes.
- When an entry is committed by the leader, it also commits all the entries prior to the current log index.
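The commit rule from the steps above can be sketched as follows: an entry is committed once it is replicated on a majority, and committing it implicitly commits everything before it (the function name and the matchIndex-as-dict representation are my own):

```python
def highest_committable_index(match_index: dict, n_members: int) -> int:
    """match_index maps each node to the highest log index replicated on it
    (including the leader itself). Returns the largest index replicated on a
    majority; committing it implicitly commits all earlier entries too."""
    quorum = n_members // 2 + 1
    # Sort descending; the quorum-th highest value is present on >= quorum nodes.
    return sorted(match_index.values(), reverse=True)[quorum - 1]

# Leader S1 and followers S2, S3 hold the new entry at index 3; S4, S5 lag behind.
match = {"S1": 3, "S2": 3, "S3": 3, "S4": 2, "S5": 2}
print(highest_committable_index(match, 5))  # → 3
```

A real implementation adds one more check here, discussed under safety below: the entry at the returned index must belong to the leader’s current term before the commit index may advance.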
If some nodes are unavailable to receive log entries, or messages are lost in flight, inconsistencies can appear in the logs. The leader is responsible for reconciling such inconsistencies. Each AppendEntries RPC a follower receives also contains the term and log index of the previous log entry. If this doesn’t match the last entry of the follower’s log, the follower sends an unsuccessful response, so the leader knows that this particular node has inconsistencies in its log.
The leader resolves log inconsistencies by keeping track of a nextIndex for each follower. When a given follower’s log is inconsistent (i.e., when it sends an unsuccessful response), the leader decrements that follower’s nextIndex value and retries the AppendEntries RPC. This procedure continues until the follower’s log becomes consistent with the leader’s. Also, each node keeps its local log in persistent storage.
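The backtracking procedure can be sketched as a simplified in-memory model (real implementations do this over retried RPCs, persist the log, and also handle term checks; the function name is my own):

```python
def reconcile(leader_log, follower_log):
    """Walk nextIndex backwards until the consistency check passes, then
    overwrite the follower's conflicting suffix with the leader's entries.
    Logs are lists of (term, command); list position i is log index i + 1."""
    next_index = len(leader_log) + 1
    while next_index > 1:
        prev = next_index - 1
        # AppendEntries consistency check: does the follower hold the same
        # (term, command) entry as the leader at index `prev`?
        if prev <= len(follower_log) and follower_log[prev - 1] == leader_log[prev - 1]:
            break
        next_index -= 1  # mismatch → decrement nextIndex and retry
    return follower_log[: next_index - 1] + leader_log[next_index - 1:]

leader = [(1, "x=1"), (1, "y=2"), (2, "x=3")]
follower = [(1, "x=1"), (1, "z=9")]  # diverged at index 2
print(reconcile(leader, follower))   # → [(1, 'x=1'), (1, 'y=2'), (2, 'x=3')]
```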
To ensure that the above leader election and log replication work under all conditions, the Raft algorithm is augmented with some safety conditions.
Election Restriction — “Extra condition on leadership”
The election restriction condition mandates that a candidate’s log must be at least as up-to-date as each voting node’s log; a follower whose log is more up-to-date will not vote for the candidate.
Since a candidate must obtain votes from a majority of the cluster, and every committed entry is present on at least one server in any majority, a candidate whose log is at least as up-to-date as the logs in that majority is guaranteed to hold all the committed entries.
Therefore, the RequestVote RPC includes information about the candidate’s log, and a voter denies its vote if its own log is more up-to-date than the candidate’s.
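The "at least as up-to-date" comparison is defined over the last log entries: the log whose last entry has the higher term is more up-to-date; if the last terms are equal, the longer log is more up-to-date. A sketch (function name is my own):

```python
def candidate_is_up_to_date(cand_last_term: int, cand_last_index: int,
                            voter_last_term: int, voter_last_index: int) -> bool:
    """True if the voter should consider the candidate's log at least as
    up-to-date as its own, and therefore may grant its vote."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term  # higher last term wins
    return cand_last_index >= voter_last_index   # tie on term: longer log wins

print(candidate_is_up_to_date(3, 5, 2, 9))  # → True: newer last term beats a longer log
print(candidate_is_up_to_date(2, 4, 2, 6))  # → False: same term but shorter log
```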
Committing entries from previous terms — “Extra condition on commitment”
This condition restricts the leader from directly committing entries from a previous term: a leader may only commit an entry from a previous term once it has successfully replicated an entry from its own current term.
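In other words, the leader never advances the commit index by counting replicas of an old-term entry directly; it commits a current-term entry, which then commits everything before it. The check itself is a one-liner (sketch, my own naming):

```python
def may_advance_commit_index(entry_term: int, current_term: int) -> bool:
    # A leader only advances the commit index via entries stamped with its
    # own term; earlier entries are then committed indirectly.
    return entry_term == current_term

print(may_advance_commit_index(entry_term=2, current_term=3))  # → False (commit indirectly)
print(may_advance_commit_index(entry_term=3, current_term=3))  # → True
```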
The Raft algorithm provides the following guarantees for distributed consensus.
- Election Safety: at most one leader can be elected in a given term.
- Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries.
- Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up to and including the given index.
- Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
- State Machine Safety: if a node has applied a log entry at a given index to its state machine, no other node will ever apply a different log entry for the same index.
In this post, we discussed some of the key fundamentals of the Raft consensus algorithm. There are Raft implementations in a multitude of programming languages, and you can find more details of those implementations here. If you want to dive deeper into the Raft protocol, I highly recommend the following resources to master the algorithm.
- In Search of an Understandable Consensus Algorithm — the extended version of the original Raft paper by Diego Ongaro and John Ousterhout
- Raft simulation — Try this out to get more familiarized with the Raft protocol
- Raft Refloated: Do We Have Consensus? by Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft