In distributed systems, it is often of interest to find a distinguished node or process. For example, consensus algorithms like Raft use leader election as a step in maintaining a distributed, consistent log.
This post describes my prototype implementation of the Hirschberg and Sinclair algorithm in Scala, using Finagle for RPCs. I built it during hack week at Spotify to better understand distributed systems.
It was particularly interesting in comparison to other ways I’ve done distributed computation before, and I hope to write a comparison of these different modes of distributed computing in the future.
The implementation is definitely far from production ready, but I did learn some lessons in distributed computing, so I thought I’d share a few things I learned on both the theory and the implementation side.
There are a few variations on what even constitutes a leader election algorithm. The algorithm we implement does the following: given a ring of nodes, have exactly one node decide that it is the leader. Other algorithms could, for example, have all nodes learn who the current leader is, not assume a ring, etc. The Hirschberg and Sinclair (HS) algorithm is efficient in that it minimizes the amount of data passed between nodes during the election.
We assume that each node has a unique integer id associated with it, and the HS algorithm gives each node a set of instructions to determine if it is the leader. Without something unique like an id, there is no algorithm that can decide who the leader is. The leader, by definition, is the node with the largest id.
The difficulty is that every node runs the same set of instructions to determine if it is the leader, and we want to minimize the number of network calls needed for most nodes to realize they are not the leader and for one node to realize that it is.
The algorithm is as follows:
(1) The algorithm is iterative, and the variable tracking the current step is called the “phase”; in the base case, phase = 1. Each node tells its left and right neighbors its id, and if the node’s id is larger than a neighbor’s, that neighbor passes the information back to the node.
(2) If a node still thinks it might be the leader after phase 1, it proceeds to phase 2, where it sends its id outward to its left and right neighbors, and they pass it on to their respective left or right neighbors. If the original node’s id is larger than those of the second-degree neighbors, the id is sent back inward.
(3) The phase keeps increasing until either the node realizes it is not the leader, or, during an outgoing pass, the node’s id travels all the way around the ring, so the node ends up comparing its id to itself and declares itself the leader. (A single-process sketch of this bookkeeping follows below.)
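Before getting to the networked version, the phase-and-distance bookkeeping is easier to see in a single-process sketch. This is a hedged condensation, not code from the repo (names like `HsSketch` and `becomesLeader` are mine); it checks each phase’s probes synchronously over an in-memory ring:

```scala
object HsSketch {

  // True if every node within `dist` hops of `start`, in direction `dir`
  // (+1 for right, -1 for left), has a smaller id than the node at `start`.
  private def survives(ids: Vector[Int], start: Int, dir: Int, dist: Int): Boolean = {
    val n = ids.length
    (1 to dist).forall(k => ids(((start + dir * k) % n + n) % n) < ids(start))
  }

  // Phase p probes 2^(p - 1) hops in both directions. Once a node has
  // survived every distance below the ring size, its next probe would wrap
  // all the way around, meet its own id, and make it the leader.
  def becomesLeader(ids: Vector[Int], start: Int): Boolean =
    Iterator.iterate(1)(_ * 2)
      .takeWhile(_ < ids.length)
      .forall(dist => Seq(-1, 1).forall(dir => survives(ids, start, dir, dist)))
}

// In the ring Vector(17, 42, 84, 3), only position 2 (id 84) wins:
// (0 until 4).filter(i => HsSketch.becomesLeader(Vector(17, 42, 84, 3), i)) // Vector(2)
```

The doubling of the probe distance is what keeps the total message count at O(n log n), instead of the O(n²) of a naive scheme that extends one hop at a time.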
This is visualized in the diagrams below. The first diagram shows how node 42, which is not the leader, realizes this after phase 2 of the iterations.
Once node 17 passes along the message asking node 84 whether node 42 is the leader, the process stops, since 84 is larger than 42.
Let’s now see what happens in phases 1, 3, and 5 as the node with id 84 realizes it is the leader.
Finagle was a convenient choice for doing RPCs in Scala to quickly prototype the HS algorithm. The RPCs are done asynchronously, in contrast to the usual textbook descriptions of the HS algorithm, which are based on synchronous message passing. In the prototype, all “nodes” are independent processes on localhost, each listening on a different port. Each node has client code and server code.
The client code asks the left and right neighboring servers whether its node is the leader by passing along its id and the current phase. The phase is an integer specifying how far out in the ring to go for the current iteration. Before diving into more details, here is a diagram of how a node asks if it is the leader.
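Concretely, each ask needs to carry a bit more than the id and phase once messages start being relayed. As a hedged sketch (the field names are mine, and the repo’s wire format may differ), the information in each probe looks like:

```scala
// The data each relayed probe carries around the ring. In the server and
// client sketches below, these fields travel as query parameters on the RPC.
final case class Probe(
  candidateId: Int,   // id of the node asking "am I the leader?"
  phase: Int,         // current iteration; the probe travels 2^(phase - 1) hops
  hops: Int,          // hops left on the outward leg before turning around
  rightward: Boolean, // which neighbor the next relay goes to
  outward: Boolean    // moving away from the candidate, or heading back in
)
```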
The server handles RPCs that are outgoing or incoming, in the sense of whether the message is moving outward around the ring or heading back inward.
Outgoing requests decide whether we should continue sending RPCs outward by comparing the id of the initial asking node to the server’s id. They also determine whether the current outward leg is over and whether to start passing RPCs back inward. In the event the asking node’s id matches the server’s own id, the server stores the information that that node is the leader.
Inward RPCs just make more RPCs inward until we are back at the node that made the initial outward request, and in that case the RPC stores information telling the client to go on to the next iteration.
Here is a sketch of the main part of the server’s RPC methods.
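It is a minimal reconstruction rather than the repo’s actual code: it assumes a single HTTP endpoint with the probe fields passed as query parameters, and in-memory fields (`leaderId`, `returned`) stand in for the on-disk state the prototype used. `ElectionServer` and the method names are mine.

```scala
import java.util.concurrent.atomic.AtomicInteger

import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.Future

class ElectionServer(myId: Int, leftAddr: String, rightAddr: String) {

  @volatile var leaderId: Option[Int] = None // set when a probe wraps the whole ring
  val returned = new AtomicInteger(0)        // inward probes that made it home this phase

  private val left  = Http.newService(leftAddr)
  private val right = Http.newService(rightAddr)
  private def toward(rightward: Boolean) = if (rightward) right else left

  val service: Service[Request, Response] = Service.mk { (req: Request) =>
    val candidate = req.getIntParam("candidateId")
    val phase     = req.getIntParam("phase")
    val hops      = req.getIntParam("hops")
    val rightward = req.getBooleanParam("rightward")
    val outward   = req.getBooleanParam("outward")
    val done      = Future.value(Response())

    if (!outward) {
      // Inward leg: relay until we are back at the candidate's own server,
      // then record that this direction survived the phase.
      if (candidate == myId) { returned.incrementAndGet(); done }
      else relay(candidate, phase, hops, rightward, outward = false)
    } else if (candidate == myId) {
      leaderId = Some(myId); done // our own id came all the way around: leader
    } else if (candidate < myId) {
      done                        // a larger id lives here, so the probe dies
    } else if (hops > 1) {
      relay(candidate, phase, hops - 1, rightward, outward = true) // keep going out
    } else {
      relay(candidate, phase, hops, !rightward, outward = false)   // turn around
    }
  }

  // Forward the probe one hop in the given direction.
  private def relay(candidate: Int, phase: Int, hops: Int,
                    rightward: Boolean, outward: Boolean): Future[Response] =
    toward(rightward)(Request(
      s"/election?candidateId=$candidate&phase=$phase&hops=$hops" +
        s"&rightward=$rightward&outward=$outward"))
}
```

Each node process would then expose this with something like `Http.serve(s":$port", server.service)`, with `leftAddr` and `rightAddr` pointing at the neighboring processes’ ports.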
The client code starts with phase = 1 and asks the left and right neighboring servers if it is the leader. If it passes that test, it bumps the phase up to 2 so that the left and right servers ask their neighboring servers whether the node is the leader.
This is repeated until either the client realizes it is not the leader or, once the phase is big enough, the corresponding server on the same node as the client asks itself if it is the leader. The server stores information to disk that the client can look up to learn whether it is the leader or whether to advance to the next phase.
Here is a sketch of the main part of the client code.
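The same caveats apply: `ElectionClient` and its helpers are illustrative names, and it reads the in-memory fields on `ElectionServer` where the prototype looked state up on disk.

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.{Await, Future}

class ElectionClient(myId: Int, server: ElectionServer,
                     leftAddr: String, rightAddr: String) {

  private val left  = Http.newService(leftAddr)
  private val right = Http.newService(rightAddr)

  // One outward probe. Since every relay returns its downstream Future, this
  // Future resolves only once the whole out-and-back chain has completed.
  private def probe(svc: Service[Request, Response],
                    rightward: Boolean, phase: Int): Future[Response] = {
    val hops = 1 << (phase - 1)
    svc(Request(s"/election?candidateId=$myId&phase=$phase&hops=$hops" +
      s"&rightward=$rightward&outward=true"))
  }

  @scala.annotation.tailrec
  private def runPhase(phase: Int): Boolean = {
    server.returned.set(0) // in the prototype this bookkeeping lived on disk
    Await.all(probe(left, rightward = false, phase),
              probe(right, rightward = true, phase))
    if (server.leaderId.contains(myId)) true // our id wrapped the ring
    else if (server.returned.get == 2) runPhase(phase + 1) // both sides survived
    else false // a larger id swallowed one of our probes
  }

  /** Returns true if this node wins the election, false once it loses. */
  def run(): Boolean = runPhase(1)
}
```

Whichever node’s `run()` returns true is the elected leader; the synchronous rounds of the textbook algorithm are replaced here by Finagle futures that resolve as each relayed chain completes.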
For the full code, check out the GitHub repo.