add blog post on leaderless consensus

author: Alex Auvolat <alex@adnab.me> 2023-11-30 13:41:38 +0100
committer: Alex Auvolat <alex@adnab.me> 2023-11-30 13:41:38 +0100
commit: c8514a37935492cc3657d317d26c8bf1974de4f6 (patch)
tree: a51592e733c7ea2893c9668481c7cdceccb276cd /content/blog
parent: 16fa02f53a020e59360df688bb3fc7583b52dc92 (diff)
download: garagehq.deuxfleurs.fr-c8514a37935492cc3657d317d26c8bf1974de4f6.tar.gz
garagehq.deuxfleurs.fr-c8514a37935492cc3657d317d26c8bf1974de4f6.zip
2 files changed, 196 insertions, 0 deletions
diff --git a/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf b/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf
new file mode 100644
index 0000000..0e6292e
--- /dev/null
+++ b/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf
diff --git a/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md b/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md
new file mode 100644
index 0000000..1dce0e6
--- /dev/null
+++ b/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md
@@ -0,0 +1,196 @@
++++
+title='Thoughts on "Leaderless Consensus"'
+date=2023-11-30
++++
+
+*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,
+have notoriously unpredictable performance in low-quality networks that suffer from
+latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use
+them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),
+introduces a new category of algorithms that better tolerate the frequent 
+unavailability of a subset of nodes. However, additional research and practical work is required before
+these results can be put into practice. Read for more details.*
+
+<!-- more -->
+
+---
+
+As I have said many times when presenting Garage, we have made a point of not
+using any consensus algorithm in Garage and using only CRDTs, for several
+reasons.  The first, and most important reason, is that all of the consensus
+algorithms that we know of[^1] (in particular Raft, which is very popular in
+distributed databases) suffer from unpredictable performance when nodes or the
+network are unreliable.  Even in relatively stable conditions, Raft-like
+algorithms can still be much slower than CRDTs (as we have shown in some
+[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))
+because they elect a leader node and require all operations to pass through the
+leader, which can become a bottleneck.  Other than performance issues, Raft is
+a complex algorithm and implementing it correctly is a challenging software
+engineering endeavor that we did not wish to undertake, preferring instead
+simplicity as a foundational principle to help us write correct software.
+
+However, writing a distributed system such as Garage can be challenging when
+consensus is not available, as we can only use CRDTs (conflict-free replicated
+data types) in the code, and we cannot rely on state machine replication.  This
+means that the specific semantics of CRDTs have to be taken into account
+everywhere in the code, which is often not a problem but sometimes adds some
+complexity.  More importantly, this means that a whole class of features cannot
+be implemented in Garage, like those that would require some form of locking or
+exclusive access.  In practice, this has been causing us issues on the
+CreateBucket endpoint, which by definition is meant to exclusively associate a
+bucket name to a newly created bucket. In current Garage versions, concurrent
+calls to CreateBucket with the same name may create several buckets and leave
+Garage in an inconsistent state.
+
+This leads naturally to the following question: is it possible to implement a
+consensus algorithm that eschews the shortcomings of Raft-like algorithms in
+unreliable systems?  And in particular, is it possible to implement a consensus
+algorithm that does not elect a leader, and is therefore not sensitive to
+temporary slowdowns or unavailabilities of individual nodes?  A new paper by
+Antoniadis et al., [*Leaderless
+Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)
+[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],
+suggests that the answer is *yes*.  However, as with all new research, putting
+it into practice will take some time and a lot of work.  I will discuss in this
+article practical questions posed by the *Leaderless Consensus* paper, and
+further steps that could be taken to advance on these issues.
+
+Please note that the entire content of this article is **purely speculative**
+and does not include any *positive results*. Note also that we are not
+discussing Byzantine-tolerant systems, which seem to be the main focus of
+*Leaderless Consensus*, even though the authors also propose an algorithm for
+non-Byzantine systems (the one we are interested in).
+
+---
+
+## Main takeaways of *Leaderless Consensus*
+
+To be able to meaningfully say that an algorithm is *leaderless*, one has to first
+determine what *leaderless* precisely means. The paper starts by offering such
+a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"),
+where *n* nodes are running in synchronous steps where at most *k* nodes might be
+offline, paused, or otherwise unavailable, at each step.
+The *synchronous-k* model has a variant called *eventually synchronous-k* which seems
+to better model the behaviour of WAN links on the Internet, although I am not sure
+of the precise difference between the two. Once the *synchronous-k* network model
+is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm
+that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system.
+Concretely, this means that at any given time, a random node in the network may be
+disconnected (not always the same one), and the consensus algorithm will be impacted
+only minimally. In other words, we can say that a leaderless consensus algorithm
+degrades gracefully in the presence of transient node failures.
+This "graceful degradation" property, which Raft does not have,
+seems to be exactly what we are looking for in a potential consensus algorithm that
+could be added to Garage.
+
+Having given this definition, the paper continues by offering concrete
+algorithms to implement leaderless consensus. Of particular interest to us, the
+paper presents in Section 5 a leaderless consensus algorithm, which they call
+OFT-Archipelago, which works in message passing systems without Byzantine
+nodes, where the only faults that can occur are message omissions (like
+messages being dropped by the network, or temporary node crashes).  This is
+exactly the premise made by Garage, so this algorithm could be a good candidate
+for us. Interestingly, while leaderless consensus is formally defined as a
+consensus algorithm that works in a *synchronous-1* system (i.e.  tolerating
+only one failed node at each step), Archipelago works with up to *f < n/2*
+unavailable nodes at each time steps.
+
+According to the benchmarks in the leaderless consensus paper, while
+Archipelago has very good throughput (around 50kops/s), the latency of
+individual operations is generally between 1 or 2 seconds. This seems to be
+acceptable for application in Garage if used only for administrative operations
+on buckets and access keys which are relatively rare. From a theoretical point
+of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,
+however it is not clear to me whether there is an upper bound on the
+termination time, or whether there is a probabilistic analysis of the
+termination delays that could be made.  It is also not very clear to me the
+link between this algorithm and the FLP impossibility theorem: since
+Archipelago seems to do things that are forbidden by FLP, it means that the
+premise of a *synchronous-k* system is probably in fact much stronger that the
+network asynchrony assumed by FLP.
+
+Among the other advantages of OFT-Archipelago is the fact that the algorithm
+seems to be very simple, much more than Raft, as it is described in the paper
+in only 42 lines of very understandable pseudocode.  There is also a BFT
+variant of Archipelago, which is not of interest to us in the context of Garage
+as we are making the hypothesis that all nodes are trusted.
+
+---
+
+## Where to go from now?
+
+Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental
+questions need to be answered, among which:
+
+- How should Archipelago interact with Garage's use of CRDT data types? Do we
+  have to create a fully separate subsystem for things that are managed under
+  consensus, or can we hopefully share some logic?  More precisely, can we use
+  a consensus algorithm simply as a total order broadcast primitive that
+  becomes a mandatory passing point for all modification requests on a set of
+  metadata tables, with those tables still being based on the CRDT table
+  replication and synchronisation library which is currently in use in Garage?
+  In this situation, nodes that come back from a crash can simply catch up on
+  old changes using the Merkle tree algorithm synchronisation algorithm that we
+  already have.  Or must we use the consensus algorithm as the only way to
+  broadcast operations and data for the tables that are managed by it? This
+  would mean that we must add specific logic to handle the case of a node
+  coming back from a crash, where it must either download all the log of
+  operations since it was last up, or an entire snapshot of the metadata tables
+  in question. I think this is mostly related to the reason we want to add
+  consensus, and the exact consistency guarantees we are expecting it to
+  provide to us.
+
+- Can Archipelago be made correct under cluster reconfiguration scenarios? This
+  is linked to the work done for task 3 of the 2023 NLnet project
+  ([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),
+  [#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses
+  on making the Quorum-based algorithm for CRDT updates reliable even when the
+  cluster layout is updated. I will be writing more about this topic in a
+  future blog post, but in a nutshell, the NLnet task is mainly focused on
+  maintaining read-after-write consistency in Garage at all times, which has
+  led us to develop a relatively general framework for modeling algorithm based
+  on quorums. Since Archipelago also guarantees its correctness using a
+  non-empty-intersection-of-quorums property, it could benefit from the work
+  that was originally made on quorums for the CRDT algorithms.
+
+If we obtain satisfactory answers to these questions, the remaining work will be
+the technical implementation of Archipelago in Garage and its validation:
+
+- Determine more precisely how the pipelined version of Archipelago is made,
+  as its complete description is not given in the leaderless consensus paper,
+  only a few basic pointers (Section 8.1 of the JPDC version).
+
+- Implement Archipelago in Rust, ideally under the form of a generic reusable crate
+  that could be used outside of the context of Garage.
+
+- Do a benchmark of Archipelago vs. existing Raft implementations (for instance
+  the async-raft crate).  We should benchmark the algorithms in the following
+  scenarios: stable networking, high latency and jitter, evolutive situation
+  with different phases.  My hypothesis is that Archipelago could be slower (in
+  terms of latency, not necessarily in throughput) than Raft in the stable
+  networking scenario, but the other two scenarios would force Raft to
+  reconfigure often (i.e. change leaders), which could be the source of huge
+  performance penalties, which Archipelago would not suffer from.
+
+- Integrate Archipelago with Garage to solve the CreateBucket issue.
+
+- To validate our implementation, we would want to test it using automated
+  testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task
+  3 and I'm starting to understand quite well how it works, so this could be
+  relatively easy.
+
+- If we want to go further, there is always the possibility of formalizing a
+  proof of our implementation, however I don't know what are the good tools to
+  do this, and in all cases it would be an extreme amount of work.
+
+
+Please send your comments and feedback to
+[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.
+
+---
+
+<sup id="1">1</sup>: We are concerned only with consensus algorithms in the
+context of closed, trusted systems such as distributed databases, and not in
+large trustless networks such as blockchains.
+
+Written by [Alex Auvolat](https://adnab.me).
author	Alex Auvolat <alex@adnab.me>	2023-11-30 13:41:38 +0100
committer	Alex Auvolat <alex@adnab.me>	2023-11-30 13:41:38 +0100
commit	c8514a37935492cc3657d317d26c8bf1974de4f6 (patch)
tree	a51592e733c7ea2893c9668481c7cdceccb276cd /content/blog
parent	16fa02f53a020e59360df688bb3fc7583b52dc92 (diff)
download	garagehq.deuxfleurs.fr-c8514a37935492cc3657d317d26c8bf1974de4f6.tar.gz garagehq.deuxfleurs.fr-c8514a37935492cc3657d317d26c8bf1974de4f6.zip