wip: blog post on read-after-write and nlnet task 3

author: Alex Auvolat <alex@adnab.me> 2023-12-01 13:30:29 +0100
committer: Alex Auvolat <alex@adnab.me> 2023-12-01 13:30:29 +0100
commit: 96b7e909764f3f2fc7a49fd3b738129128519dd9 (patch)
tree: bf15b797cf8f8cdbba327baa052ff8f4d5694354
parent: c8514a37935492cc3657d317d26c8bf1974de4f6 (diff)
download: garagehq.deuxfleurs.fr-96b7e909764f3f2fc7a49fd3b738129128519dd9.tar.gz
garagehq.deuxfleurs.fr-96b7e909764f3f2fc7a49fd3b738129128519dd9.zip
1 files changed, 128 insertions, 0 deletions
diff --git a/content/blog/2023-12-preserving-read-after-write-consistency/index.md b/content/blog/2023-12-preserving-read-after-write-consistency/index.md
new file mode 100644
index 0000000..f83d1cb
--- /dev/null
+++ b/content/blog/2023-12-preserving-read-after-write-consistency/index.md
@@ -0,0 +1,128 @@
++++
+title="Maintaining Read-after-Write consistency in all circumstances"
+date=2023-12-25
++++
+
+*Garage is a data storage system that is based on CRDTs internally. It does not
+use a consensus algorithm such as Raft, therefore maintaining consistency in a
+cluster has to be done by other means. Since its inception, Garage has made use
+of read and write quorums to guarantee read-after-write consistency, the only
+consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee
+is not maintained when the composition of a cluster is updated and data is
+moved between storage nodes. As part of our current NLnet-funded project, we
+are developping a solution to this problem, that is briefly explained in this
+blog post.*
+
+<!-- more -->
+
+---
+
+Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:
+
+**Read-after-write consistency.** *If a client A writes an object x (e.g. using PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.*
+
+The consistency guarantee offered by Garage is slightly more general than this
+simplistic formulation, as it also applies to other S3 endpoints such as
+ListObjects, which are always guaranteed to reflect the latest version of
+objects inserted in a bucket. 
+
+This consistency guarantee at the level of objects in the S3 API is in fact a
+reflection of read-after-write consistency in the internal metadata engine of
+Garage (which is a distributed key/value store with CRDT values).  Reads and
+writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
+ensuring that if operation B starts after operation A has completed, then there
+is at least one node that is handling both operation A and B. In the case where
+A is a write (an update) and B is a read, that node will have the opportunity
+to return the value written in A to the reading client B.  A visual depiction
+of this process can be found in [this
+presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
+on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page 54).
+
+Note that read-after-write guarantees [are broken and have always
+been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for the bucket
+and access key tables, which might not be something we can fix due to different
+requirements on the quorums. 
+
+## Current issues with Read-after-Write consistency
+
+Maintaining Read-after-Write consistency depends crucially on the intersection
+of the quorums being non-empty. There is however a scenario where these quorums
+may be empty: when the set of nodes affected to storing some entries changes,
+for instance when nodes are added or removed and data is being rebalanced
+between nodes.
+
+### A concrete example
+
+Take the case of a partition (a subset of the data stored by Garage) which is
+stored on nodes A, B and C. At some point, a layout change occurs in the
+cluster, and after the change, nodes A, D and E are responsible for storing the
+partition. All read and write operations that were initiated before the layout
+change, or by nodes that were not yet aware of the new layout version, will be
+directed to nodes A, B and C, and will be handled by a quorum of two nodes among
+those three.  However, once the new layout is introduced in the cluster, read
+and write operations will start being directed to nodes A, D and E, expecting a
+quorum of two nodes among this new set of three nodes. 
+
+Crucially,  coordinating when operations start being directed to the new layout
+is a hard problem, and in all cases we must assume that due to some network
+asynchrony, there can still be some nodes that keep sending requests to nodes
+A, B and C for a long time even after everyone else is aware of the new layout.
+Moreover, data will be progressively moved from nodes B and C to nodes D and E,
+which can take a long time depending on the quantity of data. This creates a
+period of uncertainty as to where exactly the data is stored in the cluster.
+Overall, this basically means that there is no way to guarantee the
+intersection-of-quorums property, which is necessary for read-after-write, with
+such a simplistic scheme.
+
+Concretely, here is a very simple scenario in which read-after-write is broken:
+
+1. A write operation is directed to nodes A, B and C (the old layout), and
+   receives OK responses from nodes B and C, forming a quorum, so the write
+   completes successfully.  The written data sent to node A is lost or delayed
+   for a long time.
+
+2. The new layout version is introduced in the cluster.
+
+3. Before nodes D and E have had the chance to retrieve the data that was
+   stored on nodes B and C, a read operation for the same key is directed to
+   nodes A, D and E.  This request receives OK responses from nodes D and E,
+   both containing no data but still forming a quorum of 2 responses. So the
+   read returns a null value instead of the value that was written before, even
+   though the write operation reported a success.
+
+
+### Evidencing the issue with Jepsen testing
+
+The first thing that I had to do for the NLnet project was to develop a testing
+framework to show that read-after-write consistency issues could in fact arise
+in Garage when the cluster layout was updated.
+
+To make such tests, I chose to use the Jepsen testing framework, which helps us
+put distributed software in complex adverse scenarios and verify whether they
+respect some claimed consistency guarantees or not. I will not enter into too
+much detail on the testing procedure, but suffice to say that issues were
+found. More precisely, I was able to show that Garage did guarantee
+read-after-write in a variety of adverse scenarios such as network partitions,
+node crashes and clock scrambling, but that it was unable to do so as soon as
+regular layout updates were introduced.
+
+The progress of the Jepsen testing work is tracked in [PR
+#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544)
+
+
+## Fixing Read-after-Write consistency when layouts change
+
+To solve this issue, we will have to track the progress of the transfer of data from nodes of layout version n-1 to nodes of layout version n. As long as that transfer has not finished, we will have to use a dual-quorum strategy to ensure consistency:
+
+- use write quorums amongst nodes of layout version n
+- use two read quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n
+
+This can be flipped the other way around, which might make more sense if we assume that reads are the most frequent operations and need to complete fast, however it might be a bit more tricky to implement:
+
+- use two write quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n
+- use read quorum amongst nodes of layout n-1
+
+We will also have to add more synchronization to ensure that data is not saved to nodes that are no longer responsible for a given data partition, as nodes may not be informed of the layout change at exactly the same time and small inconsistencies may appear in this interval.
+
+
+## Current status and future work
author	Alex Auvolat <alex@adnab.me>	2023-12-01 13:30:29 +0100
committer	Alex Auvolat <alex@adnab.me>	2023-12-01 13:30:29 +0100
commit	96b7e909764f3f2fc7a49fd3b738129128519dd9 (patch)
tree	bf15b797cf8f8cdbba327baa052ff8f4d5694354
parent	c8514a37935492cc3657d317d26c8bf1974de4f6 (diff)
download	garagehq.deuxfleurs.fr-96b7e909764f3f2fc7a49fd3b738129128519dd9.tar.gz garagehq.deuxfleurs.fr-96b7e909764f3f2fc7a49fd3b738129128519dd9.zip