1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
|
+++
title="Maintaining Read-after-Write consistency in all circumstances"
date=2023-12-25
+++
*Garage is a data storage system that is based on CRDTs internally. It does not
use a consensus algorithm such as Raft, therefore maintaining consistency in a
cluster has to be done by other means. Since its inception, Garage has made use
of read and write quorums to guarantee read-after-write consistency, the only
consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee
is not maintained when the composition of a cluster is updated and data is
moved between storage nodes. As part of our current NLnet-funded project, we
are developping a solution to this problem, that is briefly explained in this
blog post.*
<!-- more -->
---
Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:
**Read-after-write consistency.** *If a client A writes an object x (e.g. using PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.*
The consistency guarantee offered by Garage is slightly more general than this
simplistic formulation, as it also applies to other S3 endpoints such as
ListObjects, which are always guaranteed to reflect the latest version of
objects inserted in a bucket.
This consistency guarantee at the level of objects in the S3 API is in fact a
reflection of read-after-write consistency in the internal metadata engine of
Garage (which is a distributed key/value store with CRDT values). Reads and
writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
ensuring that if operation B starts after operation A has completed, then there
is at least one node that is handling both operation A and B. In the case where
A is a write (an update) and B is a read, that node will have the opportunity
to return the value written in A to the reading client B. A visual depiction
of this process can be found in [this
presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page 54).
Note that read-after-write guarantees [are broken and have always
been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for the bucket
and access key tables, which might not be something we can fix due to different
requirements on the quorums.
## Current issues with Read-after-Write consistency
Maintaining Read-after-Write consistency depends crucially on the intersection
of the quorums being non-empty. There is however a scenario where these quorums
may be empty: when the set of nodes affected to storing some entries changes,
for instance when nodes are added or removed and data is being rebalanced
between nodes.
### A concrete example
Take the case of a partition (a subset of the data stored by Garage) which is
stored on nodes A, B and C. At some point, a layout change occurs in the
cluster, and after the change, nodes A, D and E are responsible for storing the
partition. All read and write operations that were initiated before the layout
change, or by nodes that were not yet aware of the new layout version, will be
directed to nodes A, B and C, and will be handled by a quorum of two nodes among
those three. However, once the new layout is introduced in the cluster, read
and write operations will start being directed to nodes A, D and E, expecting a
quorum of two nodes among this new set of three nodes.
Crucially, coordinating when operations start being directed to the new layout
is a hard problem, and in all cases we must assume that due to some network
asynchrony, there can still be some nodes that keep sending requests to nodes
A, B and C for a long time even after everyone else is aware of the new layout.
Moreover, data will be progressively moved from nodes B and C to nodes D and E,
which can take a long time depending on the quantity of data. This creates a
period of uncertainty as to where exactly the data is stored in the cluster.
Overall, this basically means that there is no way to guarantee the
intersection-of-quorums property, which is necessary for read-after-write, with
such a simplistic scheme.
Concretely, here is a very simple scenario in which read-after-write is broken:
1. A write operation is directed to nodes A, B and C (the old layout), and
receives OK responses from nodes B and C, forming a quorum, so the write
completes successfully. The written data sent to node A is lost or delayed
for a long time.
2. The new layout version is introduced in the cluster.
3. Before nodes D and E have had the chance to retrieve the data that was
stored on nodes B and C, a read operation for the same key is directed to
nodes A, D and E. This request receives OK responses from nodes D and E,
both containing no data but still forming a quorum of 2 responses. So the
read returns a null value instead of the value that was written before, even
though the write operation reported a success.
### Evidencing the issue with Jepsen testing
The first thing that I had to do for the NLnet project was to develop a testing
framework to show that read-after-write consistency issues could in fact arise
in Garage when the cluster layout was updated.
To make such tests, I chose to use the Jepsen testing framework, which helps us
put distributed software in complex adverse scenarios and verify whether they
respect some claimed consistency guarantees or not. I will not enter into too
much detail on the testing procedure, but suffice to say that issues were
found. More precisely, I was able to show that Garage did guarantee
read-after-write in a variety of adverse scenarios such as network partitions,
node crashes and clock scrambling, but that it was unable to do so as soon as
regular layout updates were introduced.
The progress of the Jepsen testing work is tracked in [PR
#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544)
## Fixing Read-after-Write consistency when layouts change
To solve this issue, we will have to track the progress of the transfer of data from nodes of layout version n-1 to nodes of layout version n. As long as that transfer has not finished, we will have to use a dual-quorum strategy to ensure consistency:
- use write quorums amongst nodes of layout version n
- use two read quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n
This can be flipped the other way around, which might make more sense if we assume that reads are the most frequent operations and need to complete fast, however it might be a bit more tricky to implement:
- use two write quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n
- use read quorum amongst nodes of layout n-1
We will also have to add more synchronization to ensure that data is not saved to nodes that are no longer responsible for a given data partition, as nodes may not be informed of the layout change at exactly the same time and small inconsistencies may appear in this interval.
## Current status and future work
|