content/blog/2023-12-preserving-read-after-write-consistency/index.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281

+++
title="Maintaining read-after-write consistency in all circumstances"
date=2023-12-06
+++

*Garage is a data storage system that is based on CRDTs internally. It does not
use a consensus algorithm such as Raft, therefore maintaining consistency in a
cluster has to be done by other means. Since its inception, Garage has made use
of read and write quorums to guarantee read-after-write consistency, the only
consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee
is not maintained when the composition of a cluster is updated and data is
moved between storage nodes. As part of our current NLnet-funded project, we
are developing a solution to this problem. This blog post proposes a
high-level overview of the proposed solution.*

<!-- more -->

---

Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

**Read-after-write consistency.** *If a client A writes an object x (e.g. using
PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to
read object x (e.g. using GetObject), then B will read the version written by
A, or a more recent version.*

The consistency guarantee offered by Garage is slightly more general than this
simplistic formulation, as it also applies to other S3 endpoints such as
ListObjects, which are always guaranteed to reflect the latest version of
objects inserted in a bucket.  Note that Amazon calls this guarantee [*strong*
read-after-write consistency](https://aws.amazon.com/s3/consistency/) (they
also have it on AWS), to differentiate it from [another definition of
read-after-write
consistency](https://avikdas.com/2020/04/13/scalability-concepts-read-after-write-consistency.html)
that only applies to data that is read by the same client that wrote it.  Since
that weaker form is also called
[read-your-writes](https://jepsen.io/consistency/models/read-your-writes), I
will always be referring to the strong version when using the term
"read-after-write consistency".

In Garage, this consistency guarantee at the level of objects in the S3 API is
in fact a reflection of read-after-write consistency in the internal metadata
engine (which is a distributed key/value store with CRDT values).  Reads and
writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
ensuring that if operation B starts after operation A has completed, then there
is at least one node that is handling both operation A and B. In the case where
A is a write (an update) and B is a read, that node will have the opportunity
to return the value written in A to the reading client B.  A visual depiction
of this process can be found in [this
presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page
54).

Note that read-after-write guarantees [are broken and have always
been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for metadata
related to buckets and access keys, which might not be something we can fix due
to different requirements on the quorums for the related metadata tables. 


## Current issues with read-after-write consistency

Maintaining read-after-write consistency depends crucially on the intersection
of the quorums being non-empty. There is however a scenario where these quorums
may be empty: when the set of nodes affected to storing some entries changes,
for instance when nodes are added or removed and data is being rebalanced
between nodes.

### A concrete example

Take the case of a partition (a subset of the data stored by Garage) which is
stored on nodes A, B and C. At some point, a layout change occurs in the
cluster, and after the change, nodes A, D and E are responsible for storing the
partition. All read and write operations that were initiated before the layout
change, or by nodes that were not yet aware of the new layout version, will be
directed to nodes A, B and C, and will be handled by a quorum of two nodes among
those three.  However, once the new layout is introduced in the cluster, read
and write operations will start being directed to nodes A, D and E, expecting a
quorum of two nodes among this new set of three nodes. 

Crucially,  coordinating when operations start being directed to the new layout
is a hard problem, and in all cases we must assume that due to some network
asynchrony, there can still be some nodes that keep sending requests to nodes
A, B and C for a long time even after everyone else is aware of the new layout.
Moreover, data will be progressively moved from nodes B and C to nodes D and E,
which can take a long time depending on the quantity of data. This creates a
period of uncertainty as to where exactly the data is stored in the cluster.
Overall, this basically means that this simplistic scheme gives us no way to
guarantee the intersection-of-quorums property, which is necessary for
read-after-write.

Concretely, here is a very simple scenario in which read-after-write is broken:

1. A write operation is directed to nodes A, B and C (the old layout), and
   receives OK responses from nodes B and C, forming a quorum, so the write
   completes successfully.  The written data then arrives to node A as well.

2. The new layout version is introduced in the cluster.

3. Before nodes D and E have had the chance to retrieve the data that was
   stored on nodes B and C, a read operation for the same key is directed to
   nodes A, D and E. D and E both return an OK response with no data (a null
   value), because they is not yet up-to-date.  An answer from node A is not
   received in time.  The two responses from nodes D and E, that contain no
   data, still form a quorum, so the read returns a null value instead of the
   value that was written before, even though the write operation reported a
   success.


### Evidencing the issue with Jepsen testing

The first thing that I had to do for the NLnet project was to develop a testing
framework to show that read-after-write consistency issues could in fact arise
in Garage when the cluster layout was updated.  To make such tests, I chose to
use the [Jepsen](https://jepsen.io/) testing framework, which helps us put
distributed software in complex adverse scenarios and verify whether they
respect some claimed consistency guarantees or not.

I will not enter into too much detail on the testing procedure, but suffice to
say that issues were found. More precisely, I was able to show that Garage
*did* guarantee read-after-write in a variety of adverse scenarios such as
network partitions, node crashes and clock scrambling, but that it was unable
to do so as soon as regular layout updates were introduced.

The progress of the Jepsen testing work is tracked in [PR
#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544)


## Fixing read-after-write consistency when layouts change

To solve this issue, we will have to keep track of several pieces of
information in the cluster.  We will also have to adapt our read/write quorums
and our data transfer strategy during rebalancing to make sure that data can be
found when it is requested.

First of all, we adapted Garage's code to be able to handle *several versions
of the cluster layout* that can be live in the cluster at the same time, to
keep track of multiple possible locations for data that is currently being
transferred between nodes.  When multiple cluster layout versions are live,
write operations are directed to all of the nodes responsible for storing the
data in all the live versions. This ensures that the nodes in the oldest live
layout version always have an up-to-date view of the data, and that a read
quorum among those nodes is always a safe way to ensure read-after-write
consistency.

Nodes will progressively synchronize data so that the nodes in the newest live
layout version will catch up with data stored by nodes in the older live layout
version.  Once nodes in the newer layout versions also have an up-to-date view
of the data, read operations will progressively start using a quorum of nodes
in the new layout version instead of the old one.

Once all nodes are reading from newer layout versions, the oldest live versions
can be pruned. This means that writes will stop being directed to those nodes,
and the nodes will delete the data they were storing.  Obviously, in the (very
common) case where some nodes are both in the old and new layout versions,
those nodes will not delete their data and they will continue to receive
writes.

### Performance impacts

When multiple layout versions are live, writes are sent to all nodes
responsible for the partition of the requested key in all live layout
versions, and will return OK only when they receive a quorum of OK responses
for each of the live layout versions.  This means that writes could be a bit
slower when a layout change is being synchronized in the cluster. Typically if
only one node is changing between the old and the new layout version, the write
operation will await for 3 responses among 4 requests, instead of the classical
2 responses among 3 requests.

Concerning reads, they are still sent to only three nodes.  Indeed,  they are
sent to the nodes of the newest live layout version for which nodes have
completed a sync to catch up on existing data, and they only expect a quorum of
2 responses among the three nodes of that layout version.  This way, reads
always stay as performant as when no layout change is being processed.

### Ensuring that new nodes are up-to-date

An additional coordination mechanism is necessary for the data synchronization
procedure, to ensure that it is not started too early and that after it
completes, the nodes in the new layout indeed contains an up-to-date view of
the data.

Indeed, imagine the following adverse scenario, which we want to avoid: a new
layout version is introduced in the cluster, and nodes immediately start
copying the data to the new nodes. However, some write operations that were
initiated before the new layout was introduced (or that were handled by a node
not yet aware of the layout) could be delayed, and the written data was not yet
received by the old nodes when they sent their copy of everything.  When the
sync reports completion, and read operations start being directed to nodes of
the new layout, the written data might be missing from the nodes handling the
read, and read-after-write consistency could be violated.

To avoid this situation, the synchronization operation is not initiated until
all cluster nodes have reported an "acknowledge" of the new layout version,
indicating that they have received the new layout version, and that they are no
longer processing write operations that were only addressed to nodes of the
previous layout versions.  This makes sure that no data will be missed by the
sync: once the sync has started, no more data can be written only to old layout
versions.  All of the writes will also be directed to the new nodes. More
exactly: all data that the source nodes of the sync does not yet contain when
the sync starts, is written by a write operation that is also directed at a
quorum of nodes among the new ones. This means that at the end of the sync, a
read quorum among the new nodes will necessarily return an up-to-date copy of
all of the data.

### Details on update trackers

As you can see, the previous algorithm needs to keep track of a lot of
information in the cluster. This information is kept in three "layout update trackers",
which keep track of the following information:

- The `ack` layout tracker keeps track of nodes receiving the latest layout
  versions and indicating that they are no longer processing writes addressed
  only to older layout versions. Once all nodes have acknowledged a new
  version, we know that all in-progress and future write operations that are
  made in the cluster are directed to the nodes that were added in this layout
  version as well.

- The `sync` layout tracker keeps track of nodes finishing a full metadata table
  sync, that was started after all nodes `ack`'ed the new layout version.

- The `sync_ack` layout tracker keeps track of nodes receiving the `sync`
  tracker update for all cluster nodes, and thus starting to direct reads to
  the newly synchronized layout version. This makes it possible to know when no
  more nodes are reading from an old version, at which point the corresponding
  data can be deleted.

In the simplest scenario, only two layout versions are live, and these trackers
therefore can only have the values `n` (the new layout version) and `n-1` (the
old one).  However this mechanism handles the general case where several
successive layout updates are being processed and more than two layout versions
are live simultaneously. The layout update trackers can take as values the
version numbers of any currently live layout version.

### What about dead nodes?

In this post I have used many times the phrases "once all nodes have
acknowledged a new layout version", or "once all nodes have completed a sync".
This obviously means that if some nodes are dead or unresponsive, the
processing of the layout update can be delayed indefinitely, and nodes in the
old layout versions will keep receiving writes and storing unnecessary data.
This is an unfortunate fact with the method proposed here. To cover for these
situations, the following workarounds can be made:

- A layout change is generally a supervised operation, meaning that a system
  administrator may manually intervene to inform the cluster that certain nodes
  are dead and that their layout tracker values should not be taken into
  account.

- For the `sync` update tracker, we don't actually need to wait for all of the
  synchronizations to terminate, quorums can be used instead as they should be
  sufficient to ensure that the copied data is up-to-date.

- For the `ack` and `sync_ack` update trackers, we can automatically increase
  them for all nodes (even dead ones) after a certain time delay, as there is
  no reason for the changes taking more than e.g. 10 minutes to propagate in
  regular conditions. We might not enable this behaviour by default, though,
  due to its possible impacts on consistency.


## Current status and future work

The work described in this blog post is currently almost complete but it still
needs to be ironed out.  I have made a first run of Jepsen testing on the new
code that showed that the changes seem to be fixing the issue.  I will be
running longer and more intensive runs of Jepsen testing once the code is
finished, to make sure everything is fine.  The changes will require a major
update of Garage: this will be the v0.10.0 release, which will probably be
finished in January or February of 2024. This update will be a very safe and
transparent update, as only the layout data structure is changed and nothing
related to object storage itself is touched.

If I had the time to do so, I would write the algorithm described in this post
in a formal way, in the form of a scientific paper. I believe such a paper
would be worthy of presenting at a scientific conference or journal, especially
given the fact that it is motivated by a very concrete use case and has been
validated quite thoroughly (with Jepsen).  Unfortunately, this is not my
highest priority at the moment.

---

Written by [Alex Auvolat](https://adnab.me).