aboutsummaryrefslogtreecommitdiff
path: root/content/documentation/design/internals.md
diff options
context:
space:
mode:
authorsptaule <lecas83@gmail.com>2022-01-24 20:58:36 +0100
committersptaule <lecas83@gmail.com>2022-01-24 20:58:36 +0100
commitdc932f973593747697b98aa612a308df889741ab (patch)
treed1b88306af8373d650161e845c26690a52e32d29 /content/documentation/design/internals.md
parent336fd3f7566c7872c17b30f7cd150e1d16899205 (diff)
downloadgaragehq.deuxfleurs.fr-dc932f973593747697b98aa612a308df889741ab.tar.gz
garagehq.deuxfleurs.fr-dc932f973593747697b98aa612a308df889741ab.zip
Added doc book
Diffstat (limited to 'content/documentation/design/internals.md')
-rw-r--r--content/documentation/design/internals.md98
1 files changed, 98 insertions, 0 deletions
diff --git a/content/documentation/design/internals.md b/content/documentation/design/internals.md
new file mode 100644
index 0000000..0b31584
--- /dev/null
+++ b/content/documentation/design/internals.md
@@ -0,0 +1,98 @@
+# Internals
+
+## Overview
+
+TODO: write this section
+
+- The Dynamo ring (see [this paper](https://dl.acm.org/doi/abs/10.1145/1323293.1294281) and [that paper](https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/eisenbud))
+
+- CRDTs (see [this paper](https://link.springer.com/chapter/10.1007/978-3-642-24550-3_29))
+
+- Consistency model of Garage tables
+
+In the meantime, you can find some information at the following links:
+
+- [this presentation (in French)](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/talks/2020-12-02_wide-team/talk.pdf)
+
+- [an old design draft](/working_documents/design_draft.md)
+
+
+## Garbage collection
+
+A faulty garbage collection procedure has been the cause of
+[critical bug #39](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/39).
+This precise bug was fixed in the code, however there are potentially more
+general issues with the garbage collector being too eager and deleting things
+too early. This has been the subject of
+[PR #135](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/135).
+This section summarizes the discussions on this topic.
+
+Rationale: we want to ensure Garage's safety by making sure things don't get
+deleted from disk if they are still needed. Two aspects are involved in this.
+
+### 1. Garbage collection of table entries (in `meta/` directory)
+
+The `Entry` trait used for table entries (defined in `tables/schema.rs`)
+defines a function `is_tombstone()` that returns `true` if that entry
+represents an entry that is deleted in the table. CRDT semantics by default
+keep all tombstones, because they are necessary for reconciliation: if node A
+has a tombstone that supersedes a value `x`, and node B has value `x`, A has to
+keep the tombstone in memory so that the value `x` can be properly deleted at
+node `B`. Otherwise, due to the CRDT reconciliation rule, the value `x` from B
+would flow back to A and a deleted item would reappear in the system.
+
+Here, we have some control on the nodes involved in storing Garage data.
+Therefore we have a garbage collector that is able to delete tombstones UNDER
+CERTAIN CONDITIONS. This garbage collector is implemented in `table/gc.rs`. To
+delete a tombstone, the following condition has to be met:
+
+- All nodes responsible for storing this entry are aware of the existence of
+ the tombstone, i.e. they cannot hold another version of the entry that is
+ superseeded by the tombstone. This ensures that deleting the tombstone is
+ safe and that no deleted value will come back in the system.
+
+Garage makes use of Sled's atomic operations (such as compare-and-swap and
+transactions) to ensure that only tombstones that have been correctly
+propagated to other nodes are ever deleted from the local entry tree.
+
+This GC is safe in the following sense: no non-tombstone data is ever deleted
+from Garage tables.
+
+**However**, there is an issue with the way this interacts with data
+rebalancing in the case when a partition is moving between nodes. If a node has
+some data of a partition for which it is not responsible, it has to offload it.
+However that offload process takes some time. In that interval, the GC does not
+check with that node if it has the tombstone before deleting the tombstone, so
+perhaps it doesn't have it and when the offload finally happens, old data comes
+back in the system.
+
+**PR 135 mostly fixes this** by implementing a 24-hour delay before anything is
+garbage collected in a table. This works under the assumption that rebalances
+that follow data shuffling terminate in less than 24 hours.
+
+**However**, in distributed systems, it is generally considered a bad practice
+to make assumptions that information propagates in a certain time interval:
+this consists in making a synchrony assumption, meaning that we are basically
+assuming a computing model that has much stronger properties than otherwise. To
+maximize the applicability of Garage, we would like to remove this assumption,
+and implement a system where time does not play a role. To do this, we would
+need to find a way to safely disable the GC when data is being shuffled around,
+and safely detect that the shuffling has terminated and thus the GC can be
+resumed. This introduces some complexity to the protocol and hasn't been
+tackled yet.
+
+### 2. Garbage collection of data blocks (in `data/` directory)
+
+Blocks in the data directory are reference-counted. In Garage versions before
+PR #135, blocks could get deleted from local disk as soon as their reference
+counter reached zero. We had a mechanism to not trigger this immediately at the
+rc-reaches-zero event, but the cleanup could be triggered by other means (for
+example by a block repair operation...). PR #135 added a safety measure so that
+blocks never get deleted in a 10 minute interval following the time when the RC
+reaches zero. This is a measure to make impossible race conditions such as #39.
+We would have liked to use a larger delay (e.g. 24 hours), but in the case of a
+rebalance of data, this would have led to the disk utilization to explode
+during the rebalancing, only to shrink again after 24 hours. The 10-minute
+delay is a compromise that gives good security while not having this problem of
+disk space explosion on rebalance.
+