aboutsummaryrefslogtreecommitdiff
path: root/content/documentation/design
diff options
context:
space:
mode:
Diffstat (limited to 'content/documentation/design')
-rw-r--r--content/documentation/design/benchmarks.md81
-rw-r--r--content/documentation/design/goals.md53
-rw-r--r--content/documentation/design/img/endpoint-latency-dc.pngbin0 -> 131776 bytes
-rw-r--r--content/documentation/design/img/endpoint-latency.pngbin0 -> 127369 bytes
-rw-r--r--content/documentation/design/index.md26
-rw-r--r--content/documentation/design/internals.md98
-rw-r--r--content/documentation/design/related_work.md77
7 files changed, 335 insertions, 0 deletions
diff --git a/content/documentation/design/benchmarks.md b/content/documentation/design/benchmarks.md
new file mode 100644
index 0000000..6e5580e
--- /dev/null
+++ b/content/documentation/design/benchmarks.md
@@ -0,0 +1,81 @@
+# Benchmarks
+
+With Garage, we wanted to build a software defined storage service that follow the [KISS principle](https://en.wikipedia.org/wiki/KISS_principle),
+ that is suitable for geo-distributed deployments and more generally that would work well for community hosting (like a Mastodon instance).
+
+In our benchmarks, we aim to quantify how Garage performs on these goals compared to the other available solutions.
+
+## Geo-distribution
+
+The main challenge in a geo-distributed setup is latency between nodes of the cluster.
+The more a user request will require intra-cluster requests to complete, the more its latency will increase.
+This is especially true for sequential requests: requests that must wait the result of another request to be sent.
+We designed Garage without consensus algorithms (eg. Paxos or Raft) to minimize the number of sequential and parallel requests.
+
+This serie of benchmarks quantifies the impact of this design choice.
+
+### On a simple simulated network
+
+We start with a controlled environment, all the instances are running on the same (powerful enough) machine.
+
+To control the network latency, we simulate the network with [mknet](https://git.deuxfleurs.fr/trinity-1686a/mknet) (a tool we developped, based on `tc` and the linux network stack).
+To mesure S3 endpoints latency, we use our own tool [s3lat](https://git.deuxfleurs.fr/quentin/s3lat/) to observe only the intra-cluster latency and not some contention on the nodes (CPU, RAM, disk I/O, network bandwidth, etc.).
+Compared to other benchmark tools, S3Lat sends only one (small) request at the same time and measures its latency.
+We selected 5 standard endpoints that are often in the critical path: ListBuckets, ListObjects, GetObject, PutObject and RemoveObject.
+
+In this first benchmark, we consider 5 instances that are located in a different place each. To simulate the distance, we configure mknet with a RTT between each node of 100 ms +/- 20 ms of jitter. We get the following graph, where the colored bars represent the mean latency while the error bars the minimum and maximum one:
+
+![Comparison of endpoints latency for minio and garage](./img/endpoint-latency.png)
+
+Compared to garage, minio latency drastically increases on 3 endpoints: GetObject, PutObject, RemoveObject.
+
+We suppose that these requests on minio make transactions over Raft, involving 4 sequential requests: 1) sending the message to the leader, 2) having the leader dispatch it to the other nodes, 3) waiting for the confirmation of followers and finally 4) commiting it. With our current configuration, one Raft transaction will take around 400 ms. GetObject seems to correlate to 1 transaction while PutObject and RemoveObject seems to correlate to 2 or 3. Reviewing minio code would be required to confirm this hypothesis.
+
+Conversely, garage uses an architecture similar to DynamoDB and never require global cluster coordination to answer a request.
+Instead, garage can always contact the right node in charge of the requested data, and can answer in as low as one request in the case of GetObject and PutObject. We also observed that Garage latency, while often lower to minio, is more dispersed: garage is still in beta and has not received any performance optimization yet.
+
+As a conclusion, Garage performs well in such setup while minio will be hard to use, especially for interactive use cases.
+
+### On a complex simulated network
+
+This time we consider a more heterogeneous network with 6 servers spread in 3 datacenter, giving us 2 servers per datacenters.
+We consider that intra-DC communications are now very cheap with a latency of 0.5ms and without any jitter.
+The inter-DC remains costly with the same value as before (100ms +/- 20ms of jitter).
+We plot a similar graph as before:
+
+![Comparison of endpoints latency for minio and garage with 6 nodes in 3 DC](./img/endpoint-latency-dc.png)
+
+This new graph is very similar to the one before, neither minio or garage seems to benefit from this new topology, but they also do not suffer from it.
+
+Considering garage, this is expected: nodes in the same DC are put in the same zone, and then data are spread on different zones for data resiliency and availaibility.
+Then, in the default mode, requesting data requires to query at least 2 zones to be sure that we have the most up to date information.
+These requests will involve at least one inter-DC communication.
+In other words, we prioritize data availability and synchronization over raw performances.
+
+Minio's case is a bit different as by default a minio cluster is not location aware, so we can't explain its performances through location awareness.
+*We know that minio has a multi site mode but it is definitely not a first class citizen: data are asynchronously replicated from one minio cluster to another.*
+We suppose that, due to the consensus, for many of its requests minio will wait for a response of the majority of the server, also involving inter-DC communications.
+
+As a conclusion, our new topology did not influence garage or minio performances, confirming that in presence of latency, garage is the best fit.
+
+### On a real world deployment
+
+*TODO*
+
+
+## Performance stability
+
+A storage cluster will encounter different scenario over its life, many of them will not be predictable.
+In this context, we argue that, more than peak performances, we should seek predictable and stable performances to ensure data availability.
+
+### Reference
+
+*TODO*
+
+### On a degraded cluster
+
+*TODO*
+
+### At scale
+
+*TODO*
diff --git a/content/documentation/design/goals.md b/content/documentation/design/goals.md
new file mode 100644
index 0000000..10ef6a8
--- /dev/null
+++ b/content/documentation/design/goals.md
@@ -0,0 +1,53 @@
+# Goals and use cases
+
+## Goals and non-goals
+
+Garage is a lightweight geo-distributed data store that implements the
+[Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html)
+object storage protocole. It enables applications to store large blobs such
+as pictures, video, images, documents, etc., in a redundant multi-node
+setting. S3 is versatile enough to also be used to publish a static
+website.
+
+Garage is an opinionated object storage solutoin, we focus on the following **desirable properties**:
+
+ - **Self-contained & lightweight**: works everywhere and integrates well in existing environments to target [hyperconverged infrastructures](https://en.wikipedia.org/wiki/Hyper-converged_infrastructure).
+ - **Highly resilient**: highly resilient to network failures, network latency, disk failures, sysadmin failures.
+ - **Simple**: simple to understand, simple to operate, simple to debug.
+ - **Internet enabled**: made for multi-sites (eg. datacenters, offices, households, etc.) interconnected through regular Internet connections.
+
+We also noted that the pursuit of some other goals are detrimental to our initial goals.
+The following has been identified as **non-goals** (if these points matter to you, you should not use Garage):
+
+ - **Extreme performances**: high performances constrain a lot the design and the infrastructure; we seek performances through minimalism only.
+ - **Feature extensiveness**: we do not plan to add additional features compared to the ones provided by the S3 API.
+ - **Storage optimizations**: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication.
+ - **POSIX/Filesystem compatibility**: we do not aim at being POSIX compatible or to emulate any kind of filesystem. Indeed, in a distributed environment, such synchronizations are translated in network messages that impose severe constraints on the deployment.
+
+## Use-cases
+
+*Are you also using Garage in your organization? [Open a PR](https://git.deuxfleurs.fr/Deuxfleurs/garage) to add your use case here!*
+
+### Deuxfleurs
+
+[Deuxfleurs](https://deuxfleurs.fr) is an experimental non-profit hosting
+organization that develops Garage. Deuxfleurs is focused on building highly
+available infrastructure through redundancy in multiple geographical
+locations. They use Garage themselves for the following tasks:
+
+- Hosting of [main website](https://deuxfleurs.fr), [this website](https://garagehq.deuxfleurs.fr), as well as the personal website of many of the members of the organization
+
+- As a [Matrix media backend](https://github.com/matrix-org/synapse-s3-storage-provider)
+
+- To store personal data and shared documents through [Bagage](https://git.deuxfleurs.fr/Deuxfleurs/bagage), a homegrown WebDav-to-S3 proxy
+
+- In the Drone continuous integration platform to store task logs
+
+- As a Nix binary cache
+
+- As a backup target using `rclone`
+
+The Deuxfleurs Garage cluster is a multi-site cluster currently composed of
+4 nodes in 2 physical locations. In the future it will be expanded to at
+least 3 physical locations to fully exploit Garage's potential for high
+availability.
diff --git a/content/documentation/design/img/endpoint-latency-dc.png b/content/documentation/design/img/endpoint-latency-dc.png
new file mode 100644
index 0000000..7c7411c
--- /dev/null
+++ b/content/documentation/design/img/endpoint-latency-dc.png
Binary files differ
diff --git a/content/documentation/design/img/endpoint-latency.png b/content/documentation/design/img/endpoint-latency.png
new file mode 100644
index 0000000..741539a
--- /dev/null
+++ b/content/documentation/design/img/endpoint-latency.png
Binary files differ
diff --git a/content/documentation/design/index.md b/content/documentation/design/index.md
new file mode 100644
index 0000000..2e3b5fd
--- /dev/null
+++ b/content/documentation/design/index.md
@@ -0,0 +1,26 @@
+# Design
+
+The design section helps you to see Garage from a "big picture"
+perspective. It will allow you to understand if Garage is a good fit for
+you, how to better use it, how to contribute to it, what can Garage could
+and could not do, etc.
+
+- **[Goals and use cases](goals.md):** This page explains why Garage was concieved and what practical use cases it targets.
+
+- **[Related work](related_work.md):** This pages presents the theoretical background on which Garage is built, and describes other software storage solutions and why they didn't work for us.
+
+- **[Internals](internals.md):** This page enters into more details on how Garage manages data internally.
+
+## Talks
+
+We love to talk and hear about Garage, that's why we keep a log here:
+
+ - [(fr, 2021-11-13, video) Garage : Mille et une façons de stocker vos données](https://video.tedomum.net/w/moYKcv198dyMrT8hCS5jz9) and [slides (html)](https://rfid.deuxfleurs.fr/presentations/2021-11-13/garage/) - during [RFID#1](https://rfid.deuxfleurs.fr/programme/2021-11-13/) event
+
+ - [(en, 2021-04-28) Distributed object storage is centralised](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2021-04-28_spirals-team/talk.pdf)
+
+ - [(fr, 2020-12-02) Garage : jouer dans la cour des grands quand on est un hébergeur associatif](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2020-12-02_wide-team/talk.pdf)
+
+*Did you write or talk about Garage? [Open a pull request](https://git.deuxfleurs.fr/Deuxfleurs/garage/) to add a link here!*
+
+
diff --git a/content/documentation/design/internals.md b/content/documentation/design/internals.md
new file mode 100644
index 0000000..0b31584
--- /dev/null
+++ b/content/documentation/design/internals.md
@@ -0,0 +1,98 @@
+# Internals
+
+## Overview
+
+TODO: write this section
+
+- The Dynamo ring (see [this paper](https://dl.acm.org/doi/abs/10.1145/1323293.1294281) and [that paper](https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/eisenbud))
+
+- CRDTs (see [this paper](https://link.springer.com/chapter/10.1007/978-3-642-24550-3_29))
+
+- Consistency model of Garage tables
+
+In the meantime, you can find some information at the following links:
+
+- [this presentation (in French)](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/talks/2020-12-02_wide-team/talk.pdf)
+
+- [an old design draft](/working_documents/design_draft.md)
+
+
+## Garbage collection
+
+A faulty garbage collection procedure has been the cause of
+[critical bug #39](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/39).
+This precise bug was fixed in the code, however there are potentially more
+general issues with the garbage collector being too eager and deleting things
+too early. This has been the subject of
+[PR #135](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/135).
+This section summarizes the discussions on this topic.
+
+Rationale: we want to ensure Garage's safety by making sure things don't get
+deleted from disk if they are still needed. Two aspects are involved in this.
+
+### 1. Garbage collection of table entries (in `meta/` directory)
+
+The `Entry` trait used for table entries (defined in `tables/schema.rs`)
+defines a function `is_tombstone()` that returns `true` if that entry
+represents an entry that is deleted in the table. CRDT semantics by default
+keep all tombstones, because they are necessary for reconciliation: if node A
+has a tombstone that supersedes a value `x`, and node B has value `x`, A has to
+keep the tombstone in memory so that the value `x` can be properly deleted at
+node `B`. Otherwise, due to the CRDT reconciliation rule, the value `x` from B
+would flow back to A and a deleted item would reappear in the system.
+
+Here, we have some control on the nodes involved in storing Garage data.
+Therefore we have a garbage collector that is able to delete tombstones UNDER
+CERTAIN CONDITIONS. This garbage collector is implemented in `table/gc.rs`. To
+delete a tombstone, the following condition has to be met:
+
+- All nodes responsible for storing this entry are aware of the existence of
+ the tombstone, i.e. they cannot hold another version of the entry that is
+ superseeded by the tombstone. This ensures that deleting the tombstone is
+ safe and that no deleted value will come back in the system.
+
+Garage makes use of Sled's atomic operations (such as compare-and-swap and
+transactions) to ensure that only tombstones that have been correctly
+propagated to other nodes are ever deleted from the local entry tree.
+
+This GC is safe in the following sense: no non-tombstone data is ever deleted
+from Garage tables.
+
+**However**, there is an issue with the way this interacts with data
+rebalancing in the case when a partition is moving between nodes. If a node has
+some data of a partition for which it is not responsible, it has to offload it.
+However that offload process takes some time. In that interval, the GC does not
+check with that node if it has the tombstone before deleting the tombstone, so
+perhaps it doesn't have it and when the offload finally happens, old data comes
+back in the system.
+
+**PR 135 mostly fixes this** by implementing a 24-hour delay before anything is
+garbage collected in a table. This works under the assumption that rebalances
+that follow data shuffling terminate in less than 24 hours.
+
+**However**, in distributed systems, it is generally considered a bad practice
+to make assumptions that information propagates in a certain time interval:
+this consists in making a synchrony assumption, meaning that we are basically
+assuming a computing model that has much stronger properties than otherwise. To
+maximize the applicability of Garage, we would like to remove this assumption,
+and implement a system where time does not play a role. To do this, we would
+need to find a way to safely disable the GC when data is being shuffled around,
+and safely detect that the shuffling has terminated and thus the GC can be
+resumed. This introduces some complexity to the protocol and hasn't been
+tackled yet.
+
+### 2. Garbage collection of data blocks (in `data/` directory)
+
+Blocks in the data directory are reference-counted. In Garage versions before
+PR #135, blocks could get deleted from local disk as soon as their reference
+counter reached zero. We had a mechanism to not trigger this immediately at the
+rc-reaches-zero event, but the cleanup could be triggered by other means (for
+example by a block repair operation...). PR #135 added a safety measure so that
+blocks never get deleted in a 10 minute interval following the time when the RC
+reaches zero. This is a measure to make impossible race conditions such as #39.
+We would have liked to use a larger delay (e.g. 24 hours), but in the case of a
+rebalance of data, this would have led to the disk utilization to explode
+during the rebalancing, only to shrink again after 24 hours. The 10-minute
+delay is a compromise that gives good security while not having this problem of
+disk space explosion on rebalance.
+
diff --git a/content/documentation/design/related_work.md b/content/documentation/design/related_work.md
new file mode 100644
index 0000000..da3f807
--- /dev/null
+++ b/content/documentation/design/related_work.md
@@ -0,0 +1,77 @@
+# Related work
+
+## Context
+
+Data storage is critical: it can lead to data loss if done badly and/or on hardware failure.
+Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline.
+Moreover, it put a hard limit on scalability. Often this limit can be pushed back far away by buying expensive machines.
+But here we consider non specialized off the shelf machines that can be as low powered and subject to failures as a raspberry pi.
+
+Distributed storage may help to solve both availability and scalability problems on these machines.
+Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide.
+
+## Overview
+
+Block storage is the most low level one, it's like exposing your raw hard drive over the network.
+It requires very low latencies and stable network, that are often dedicated.
+However it provides disk devices that can be manipulated by the operating system with the less constraints: it can be partitioned with any filesystem, meaning that it supports even the most exotic features.
+We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel).
+Openstack Cinder proxy previous solution to provide an uniform API.
+
+File storage provides a higher abstraction, they are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem.
+Often, they relax some POSIX constraints while many applications will still be compatible without any modification.
+As an example, we are able to run MariaDB (very slowly) over GlusterFS...
+We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc.
+OpenStack Manila proxy previous solutions to provide an uniform API.
+
+Finally object storages provide the highest level abstraction.
+They are the testimony that the POSIX filesystem API is not adapted to distributed filesystems.
+Especially, the strong concistency has been dropped in favor of eventual consistency which is way more convenient and powerful in presence of high latencies and unreliability.
+We often read about S3 that pioneered the concept that it's a filesystem for the WAN.
+Applications must be adapted to work for the desired object storage service.
+Today, the S3 HTTP REST API acts as a standard in the industry.
+However, Amazon S3 source code is not open but alternatives were proposed.
+We identified Minio, Pithos, Swift and Ceph.
+Minio/Ceph enforces a total order, so properties similar to a (relaxed) filesystem.
+Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring.
+However Pithos is not maintained anymore. More precisely the company that published Pithos version 1 has developped a second version 2 but has not open sourced it.
+Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution.
+
+There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet.
+
+## Existing software
+
+**[MinIO](https://min.io/):** MinIO shares our *Self-contained & lightweight* goal but selected two of our non-goals: *Storage optimizations* through erasure coding and *POSIX/Filesystem compatibility* through strong consistency.
+However, by pursuing these two non-goals, MinIO do not reach our desirable properties.
+Firstly, it fails on the *Simple* property: due to the erasure coding, MinIO has severe limitations on how drives can be added or deleted from a cluster.
+Secondly, it fails on the *Internet enabled* property: due to its strong consistency, MinIO is latency sensitive.
+Furthermore, MinIO has no knowledge of "sites" and thus can not distribute data to minimize the failure of a given site.
+
+**[Openstack Swift](https://docs.openstack.org/swift/latest/):**
+OpenStack Swift at least fails on the *Self-contained & lightweight* goal.
+Starting it requires around 8GB of RAM, which is too much especially in an hyperconverged infrastructure.
+We also do not classify Swift as *Simple*.
+
+**[Ceph](https://ceph.io/ceph-storage/object-storage/):**
+This review holds for the whole Ceph stack, including the RADOS paper, Ceph Object Storage module, the RADOS Gateway, etc.
+At its core, Ceph has been designed to provide *POSIX/Filesystem compatibility* which requires strong consistency, which in turn
+makes Ceph latency-sensitive and fails our *Internet enabled* goal.
+Due to its industry oriented design, Ceph is also far from being *Simple* to operate and from being *Self-contained & lightweight* which makes it hard to integrate it in an hyperconverged infrastructure.
+In a certain way, Ceph and MinIO are closer together than they are from Garage or OpenStack Swift.
+
+**[Pithos](https://github.com/exoscale/pithos):**
+Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design.
+Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too).
+From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment.
+They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it.
+We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility.
+
+**[Riak CS](https://docs.riak.com/riak/cs/2.1.1/index.html):**
+*Not written yet*
+
+**[IPFS](https://ipfs.io/):**
+*Not written yet*
+
+## Specific research papers
+
+*Not yet written*