From 224c89ad6ed532d0d7d07309e72894dcdab1da1f Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Mon, 6 Dec 2021 16:10:32 +0100 Subject: Reorganize and improve documentation --- doc/book/src/SUMMARY.md | 9 +- doc/book/src/connect/index.md | 27 +++-- doc/book/src/cookbook/gateways.md | 6 +- doc/book/src/cookbook/index.md | 23 ++-- doc/book/src/cookbook/real_world.md | 4 +- doc/book/src/design/design_draft.md | 162 ------------------------- doc/book/src/design/goals.md | 53 ++++++++ doc/book/src/design/index.md | 26 ++-- doc/book/src/design/internals.md | 11 +- doc/book/src/design/related_work.md | 10 +- doc/book/src/intro.md | 59 +++------ doc/book/src/working_documents/design_draft.md | 162 +++++++++++++++++++++++++ 12 files changed, 289 insertions(+), 263 deletions(-) delete mode 100644 doc/book/src/design/design_draft.md create mode 100644 doc/book/src/design/goals.md create mode 100644 doc/book/src/working_documents/design_draft.md diff --git a/doc/book/src/SUMMARY.md b/doc/book/src/SUMMARY.md index cbf6bb70..878c20b3 100644 --- a/doc/book/src/SUMMARY.md +++ b/doc/book/src/SUMMARY.md @@ -8,7 +8,7 @@ - [Multi-node deployment](./cookbook/real_world.md) - [Building from source](./cookbook/from_source.md) - [Integration with systemd](./cookbook/systemd.md) - - [Gateways](./cookbook/gateways.md) + - [Configuring a gateway node](./cookbook/gateways.md) - [Exposing buckets as websites](./cookbook/exposing_websites.md) - [Configuring a reverse proxy](./cookbook/reverse_proxy.md) - [Recovering from failures](./cookbook/recovering.md) @@ -30,9 +30,9 @@ - [S3 compatibility status](./reference_manual/s3_compatibility.md) - [Design](./design/index.md) - - [Related Work](./design/related_work.md) + - [Goals and use Cases](./design/goals.md) + - [Related work](./design/related_work.md) - [Internals](./design/internals.md) - - [Design draft](./design/design_draft.md) - [Development](./development/index.md) - [Setup your environment](./development/devenv.md) @@ -41,5 +41,6 @@ - [Miscellaneous notes](./development/miscellaneous_notes.md) - [Working Documents](./working_documents/index.md) - - [Load Balancing Data](./working_documents/load_balancing.md) + - [Load balancing data](./working_documents/load_balancing.md) - [Migrating from 0.3 to 0.4](./working_documents/migration_04.md) + - [Design draft](./working_documents/design_draft.md) diff --git a/doc/book/src/connect/index.md b/doc/book/src/connect/index.md index 56c41255..703b19d4 100644 --- a/doc/book/src/connect/index.md +++ b/doc/book/src/connect/index.md @@ -1,7 +1,19 @@ -# Connect it to +# Connect it to... -To configure an S3 client to interact with Garage, you will need the following -parameters: +Garage implements the Amazon S3 protocol, which makes it compatible with many existing software programs. + +In particular, you will find here instructions to connect it with: + + - [web applications](./apps.md) + - [website hosting](./websites.md) + - [software repositories](./repositories.md) + - [CLI tools](./cli.md) + - [your own code](./code.md) + +### Generic instructions + +To configure S3-compatible software to interact with Garage, +you will need the following parameters: - An **API endpoint**: this corresponds to the HTTP or HTTPS address used to contact the Garage server. When runing Garage locally this will usually @@ -27,12 +39,3 @@ provided that you follow the following guidelines: If this is not configured explicitly, clients usually try to talk to region `us-east-1`. 
Garage should normally redirect your client to the correct region, but in case your client does not support this you might have to configure it manually. - -We will now provide example configurations for the most common clients per category: - - - [Apps](./apps.md) - - [Websites](./websites.md) - - [Repositories](./repositories.md) - - [CLI tools](./cli.md) - - [Your code](./code.md) - diff --git a/doc/book/src/cookbook/gateways.md b/doc/book/src/cookbook/gateways.md index 7b286b65..f03671a4 100644 --- a/doc/book/src/cookbook/gateways.md +++ b/doc/book/src/cookbook/gateways.md @@ -6,11 +6,11 @@ Gateways allow you to expose Garage endpoints (S3 API and websites) without stor You can configure Garage as a gateway on all nodes that will consume your S3 API, it will provide you the following benefits: - - **It removes 1 or 2 network RTT** Instead of (querying your reverse proxy then) querying a random node of the cluster that will forward your request to the nodes effectively storing the data, your local gateway will directly knows which node to query. + - **It removes 1 or 2 network RTT.** Instead of (querying your reverse proxy then) querying a random node of the cluster that will forward your request to the nodes effectively storing the data, your local gateway will directly knows which node to query. - - **It ease server management** Instead of tracking in your reverse proxy and DNS what are the current Garage nodes, your gateway being part of the cluster keeps this information for you. In your software, you will always specify `http://localhost:3900`. + - **It eases server management.** Instead of tracking in your reverse proxy and DNS what are the current Garage nodes, your gateway being part of the cluster keeps this information for you. In your software, you will always specify `http://localhost:3900`. - - **It simplifies security** Instead of having to maintain and renew a TLS certificate, you leverage the Secret Handshake protocol we use for our cluster. The S3 API protocol will be in plain text but limited to your local machine. + - **It simplifies security.** Instead of having to maintain and renew a TLS certificate, you leverage the Secret Handshake protocol we use for our cluster. The S3 API protocol will be in plain text but limited to your local machine. ## Limitations diff --git a/doc/book/src/cookbook/index.md b/doc/book/src/cookbook/index.md index da915f85..792a5e6e 100644 --- a/doc/book/src/cookbook/index.md +++ b/doc/book/src/cookbook/index.md @@ -4,22 +4,23 @@ A cookbook, when you cook, is a collection of recipes. Similarly, Garage's cookbook contains a collection of recipes that are known to works well! This chapter could also be referred as "Tutorials" or "Best practices". -- **[Deploying Garage](real_world.md):** This page will walk you through all of the necessary +- **[Multi-node deployment](real_world.md):** This page will walk you through all of the necessary steps to deploy Garage in a real-world setting. -- **[Configuring S3 clients](clients.md):** This page will explain how to configure - popular S3 clients to interact with a Garage server. +- **[Building from source](from_source.md):** This page explains how to build Garage from + source in case a binary is not provided for your architecture, or if you want to + hack with us! -- **[Hosting a website](website.md):** This page explains how to use Garage +- **[Integration with Systemd](systemd.md):** This page explains how to run Garage + as a Systemd service (instead of as a Docker container). 
+ +- **[Configuring a gateway node](gateways.md):** This page explains how to run a gateway node in a Garage cluster, i.e. a Garage node that doesn't store data but accelerates access to data present on the other nodes. + +- **[Hosting a website](exposing_websites.md):** This page explains how to use Garage to host a static website. +- **[Configuring a reverse-proxy](reverse_proxy.md):** This page explains how to configure a reverse-proxy to add TLS support to your S3 api endpoint. + - **[Recovering from failures](recovering.md):** Garage's first selling point is resilience to hardware failures. This section explains how to recover from such a failure in the best possible way. - -- **[Building from source](from_source.md):** This page explains how to build Garage from - source in case a binary is not provided for your architecture, or if you want to - hack with us! - -- **[Starting with Systemd](from_source.md):** This page explains how to run Garage - as a Systemd service (instead of as a Docker container). diff --git a/doc/book/src/cookbook/real_world.md b/doc/book/src/cookbook/real_world.md index 4b3fec2b..d1303d47 100644 --- a/doc/book/src/cookbook/real_world.md +++ b/doc/book/src/cookbook/real_world.md @@ -286,5 +286,5 @@ and is covered in the [quick start guide](../quick_start/index.md). Remember also that the CLI is self-documented thanks to the `--help` flag and the `help` subcommand (e.g. `garage help`, `garage key --help`). -Configuring an S3 client to interact with Garage is covered -[in the next section](clients.md). +Configuring S3-compatible applicatiosn to interact with Garage +is covered in the [Integrations](/connect/index.html) section. diff --git a/doc/book/src/design/design_draft.md b/doc/book/src/design/design_draft.md deleted file mode 100644 index 06ed46bd..00000000 --- a/doc/book/src/design/design_draft.md +++ /dev/null @@ -1,162 +0,0 @@ -# Design draft - -**WARNING: this documentation is a design draft which was written before Garage's actual implementation. -The general principle are similar, but details have not been updated.** - - -#### Modules - -- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate -- `metadata/`: metadata management -- `blocks/`: block management, writing, GC and rebalancing -- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc) -- `api/`: S3 API -- `web/`: web management interface - -#### Metadata tables - -**Objects:** - -- *Hash key:* Bucket name (string) -- *Sort key:* Object key (string) -- *Sort key:* Version timestamp (int) -- *Sort key:* Version UUID (string) -- Complete: bool -- Inline: bool, true for objects < threshold (say 1024) -- Object size (int) -- Mime type (string) -- Data for inlined objects (blob) -- Hash of first block otherwise (string) - -*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...* - -**Blocks:** - -- *Hash key:* Version UUID (string) -- *Sort key:* Offset of block in total file (int) -- Hash of data block (string) - -A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. 
-We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. -We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. -Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table. - -Thus, the workflow for reading an object is as follows: - -1. Check permissions (LDAP) -2. Read entry in object table. If data is inline, we have its data, stop here. - -> if several versions, take newest one and launch deletion of old ones in background -3. Read first block from cluster. If size <= 1 block, stop here. -4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks -5. Read subsequent blocks from cluster - -Workflow for PUT: - -1. Check write permission (LDAP) -2. Select a new version UUID -3. Write a preliminary entry for the new version in the objects table with complete = false -4. Send blocks to cluster and write entries in the blocks table -5. Update the version with complete = true and all of the accurate information (size, etc) -6. Return success to the user -7. Launch a background job to check and delete older versions - -Workflow for DELETE: - -1. Check write permission (LDAP) -2. Get current version (or versions) in object table -3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME -4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table - -To delete a version: - -1. List the blocks from Cassandra -2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC. -3. Delete all of the blocks from the blocks table -4. Finally, delete the version from the objects table - -Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read. - -("Soit P un problème, on s'en fout est une solution à ce problème") - -#### Block storage on disk - -**Blocks themselves:** - -- file path = /blobs/(first 3 hex digits of hash)/(rest of hash) - -**Reverse index for GC & other block-level metadata:** - -- file path = /meta/(first 3 hex digits of hash)/(rest of hash) -- map block hash -> set of version UUIDs where it is referenced - -Usefull metadata: - -- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist -- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm - -Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content). - -Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes. 
- -**Internal API:** - -- get(block hash) -> ok+data/not found/corrupted -- put(block hash & data, version uuid + offset) -> ok/error -- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error -- delete(block hash, version uuid + offset) -> ok/error - -GC: when last ref is deleted, delete block. -Long GC procedure: check in Cassandra that version UUIDs still exist and references this block. - -Rebalancing: takes as argument the list of newly added nodes. - -- List all blocks that we have. For each block: -- If it hits a newly introduced node, send it to them. - Use put with no data first to check if it has to be sent to them already or not. - Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time). -- If it doesn't hit us anymore, delete it and its reference list. - -Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters. - -#### Membership management - -Two sets of nodes: - -- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% - (eviction from this set after say 30 seconds without ping) -- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), - is a CRDT using a version number for the value of the whole set - -Thus, three states for nodes: - -- healthy: in both sets -- missing: not pingable but part of desired cluster -- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks - -Membership messages between nodes: - -- ping with current state + hash of current membership info -> reply with same info -- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) -- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) -- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) - -Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. -The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). -Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) - - -#### Constants - -- Block size: around 1MB ? --> Exoscale use 16MB chunks -- Number of tokens in the hash ring: one every 10Gb of allocated storage -- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) -- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds -- Ping interval: 10 seconds -- ?? 
-
-#### Links
-
-- CDC: 
-- Erasure coding: 
-- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html)
-- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf)
diff --git a/doc/book/src/design/goals.md b/doc/book/src/design/goals.md
new file mode 100644
index 00000000..10ef6a8f
--- /dev/null
+++ b/doc/book/src/design/goals.md
@@ -0,0 +1,53 @@
+# Goals and use cases
+
+## Goals and non-goals
+
+Garage is a lightweight geo-distributed data store that implements the
+[Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html)
+object storage protocol. It enables applications to store large blobs such
+as pictures, video, images, documents, etc., in a redundant multi-node
+setting. S3 is versatile enough to also be used to publish a static
+website.
+
+Garage is an opinionated object storage solution: we focus on the following **desirable properties**:
+
+  - **Self-contained & lightweight**: works everywhere and integrates well in existing environments to target [hyperconverged infrastructures](https://en.wikipedia.org/wiki/Hyper-converged_infrastructure).
+  - **Highly resilient**: highly resilient to network failures, network latency, disk failures, sysadmin failures.
+  - **Simple**: simple to understand, simple to operate, simple to debug.
+  - **Internet enabled**: made for multi-sites (eg. datacenters, offices, households, etc.) interconnected through regular Internet connections.
+
+We also noted that the pursuit of some other goals is detrimental to our initial goals.
+The following have been identified as **non-goals** (if these points matter to you, you should not use Garage):
+
+  - **Extreme performances**: high performances constrain a lot the design and the infrastructure; we seek performances through minimalism only.
+  - **Feature extensiveness**: we do not plan to add additional features compared to the ones provided by the S3 API.
+  - **Storage optimizations**: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication.
+  - **POSIX/Filesystem compatibility**: we do not aim at being POSIX compatible or to emulate any kind of filesystem. Indeed, in a distributed environment, such synchronizations are translated in network messages that impose severe constraints on the deployment.
+
+## Use-cases
+
+*Are you also using Garage in your organization? [Open a PR](https://git.deuxfleurs.fr/Deuxfleurs/garage) to add your use case here!*
+
+### Deuxfleurs
+
+[Deuxfleurs](https://deuxfleurs.fr) is an experimental non-profit hosting
+organization that develops Garage. Deuxfleurs is focused on building highly
+available infrastructure through redundancy in multiple geographical
+locations. They use Garage themselves for the following tasks:
+
+- Hosting of the [main website](https://deuxfleurs.fr), [this website](https://garagehq.deuxfleurs.fr), as well as the personal websites of many of the members of the organization
+
+- As a [Matrix media backend](https://github.com/matrix-org/synapse-s3-storage-provider)
+
+- To store personal data and shared documents through [Bagage](https://git.deuxfleurs.fr/Deuxfleurs/bagage), a homegrown WebDAV-to-S3 proxy
+
+- In the Drone continuous integration platform to store task logs
+
+- As a Nix binary cache
+
+- As a backup target using `rclone`
+
+The Deuxfleurs Garage cluster is a multi-site cluster currently composed of
+4 nodes in 2 physical locations.
+In the future it will be expanded to at
+least 3 physical locations to fully exploit Garage's potential for high
+availability.
diff --git a/doc/book/src/design/index.md b/doc/book/src/design/index.md
index 305f0501..2e3b5fd9 100644
--- a/doc/book/src/design/index.md
+++ b/doc/book/src/design/index.md
@@ -1,30 +1,22 @@
 # Design
 
-The design section helps you to see Garage from a "big picture" perspective.
-It will allow you to understand if Garage is a good fit for you,
-how to better use it, how to contribute to it, what can Garage could and could not do, etc.
+The design section helps you to see Garage from a "big picture"
+perspective. It will allow you to understand if Garage is a good fit for
+you, how to better use it, how to contribute to it, what Garage can
+and cannot do, etc.
 
-## Goals and non-goals
+- **[Goals and use cases](goals.md):** This page explains why Garage was conceived and what practical use cases it targets.
 
-Garage is an opinionated object storage solutoin, we focus on the following **desirable properties**:
+- **[Related work](related_work.md):** This page presents the theoretical background on which Garage is built, and describes other software storage solutions and why they didn't work for us.
 
-  - **Self-contained & lightweight**: works everywhere and integrates well in existing environments to target [hyperconverged infrastructures](https://en.wikipedia.org/wiki/Hyper-converged_infrastructure).
-  - **Highly resilient**: highly resilient to network failures, network latency, disk failures, sysadmin failures.
-  - **Simple**: simple to understand, simple to operate, simple to debug.
-  - **Internet enabled**: made for multi-sites (eg. datacenters, offices, households, etc.) interconnected through regular Internet connections.
-
-We also noted that the pursuit of some other goals are detrimental to our initial goals.
-The following has been identified as **non-goals** (if these points matter to you, you should not use Garage):
-
-  - **Extreme performances**: high performances constrain a lot the design and the infrastructure; we seek performances through minimalism only.
-  - **Feature extensiveness**: we do not plan to add additional features compared to the ones provided by the S3 API.
-  - **Storage optimizations**: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication.
-  - **POSIX/Filesystem compatibility**: we do not aim at being POSIX compatible or to emulate any kind of filesystem. Indeed, in a distributed environment, such synchronizations are translated in network messages that impose severe constraints on the deployment.
+- **[Internals](internals.md):** This page enters into more details on how Garage manages data internally.
## Talks We love to talk and hear about Garage, that's why we keep a log here: + - [(fr, 2021-11-13, video) Garage : Mille et une façons de stocker vos données](https://video.tedomum.net/w/moYKcv198dyMrT8hCS5jz9) and [slides (html)](https://rfid.deuxfleurs.fr/presentations/2021-11-13/garage/) - during [RFID#1](https://rfid.deuxfleurs.fr/programme/2021-11-13/) event + - [(en, 2021-04-28) Distributed object storage is centralised](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2021-04-28_spirals-team/talk.pdf) - [(fr, 2020-12-02) Garage : jouer dans la cour des grands quand on est un hébergeur associatif](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2020-12-02_wide-team/talk.pdf) diff --git a/doc/book/src/design/internals.md b/doc/book/src/design/internals.md index 255335fa..0b31584c 100644 --- a/doc/book/src/design/internals.md +++ b/doc/book/src/design/internals.md @@ -4,14 +4,17 @@ TODO: write this section -- The Dynamo ring +- The Dynamo ring (see [this paper](https://dl.acm.org/doi/abs/10.1145/1323293.1294281) and [that paper](https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/eisenbud)) -- CRDTs +- CRDTs (see [this paper](https://link.springer.com/chapter/10.1007/978-3-642-24550-3_29)) - Consistency model of Garage tables -See this presentation (in French) for some first information: - +In the meantime, you can find some information at the following links: + +- [this presentation (in French)](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/talks/2020-12-02_wide-team/talk.pdf) + +- [an old design draft](/working_documents/design_draft.md) ## Garbage collection diff --git a/doc/book/src/design/related_work.md b/doc/book/src/design/related_work.md index aaf10d7b..da3f807e 100644 --- a/doc/book/src/design/related_work.md +++ b/doc/book/src/design/related_work.md @@ -1,4 +1,4 @@ -# Related Work +# Related work ## Context @@ -55,21 +55,21 @@ We also do not classify Swift as *Simple*. **[Ceph](https://ceph.io/ceph-storage/object-storage/):** This review holds for the whole Ceph stack, including the RADOS paper, Ceph Object Storage module, the RADOS Gateway, etc. At its core, Ceph has been designed to provide *POSIX/Filesystem compatibility* which requires strong consistency, which in turn -makes Ceph latency-sensitive and fails our *Internet enabled* goal. +makes Ceph latency-sensitive and fails our *Internet enabled* goal. Due to its industry oriented design, Ceph is also far from being *Simple* to operate and from being *Self-contained & lightweight* which makes it hard to integrate it in an hyperconverged infrastructure. In a certain way, Ceph and MinIO are closer together than they are from Garage or OpenStack Swift. -**[Pithos](https://github.com/exoscale/pithos)** +**[Pithos](https://github.com/exoscale/pithos):** Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. 
It makes the development, the deployment and the operations more complicated while reducing the flexibility. -**[Riak CS](https://docs.riak.com/riak/cs/2.1.1/index.html)** +**[Riak CS](https://docs.riak.com/riak/cs/2.1.1/index.html):** *Not written yet* -**[IPFS](https://ipfs.io/) :** +**[IPFS](https://ipfs.io/):** *Not written yet* ## Specific research papers diff --git a/doc/book/src/intro.md b/doc/book/src/intro.md index 746f4d6a..10f9c0a2 100644 --- a/doc/book/src/intro.md +++ b/doc/book/src/intro.md @@ -15,60 +15,37 @@ # Data resiliency for everyone -OLD - -Garage is a lightweight geo-distributed data store that implements the -[Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html) -object storage protocole. It enables applications to store large blobs such -as pictures, video, images, documents, etc., in a redundant multi-node -setting. S3 is versatile enough to also be used to publish a static -website. - -Garage comes from the observation that despite the numerous existing -implementation of object stores, many people have broken data management -policies (backup/replication on a single site or none at all). To promote -better data management policies, we focused on the following **desirable -properties**: - -Non-goals: - - - **Extreme performances**: high performances constrain a lot the design and the infrastructure; we seek performances through minimalism only. - - **Feature extensiveness**: complete implementation of the S3 API or any other API to make Garage a drop-in replacement is not targeted as it could lead to decisions impacting our desirable properties. - - **Storage optimizations**: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication. - - **POSIX/Filesystem compatibility**: we do not aim at being POSIX compatible or to emulate any kind of filesystem. Indeed, in a distributed environment, such synchronizations are translated in network messages that impose severe constraints on the deployment. - -Use-cases: - -- **[Deuxfleurs](https://deuxfleurs.fr):** Garage is used by Deuxfleurs which - is a non-profit hosting organization. Especially, it is used to host their - main website, this documentation and some of its members' blogs. - Deuxfleurs also uses Garage as their [Matrix's media - backend](https://github.com/matrix-org/synapse-s3-storage-provider). - Deuxfleurs also uses it in its continuous integration platform to store - Drone's job logs and a Nix binary cache. - -ENDOLD - - -Garage is an **open-source** distributed **storage service** you can **self-host** to fullfill many needs. +Garage is an **open-source** distributed **storage service** you can **self-host** to fullfill many needs:

Summary of the possible usages with a related icon: host a website, store media and backup target

-Garage implements the **[Amazon S3 API](https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html)** and thus is already **compatible** with many applications. +

+⮞ learn more about use cases ⮜ +

+ +Garage implements the **[Amazon S3 API](https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html)** and thus is already **compatible** with many applications:

Garage is already compatible with Nextcloud, Mastodon, Matrix Synapse, Cyberduck, RClone and Peertube

+

+⮞ learn more about integrations ⮜ +

+ -Garage provides **data resiliency** by **replicating** data 3x over **distant** servers. +Garage provides **data resiliency** by **replicating** data 3x over **distant** servers:

An example deployment on a map with servers in 5 zones: UK, France, Belgium, Germany and Switzerland. Each chunk of data is replicated in 3 of these 5 zones.

+

+⮞ learn more about our design ⮜ +

+ Did you notice that *this website* is hosted and served by Garage? ## Keeping requirements low @@ -79,6 +56,7 @@ We worked hard to keep requirements as low as possible as we target the largest * **RAM:** 1GB * **Disk Space:** at least 16GB * **Network:** 200ms or less, 50 Mbps or more + * **Heterogeneous hardware:** build a cluster with whatever second-hand machines are available *For the network, as we do not use consensus algorithms like Paxos or Raft, Garage is not as latency sensitive.* *Thanks to Rust and its zero-cost abstractions, we keep CPU and memory low.* @@ -88,20 +66,15 @@ We worked hard to keep requirements as low as possible as we target the largest - [Dynamo: Amazon’s Highly Available Key-value Store ](https://dl.acm.org/doi/abs/10.1145/1323293.1294281) by DeCandia et al. - [Conflict-Free Replicated Data Types](https://link.springer.com/chapter/10.1007/978-3-642-24550-3_29) by Shapiro et al. - [Maglev: A Fast and Reliable Software Network Load Balancer](https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/eisenbud) by Eisenbud et al. - - [Merkle Search Trees: Efficient State-Based CRDTs in Open Networks](https://ieeexplore.ieee.org/document/9049566) by Auvolat and Taïani ## Talks -We love to talk and hear about Garage, that's why we keep a log here: - - [(fr, 2021-11-13, video) Garage : Mille et une façons de stocker vos données](https://video.tedomum.net/w/moYKcv198dyMrT8hCS5jz9) and [slides (html)](https://rfid.deuxfleurs.fr/presentations/2021-11-13/garage/) - during [RFID#1](https://rfid.deuxfleurs.fr/programme/2021-11-13/) event - [(en, 2021-04-28, pdf) Distributed object storage is centralised](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2021-04-28_spirals-team/talk.pdf) - [(fr, 2020-12-02, pdf) Garage : jouer dans la cour des grands quand on est un hébergeur associatif](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2020-12-02_wide-team/talk.pdf) -*Did you write or talk about Garage? [Open a pull request](https://git.deuxfleurs.fr/Deuxfleurs/garage/) to add a link here!* - ## Community If you want to discuss with us, you can join our Matrix channel at [#garage:deuxfleurs.fr](https://matrix.to/#/#garage:deuxfleurs.fr). diff --git a/doc/book/src/working_documents/design_draft.md b/doc/book/src/working_documents/design_draft.md new file mode 100644 index 00000000..06ed46bd --- /dev/null +++ b/doc/book/src/working_documents/design_draft.md @@ -0,0 +1,162 @@ +# Design draft + +**WARNING: this documentation is a design draft which was written before Garage's actual implementation. +The general principle are similar, but details have not been updated.** + + +#### Modules + +- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? 
Seems a huge library with many features so maybe overkill/hard to integrate
+- `metadata/`: metadata management
+- `blocks/`: block management, writing, GC and rebalancing
+- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc)
+- `api/`: S3 API
+- `web/`: web management interface
+
+#### Metadata tables
+
+**Objects:**
+
+- *Hash key:* Bucket name (string)
+- *Sort key:* Object key (string)
+- *Sort key:* Version timestamp (int)
+- *Sort key:* Version UUID (string)
+- Complete: bool
+- Inline: bool, true for objects < threshold (say 1024)
+- Object size (int)
+- Mime type (string)
+- Data for inlined objects (blob)
+- Hash of first block otherwise (string)
+
+*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...*
+
+**Blocks:**
+
+- *Hash key:* Version UUID (string)
+- *Sort key:* Offset of block in total file (int)
+- Hash of data block (string)
+
+A version is defined by the existence of at least one entry in the blocks table for a certain version UUID.
+We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table.
+We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object.
+Important: before deleting an older version from the objects table, we must make sure that we did a successful delete of the blocks of that version from the blocks table.
+
+Thus, the workflow for reading an object is as follows:
+
+1. Check permissions (LDAP)
+2. Read entry in object table. If data is inline, we have its data, stop here.
+  -> if several versions, take newest one and launch deletion of old ones in background
+3. Read first block from cluster. If size <= 1 block, stop here.
+4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks
+5. Read subsequent blocks from cluster
+
+Workflow for PUT:
+
+1. Check write permission (LDAP)
+2. Select a new version UUID
+3. Write a preliminary entry for the new version in the objects table with complete = false
+4. Send blocks to cluster and write entries in the blocks table
+5. Update the version with complete = true and all of the accurate information (size, etc)
+6. Return success to the user
+7. Launch a background job to check and delete older versions
+
+Workflow for DELETE:
+
+1. Check write permission (LDAP)
+2. Get current version (or versions) in object table
+3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME
+4. Return success to the user if we were able to delete blocks from the blocks table and entries from the object table
+
+To delete a version:
+
+1. List the blocks from Cassandra
+2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC.
+3. Delete all of the blocks from the blocks table
+4. Finally, delete the version from the objects table
+
+Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read.
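+
+To make the PUT workflow above more concrete, here is a minimal, self-contained sketch of
+steps 2 to 6, using plain in-memory maps in place of the objects and blocks tables. All
+names, types and the placeholder block "hash" are illustrative only and do not correspond
+to any actual implementation.
+
+```rust
+// Hypothetical sketch of the PUT workflow described above; illustrative only.
+use std::collections::HashMap;
+use std::time::{SystemTime, UNIX_EPOCH};
+
+#[derive(Debug)]
+struct ObjectVersion {
+    timestamp: u64, // version timestamp (sort key)
+    uuid: u64,      // version UUID (sort key)
+    complete: bool, // stays false until all blocks are written
+    size: usize,
+}
+
+fn main() {
+    // Objects table: (bucket, key) -> concurrent versions of that object.
+    let mut objects: HashMap<(String, String), Vec<ObjectVersion>> = HashMap::new();
+    // Blocks table: (version UUID, offset) -> hash of the data block.
+    let mut blocks: HashMap<(u64, usize), String> = HashMap::new();
+
+    let data = vec![0u8; 3_000_000]; // payload to store
+    let block_size = 1 << 20;        // ~1MB blocks, as proposed in the constants below
+
+    // 2. Select a new version UUID (illustrative: nanoseconds since the epoch).
+    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
+    let uuid = now.as_nanos() as u64;
+
+    // 3. Preliminary entry in the objects table with complete = false.
+    let versions = objects
+        .entry(("my-bucket".into(), "my/object".into()))
+        .or_default();
+    versions.push(ObjectVersion { timestamp: now.as_secs(), uuid, complete: false, size: 0 });
+
+    // 4. Send blocks to the cluster and write entries in the blocks table
+    //    (the "hash" here is only a placeholder string).
+    for (i, chunk) in data.chunks(block_size).enumerate() {
+        blocks.insert((uuid, i * block_size), format!("hash-of-{}-bytes", chunk.len()));
+    }
+
+    // 5. Flip complete to true with the accurate size, so that readers never
+    //    observe a half-written version.
+    if let Some(v) = versions.iter_mut().find(|v| v.uuid == uuid) {
+        v.complete = true;
+        v.size = data.len();
+    }
+
+    // 6. Success; step 7 (checking and deleting older versions) would run as a
+    //    background job.
+    println!("stored version {} in {} blocks", uuid, blocks.len());
+}
+```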
+
+("Let P be a problem, 'we don't care' is a solution to this problem")
+
+#### Block storage on disk
+
+**Blocks themselves:**
+
+- file path = /blobs/(first 3 hex digits of hash)/(rest of hash)
+
+**Reverse index for GC & other block-level metadata:**
+
+- file path = /meta/(first 3 hex digits of hash)/(rest of hash)
+- map block hash -> set of version UUIDs where it is referenced
+
+Useful metadata:
+
+- list of versions that reference this block in the Cassandra table, so that we can do GC by checking in Cassandra that the lines still exist
+- list of other nodes that we know have acknowledged a write of this block, useful in the rebalancing algorithm
+
+Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content).
+
+Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes.
+
+**Internal API:**
+
+- get(block hash) -> ok+data/not found/corrupted
+- put(block hash & data, version uuid + offset) -> ok/error
+- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error
+- delete(block hash, version uuid + offset) -> ok/error
+
+GC: when last ref is deleted, delete block.
+Long GC procedure: check in Cassandra that version UUIDs still exist and reference this block.
+
+Rebalancing: takes as argument the list of newly added nodes.
+
+- List all blocks that we have. For each block:
+- If it hits a newly introduced node, send it to them.
+  Use put with no data first to check if it has to be sent to them already or not.
+  Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time).
+- If it doesn't hit us anymore, delete it and its reference list.
+
+Only one balancing can be running at the same time. It can be restarted at the beginning with new parameters.
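+
+As an illustration of the on-disk layout described earlier in this section, the following
+sketch derives the block and metadata paths from the hex digest of a block hash. The root
+directory and the digest value are made up for the example.
+
+```rust
+// Illustrative only: computes /blobs/(first 3 hex digits)/(rest) and the
+// corresponding /meta/ path for a given block hash digest.
+use std::path::{Path, PathBuf};
+
+fn block_path(root: &Path, hex_digest: &str) -> PathBuf {
+    let (prefix, rest) = hex_digest.split_at(3);
+    root.join("blobs").join(prefix).join(rest)
+}
+
+fn meta_path(root: &Path, hex_digest: &str) -> PathBuf {
+    let (prefix, rest) = hex_digest.split_at(3);
+    root.join("meta").join(prefix).join(rest)
+}
+
+fn main() {
+    let root = Path::new("/var/lib/garage");                  // made-up storage root
+    let digest = "3f786850e387550fdab836ed7e6dc881de23001b";  // made-up hex digest
+    // -> /var/lib/garage/blobs/3f7/86850e387550fdab836ed7e6dc881de23001b
+    println!("{}", block_path(root, digest).display());
+    // -> /var/lib/garage/meta/3f7/86850e387550fdab836ed7e6dc881de23001b
+    println!("{}", meta_path(root, digest).display());
+}
+```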
+ +#### Membership management + +Two sets of nodes: + +- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% + (eviction from this set after say 30 seconds without ping) +- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), + is a CRDT using a version number for the value of the whole set + +Thus, three states for nodes: + +- healthy: in both sets +- missing: not pingable but part of desired cluster +- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks + +Membership messages between nodes: + +- ping with current state + hash of current membership info -> reply with same info +- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) +- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) +- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) + +Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. +The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). +Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) + + +#### Constants + +- Block size: around 1MB ? --> Exoscale use 16MB chunks +- Number of tokens in the hash ring: one every 10Gb of allocated storage +- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) +- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds +- Ping interval: 10 seconds +- ?? + +#### Links + +- CDC: +- Erasure coding: +- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html) +- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) -- cgit v1.2.3