From 002538f92c1d9f95f2d699337f7d891c6aa0c9a4 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 16:15:18 +0100 Subject: Refactor file organization --- doc/book/src/S3_Compatibility.md | 84 --------- doc/book/src/SUMMARY.md | 23 +-- doc/book/src/compatibility.md | 1 - doc/book/src/cookbook/index.md | 1 + doc/book/src/cookbook/website.md | 1 + doc/book/src/design/index.md | 1 + doc/book/src/design/internals.md | 158 +++++++++++++++++ doc/book/src/design/related_work.md | 56 ++++++ doc/book/src/development/devenv.md | 17 ++ doc/book/src/development/index.md | 1 + doc/book/src/devenv.md | 17 -- doc/book/src/getting_started.md | 5 - doc/book/src/getting_started/index.md | 5 + doc/book/src/internals.md | 158 ----------------- doc/book/src/intro.md | 24 +++ doc/book/src/reference_manual/index.md | 1 + doc/book/src/reference_manual/s3_compatibility.md | 84 +++++++++ doc/book/src/related_work.md | 56 ------ doc/book/src/website.md | 1 - doc/book/src/working_documents/index.md | 7 + doc/book/src/working_documents/load_balancing.md | 197 ++++++++++++++++++++++ 21 files changed, 566 insertions(+), 332 deletions(-) delete mode 100644 doc/book/src/S3_Compatibility.md delete mode 100644 doc/book/src/compatibility.md create mode 100644 doc/book/src/cookbook/index.md create mode 100644 doc/book/src/cookbook/website.md create mode 100644 doc/book/src/design/index.md create mode 100644 doc/book/src/design/internals.md create mode 100644 doc/book/src/design/related_work.md create mode 100644 doc/book/src/development/devenv.md create mode 100644 doc/book/src/development/index.md delete mode 100644 doc/book/src/devenv.md delete mode 100644 doc/book/src/getting_started.md create mode 100644 doc/book/src/getting_started/index.md delete mode 100644 doc/book/src/internals.md create mode 100644 doc/book/src/reference_manual/index.md create mode 100644 doc/book/src/reference_manual/s3_compatibility.md delete mode 100644 doc/book/src/related_work.md delete mode 100644 doc/book/src/website.md create mode 100644 doc/book/src/working_documents/index.md create mode 100644 doc/book/src/working_documents/load_balancing.md (limited to 'doc/book') diff --git a/doc/book/src/S3_Compatibility.md b/doc/book/src/S3_Compatibility.md deleted file mode 100644 index c0fc2863..00000000 --- a/doc/book/src/S3_Compatibility.md +++ /dev/null @@ -1,84 +0,0 @@ -## S3 Compatibility status - -### Global S3 features - -Implemented: - -- path-style URLs (`garage.tld/bucket/key`) -- putting and getting objects in buckets -- multipart uploads -- listing objects -- access control on a per-key-per-bucket basis - -Not implemented: - -- vhost-style URLs (`bucket.garage.tld/key`) -- object-level ACL -- encryption -- most `x-amz-` headers - - -### Endpoint implementation - -All APIs that are not mentionned are not implemented and will return a 400 bad request. - -#### AbortMultipartUpload - -Implemented. - -#### CompleteMultipartUpload - -Implemented badly. Garage will not check that all the parts stored correspond to the list given by the client in the request body. This means that the multipart upload might be completed with an invalid size. This is a bug and will be fixed. - -#### CopyObject - -Implemented. - -#### CreateBucket - -Garage does not accept creating buckets or giving access using API calls, it has to be done using the CLI tools. CreateBucket will return a 200 if the bucket exists and user has write access, and a 403 Forbidden in all other cases. - -#### CreateMultipartUpload - -Implemented. - -#### DeleteBucket - -Garage does not accept deleting buckets using API calls, it has to be done using the CLI tools. This request will return a 403 Forbidden. - -#### DeleteObject - -Implemented. - -#### DeleteObjects - -Implemented. - -#### GetObject - -Implemented. - -#### HeadBucket - -Implemented. - -#### HeadObject - -Implemented. - -#### ListObjects - -Implemented, but there isn't a very good specification of what `encoding-type=url` covers so there might be some encoding bugs. In our implementation the url-encoded fields are in the same in ListObjects as they are in ListObjectsV2. - -#### ListObjectsV2 - -Implemented. - -#### PutObject - -Implemented. - -#### UploadPart - -Implemented. - diff --git a/doc/book/src/SUMMARY.md b/doc/book/src/SUMMARY.md index 5d01dbee..7697948b 100644 --- a/doc/book/src/SUMMARY.md +++ b/doc/book/src/SUMMARY.md @@ -2,25 +2,28 @@ [The Garage Data Store](./intro.md) -- [Getting Started](./getting_started.md) +- [Getting Started](./getting_started/index.md) - [Installation](./getting_started/install.md) - [Configure a cluster](./getting_started/cluster.md) - [Create buckets and keys](./getting_started/bucket.md) - [Handle files](./getting_started/files.md) -- [Cookbooks]() - - [Host a website](./website.md) +- [Cookbook](./cookbook/index.md) + - [Host a website](./cookbook/website.md) - [Integrate as a media backend]() - [Operate a cluster]() -- [Reference Manual]() +- [Reference Manual](./reference_manual/index.md) - [Garage CLI]() - - [S3 API](./compatibility.md) + - [S3 API](./reference_manual/s3_compatibility.md) -- [Design]() - - [Related Work](./related_work.md) - - [Internals](./internals.md) +- [Design](./design/index.md) + - [Related Work](./design/related_work.md) + - [Internals](./design/internals.md) -- [Development]() - - [Setup your environment](./devenv.md) +- [Development](./development/index.md) + - [Setup your environment](./development/devenv.md) - [Your first contribution]() + +- [Working Documents](./working_documents/index.md) + - [Load Balancing Data](./working_documents/load_balancing.md) diff --git a/doc/book/src/compatibility.md b/doc/book/src/compatibility.md deleted file mode 100644 index acf9968b..00000000 --- a/doc/book/src/compatibility.md +++ /dev/null @@ -1 +0,0 @@ -# S3 API diff --git a/doc/book/src/cookbook/index.md b/doc/book/src/cookbook/index.md new file mode 100644 index 00000000..741ecbe7 --- /dev/null +++ b/doc/book/src/cookbook/index.md @@ -0,0 +1 @@ +# Cookbook diff --git a/doc/book/src/cookbook/website.md b/doc/book/src/cookbook/website.md new file mode 100644 index 00000000..2ea82a9a --- /dev/null +++ b/doc/book/src/cookbook/website.md @@ -0,0 +1 @@ +# Host a website diff --git a/doc/book/src/design/index.md b/doc/book/src/design/index.md new file mode 100644 index 00000000..3d14cb7c --- /dev/null +++ b/doc/book/src/design/index.md @@ -0,0 +1 @@ +# Design diff --git a/doc/book/src/design/internals.md b/doc/book/src/design/internals.md new file mode 100644 index 00000000..e712ae07 --- /dev/null +++ b/doc/book/src/design/internals.md @@ -0,0 +1,158 @@ +**WARNING: this documentation is more a "design draft", which was written before Garage's actual implementation. The general principle is similar but details have not yet been updated.** + +#### Modules + +- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate +- `metadata/`: metadata management +- `blocks/`: block management, writing, GC and rebalancing +- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc) +- `api/`: S3 API +- `web/`: web management interface + +#### Metadata tables + +**Objects:** + +- *Hash key:* Bucket name (string) +- *Sort key:* Object key (string) +- *Sort key:* Version timestamp (int) +- *Sort key:* Version UUID (string) +- Complete: bool +- Inline: bool, true for objects < threshold (say 1024) +- Object size (int) +- Mime type (string) +- Data for inlined objects (blob) +- Hash of first block otherwise (string) + +*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...* + +**Blocks:** + +- *Hash key:* Version UUID (string) +- *Sort key:* Offset of block in total file (int) +- Hash of data block (string) + +A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. +We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. +We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. +Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table. + +Thus, the workflow for reading an object is as follows: + +1. Check permissions (LDAP) +2. Read entry in object table. If data is inline, we have its data, stop here. + -> if several versions, take newest one and launch deletion of old ones in background +3. Read first block from cluster. If size <= 1 block, stop here. +4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks +5. Read subsequent blocks from cluster + +Workflow for PUT: + +1. Check write permission (LDAP) +2. Select a new version UUID +3. Write a preliminary entry for the new version in the objects table with complete = false +4. Send blocks to cluster and write entries in the blocks table +5. Update the version with complete = true and all of the accurate information (size, etc) +6. Return success to the user +7. Launch a background job to check and delete older versions + +Workflow for DELETE: + +1. Check write permission (LDAP) +2. Get current version (or versions) in object table +3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME +4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table + +To delete a version: + +1. List the blocks from Cassandra +2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC. +3. Delete all of the blocks from the blocks table +4. Finally, delete the version from the objects table + +Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read. + +("Soit P un problème, on s'en fout est une solution à ce problème") + +#### Block storage on disk + +**Blocks themselves:** + +- file path = /blobs/(first 3 hex digits of hash)/(rest of hash) + +**Reverse index for GC & other block-level metadata:** + +- file path = /meta/(first 3 hex digits of hash)/(rest of hash) +- map block hash -> set of version UUIDs where it is referenced + +Usefull metadata: + +- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist +- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm + +Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content). + +Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes. + +**Internal API:** + +- get(block hash) -> ok+data/not found/corrupted +- put(block hash & data, version uuid + offset) -> ok/error +- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error +- delete(block hash, version uuid + offset) -> ok/error + +GC: when last ref is deleted, delete block. +Long GC procedure: check in Cassandra that version UUIDs still exist and references this block. + +Rebalancing: takes as argument the list of newly added nodes. + +- List all blocks that we have. For each block: +- If it hits a newly introduced node, send it to them. + Use put with no data first to check if it has to be sent to them already or not. + Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time). +- If it doesn't hit us anymore, delete it and its reference list. + +Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters. + +#### Membership management + +Two sets of nodes: + +- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% + (eviction from this set after say 30 seconds without ping) +- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), + is a CRDT using a version number for the value of the whole set + +Thus, three states for nodes: + +- healthy: in both sets +- missing: not pingable but part of desired cluster +- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks + +Membership messages between nodes: + +- ping with current state + hash of current membership info -> reply with same info +- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) +- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) +- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) + +Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. +The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). +Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) + + +#### Constants + +- Block size: around 1MB ? --> Exoscale use 16MB chunks +- Number of tokens in the hash ring: one every 10Gb of allocated storage +- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) +- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds +- Ping interval: 10 seconds +- ?? + +#### Links + +- CDC: +- Erasure coding: +- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html) +- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) diff --git a/doc/book/src/design/related_work.md b/doc/book/src/design/related_work.md new file mode 100644 index 00000000..bae4691c --- /dev/null +++ b/doc/book/src/design/related_work.md @@ -0,0 +1,56 @@ +# Related Work + +## Context + +Data storage is critical: it can lead to data loss if done badly and/or on hardware failure. +Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline. +Moreover, it put a hard limit on scalability. Often this limit can be pushed back far away by buying expensive machines. +But here we consider non specialized off the shelf machines that can be as low powered and subject to failures as a raspberry pi. + +Distributed storage may help to solve both availability and scalability problems on these machines. +Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide. + +## Overview + +Block storage is the most low level one, it's like exposing your raw hard drive over the network. +It requires very low latencies and stable network, that are often dedicated. +However it provides disk devices that can be manipulated by the operating system with the less constraints: it can be partitioned with any filesystem, meaning that it supports even the most exotic features. +We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel). +Openstack Cinder proxy previous solution to provide an uniform API. + +File storage provides a higher abstraction, they are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. +Often, they relax some POSIX constraints while many applications will still be compatible without any modification. +As an example, we are able to run MariaDB (very slowly) over GlusterFS... +We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc. +OpenStack Manila proxy previous solutions to provide an uniform API. + +Finally object storages provide the highest level abstraction. +They are the testimony that the POSIX filesystem API is not adapted to distributed filesystems. +Especially, the strong concistency has been dropped in favor of eventual consistency which is way more convenient and powerful in presence of high latencies and unreliability. +We often read about S3 that pioneered the concept that it's a filesystem for the WAN. +Applications must be adapted to work for the desired object storage service. +Today, the S3 HTTP REST API acts as a standard in the industry. +However, Amazon S3 source code is not open but alternatives were proposed. +We identified Minio, Pithos, Swift and Ceph. +Minio/Ceph enforces a total order, so properties similar to a (relaxed) filesystem. +Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring. +However Pithos is not maintained anymore. More precisely the company that published Pithos version 1 has developped a second version 2 but has not open sourced it. +Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution. + +There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet. + +## Existing software + +**[Pithos](https://github.com/exoscale/pithos) :** +Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. +Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). +From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. +They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. +We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility. + +**[IPFS](https://ipfs.io/) :** +*Not written yet* + +## Specific research papers + +*Not yet written* diff --git a/doc/book/src/development/devenv.md b/doc/book/src/development/devenv.md new file mode 100644 index 00000000..6cb7c554 --- /dev/null +++ b/doc/book/src/development/devenv.md @@ -0,0 +1,17 @@ +# Setup your development environment + +We propose the following quickstart to setup a full dev. environment as quickly as possible: + + 1. Setup a rust/cargo environment. eg. `dnf install rust cargo` + 2. Install awscli v2 by following the guide [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html). + 3. Run `cargo build` to build the project + 4. Run `./script/dev-cluster.sh` to launch a test cluster (feel free to read the script) + 5. Run `./script/dev-configure.sh` to configure your test cluster with default values (same datacenter, 100 tokens) + 6. Run `./script/dev-bucket.sh` to create a bucket named `eprouvette` and an API key that will be stored in `/tmp/garage.s3` + 7. Run `source ./script/dev-env-aws.sh` to configure your CLI environment + 8. You can use `garage` to manage the cluster. Try `garage --help`. + 9. You can use the `awsgrg` alias to add, remove, and delete files. Try `awsgrg help`, `awsgrg cp /proc/cpuinfo s3://eprouvette/cpuinfo.txt`, or `awsgrg ls s3://eprouvette`. `awsgrg` is a wrapper on the `aws s3` command pre-configured with the previously generated API key (the one in `/tmp/garage.s3`) and localhost as the endpoint. + +Now you should be ready to start hacking on garage! + + diff --git a/doc/book/src/development/index.md b/doc/book/src/development/index.md new file mode 100644 index 00000000..459110d3 --- /dev/null +++ b/doc/book/src/development/index.md @@ -0,0 +1 @@ +# Development diff --git a/doc/book/src/devenv.md b/doc/book/src/devenv.md deleted file mode 100644 index 6cb7c554..00000000 --- a/doc/book/src/devenv.md +++ /dev/null @@ -1,17 +0,0 @@ -# Setup your development environment - -We propose the following quickstart to setup a full dev. environment as quickly as possible: - - 1. Setup a rust/cargo environment. eg. `dnf install rust cargo` - 2. Install awscli v2 by following the guide [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html). - 3. Run `cargo build` to build the project - 4. Run `./script/dev-cluster.sh` to launch a test cluster (feel free to read the script) - 5. Run `./script/dev-configure.sh` to configure your test cluster with default values (same datacenter, 100 tokens) - 6. Run `./script/dev-bucket.sh` to create a bucket named `eprouvette` and an API key that will be stored in `/tmp/garage.s3` - 7. Run `source ./script/dev-env-aws.sh` to configure your CLI environment - 8. You can use `garage` to manage the cluster. Try `garage --help`. - 9. You can use the `awsgrg` alias to add, remove, and delete files. Try `awsgrg help`, `awsgrg cp /proc/cpuinfo s3://eprouvette/cpuinfo.txt`, or `awsgrg ls s3://eprouvette`. `awsgrg` is a wrapper on the `aws s3` command pre-configured with the previously generated API key (the one in `/tmp/garage.s3`) and localhost as the endpoint. - -Now you should be ready to start hacking on garage! - - diff --git a/doc/book/src/getting_started.md b/doc/book/src/getting_started.md deleted file mode 100644 index 282f5034..00000000 --- a/doc/book/src/getting_started.md +++ /dev/null @@ -1,5 +0,0 @@ -# Getting Started - -Let's start your Garage journey! -In this chapter, we explain how to deploy a simple garage cluster and start interacting with it. -Our goal is to introduce you to Garage's workflows. diff --git a/doc/book/src/getting_started/index.md b/doc/book/src/getting_started/index.md new file mode 100644 index 00000000..282f5034 --- /dev/null +++ b/doc/book/src/getting_started/index.md @@ -0,0 +1,5 @@ +# Getting Started + +Let's start your Garage journey! +In this chapter, we explain how to deploy a simple garage cluster and start interacting with it. +Our goal is to introduce you to Garage's workflows. diff --git a/doc/book/src/internals.md b/doc/book/src/internals.md deleted file mode 100644 index e712ae07..00000000 --- a/doc/book/src/internals.md +++ /dev/null @@ -1,158 +0,0 @@ -**WARNING: this documentation is more a "design draft", which was written before Garage's actual implementation. The general principle is similar but details have not yet been updated.** - -#### Modules - -- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate -- `metadata/`: metadata management -- `blocks/`: block management, writing, GC and rebalancing -- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc) -- `api/`: S3 API -- `web/`: web management interface - -#### Metadata tables - -**Objects:** - -- *Hash key:* Bucket name (string) -- *Sort key:* Object key (string) -- *Sort key:* Version timestamp (int) -- *Sort key:* Version UUID (string) -- Complete: bool -- Inline: bool, true for objects < threshold (say 1024) -- Object size (int) -- Mime type (string) -- Data for inlined objects (blob) -- Hash of first block otherwise (string) - -*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...* - -**Blocks:** - -- *Hash key:* Version UUID (string) -- *Sort key:* Offset of block in total file (int) -- Hash of data block (string) - -A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. -We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. -We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. -Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table. - -Thus, the workflow for reading an object is as follows: - -1. Check permissions (LDAP) -2. Read entry in object table. If data is inline, we have its data, stop here. - -> if several versions, take newest one and launch deletion of old ones in background -3. Read first block from cluster. If size <= 1 block, stop here. -4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks -5. Read subsequent blocks from cluster - -Workflow for PUT: - -1. Check write permission (LDAP) -2. Select a new version UUID -3. Write a preliminary entry for the new version in the objects table with complete = false -4. Send blocks to cluster and write entries in the blocks table -5. Update the version with complete = true and all of the accurate information (size, etc) -6. Return success to the user -7. Launch a background job to check and delete older versions - -Workflow for DELETE: - -1. Check write permission (LDAP) -2. Get current version (or versions) in object table -3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME -4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table - -To delete a version: - -1. List the blocks from Cassandra -2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC. -3. Delete all of the blocks from the blocks table -4. Finally, delete the version from the objects table - -Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read. - -("Soit P un problème, on s'en fout est une solution à ce problème") - -#### Block storage on disk - -**Blocks themselves:** - -- file path = /blobs/(first 3 hex digits of hash)/(rest of hash) - -**Reverse index for GC & other block-level metadata:** - -- file path = /meta/(first 3 hex digits of hash)/(rest of hash) -- map block hash -> set of version UUIDs where it is referenced - -Usefull metadata: - -- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist -- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm - -Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content). - -Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes. - -**Internal API:** - -- get(block hash) -> ok+data/not found/corrupted -- put(block hash & data, version uuid + offset) -> ok/error -- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error -- delete(block hash, version uuid + offset) -> ok/error - -GC: when last ref is deleted, delete block. -Long GC procedure: check in Cassandra that version UUIDs still exist and references this block. - -Rebalancing: takes as argument the list of newly added nodes. - -- List all blocks that we have. For each block: -- If it hits a newly introduced node, send it to them. - Use put with no data first to check if it has to be sent to them already or not. - Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time). -- If it doesn't hit us anymore, delete it and its reference list. - -Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters. - -#### Membership management - -Two sets of nodes: - -- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% - (eviction from this set after say 30 seconds without ping) -- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), - is a CRDT using a version number for the value of the whole set - -Thus, three states for nodes: - -- healthy: in both sets -- missing: not pingable but part of desired cluster -- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks - -Membership messages between nodes: - -- ping with current state + hash of current membership info -> reply with same info -- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) -- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) -- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) - -Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. -The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). -Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) - - -#### Constants - -- Block size: around 1MB ? --> Exoscale use 16MB chunks -- Number of tokens in the hash ring: one every 10Gb of allocated storage -- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) -- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds -- Ping interval: 10 seconds -- ?? - -#### Links - -- CDC: -- Erasure coding: -- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html) -- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) diff --git a/doc/book/src/intro.md b/doc/book/src/intro.md index ec77036f..02920f83 100644 --- a/doc/book/src/intro.md +++ b/doc/book/src/intro.md @@ -60,6 +60,30 @@ In a certain way, Ceph and Minio are closer togethers than they are from Garage *More comparisons are available in our [Related Work](design/related_work.md) chapter.* +## Other Resources + +This website is not the only source of information about Garage! +We reference here other places on the Internet where you can learn more about Garage. + +### Rust API (docs.rs) + +If you encounter a specific bug in Garage or plan to patch it, you may jump directly to the source code documentation! + + - [garage\_api](https://docs.rs/garage_api/latest/garage_api/) - contains the S3 standard API endpoint + - [garage\_model](https://docs.rs/garage_model/latest/garage_model/) - contains Garage's model built on the table abstraction + - [garage\_rpc](https://docs.rs/garage_rpc/latest/garage_rpc/) - contains Garage's federation protocol + - [garage\_table](https://docs.rs/garage_table/latest/garage_table/) - contains core Garage's CRDT datatypes + - [garage\_util](https://docs.rs/garage_util/latest/garage_util/) - contains garage entrypoints (daemon, cli) + - [garage\_web](https://docs.rs/garage_web/latest/garage_web/) - contains the S3 website endpoint + +### Talks + +We love to talk and hear about Garage, that's why we keep a log here: + + - [(fr, 2020-12-02) Garage : jouer dans la cour des grands quand on est un hébergeur associatif](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/master/doc/20201202_talk/talk.pdf) + +*Did you write or talk about Garage? [Open a pull request](https://git.deuxfleurs.fr/Deuxfleurs/garage/) to add a link here!* + ## Community If you want to discuss with us, you can join our Matrix channel at [#garage:deuxfleurs.fr](https://matrix.to/#/#garage:deuxfleurs.fr). diff --git a/doc/book/src/reference_manual/index.md b/doc/book/src/reference_manual/index.md new file mode 100644 index 00000000..a2f380f6 --- /dev/null +++ b/doc/book/src/reference_manual/index.md @@ -0,0 +1 @@ +# Reference Manual diff --git a/doc/book/src/reference_manual/s3_compatibility.md b/doc/book/src/reference_manual/s3_compatibility.md new file mode 100644 index 00000000..c0fc2863 --- /dev/null +++ b/doc/book/src/reference_manual/s3_compatibility.md @@ -0,0 +1,84 @@ +## S3 Compatibility status + +### Global S3 features + +Implemented: + +- path-style URLs (`garage.tld/bucket/key`) +- putting and getting objects in buckets +- multipart uploads +- listing objects +- access control on a per-key-per-bucket basis + +Not implemented: + +- vhost-style URLs (`bucket.garage.tld/key`) +- object-level ACL +- encryption +- most `x-amz-` headers + + +### Endpoint implementation + +All APIs that are not mentionned are not implemented and will return a 400 bad request. + +#### AbortMultipartUpload + +Implemented. + +#### CompleteMultipartUpload + +Implemented badly. Garage will not check that all the parts stored correspond to the list given by the client in the request body. This means that the multipart upload might be completed with an invalid size. This is a bug and will be fixed. + +#### CopyObject + +Implemented. + +#### CreateBucket + +Garage does not accept creating buckets or giving access using API calls, it has to be done using the CLI tools. CreateBucket will return a 200 if the bucket exists and user has write access, and a 403 Forbidden in all other cases. + +#### CreateMultipartUpload + +Implemented. + +#### DeleteBucket + +Garage does not accept deleting buckets using API calls, it has to be done using the CLI tools. This request will return a 403 Forbidden. + +#### DeleteObject + +Implemented. + +#### DeleteObjects + +Implemented. + +#### GetObject + +Implemented. + +#### HeadBucket + +Implemented. + +#### HeadObject + +Implemented. + +#### ListObjects + +Implemented, but there isn't a very good specification of what `encoding-type=url` covers so there might be some encoding bugs. In our implementation the url-encoded fields are in the same in ListObjects as they are in ListObjectsV2. + +#### ListObjectsV2 + +Implemented. + +#### PutObject + +Implemented. + +#### UploadPart + +Implemented. + diff --git a/doc/book/src/related_work.md b/doc/book/src/related_work.md deleted file mode 100644 index bae4691c..00000000 --- a/doc/book/src/related_work.md +++ /dev/null @@ -1,56 +0,0 @@ -# Related Work - -## Context - -Data storage is critical: it can lead to data loss if done badly and/or on hardware failure. -Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline. -Moreover, it put a hard limit on scalability. Often this limit can be pushed back far away by buying expensive machines. -But here we consider non specialized off the shelf machines that can be as low powered and subject to failures as a raspberry pi. - -Distributed storage may help to solve both availability and scalability problems on these machines. -Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide. - -## Overview - -Block storage is the most low level one, it's like exposing your raw hard drive over the network. -It requires very low latencies and stable network, that are often dedicated. -However it provides disk devices that can be manipulated by the operating system with the less constraints: it can be partitioned with any filesystem, meaning that it supports even the most exotic features. -We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel). -Openstack Cinder proxy previous solution to provide an uniform API. - -File storage provides a higher abstraction, they are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. -Often, they relax some POSIX constraints while many applications will still be compatible without any modification. -As an example, we are able to run MariaDB (very slowly) over GlusterFS... -We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc. -OpenStack Manila proxy previous solutions to provide an uniform API. - -Finally object storages provide the highest level abstraction. -They are the testimony that the POSIX filesystem API is not adapted to distributed filesystems. -Especially, the strong concistency has been dropped in favor of eventual consistency which is way more convenient and powerful in presence of high latencies and unreliability. -We often read about S3 that pioneered the concept that it's a filesystem for the WAN. -Applications must be adapted to work for the desired object storage service. -Today, the S3 HTTP REST API acts as a standard in the industry. -However, Amazon S3 source code is not open but alternatives were proposed. -We identified Minio, Pithos, Swift and Ceph. -Minio/Ceph enforces a total order, so properties similar to a (relaxed) filesystem. -Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring. -However Pithos is not maintained anymore. More precisely the company that published Pithos version 1 has developped a second version 2 but has not open sourced it. -Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution. - -There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet. - -## Existing software - -**[Pithos](https://github.com/exoscale/pithos) :** -Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. -Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). -From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. -They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. -We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility. - -**[IPFS](https://ipfs.io/) :** -*Not written yet* - -## Specific research papers - -*Not yet written* diff --git a/doc/book/src/website.md b/doc/book/src/website.md deleted file mode 100644 index 2ea82a9a..00000000 --- a/doc/book/src/website.md +++ /dev/null @@ -1 +0,0 @@ -# Host a website diff --git a/doc/book/src/working_documents/index.md b/doc/book/src/working_documents/index.md new file mode 100644 index 00000000..6eb7eccb --- /dev/null +++ b/doc/book/src/working_documents/index.md @@ -0,0 +1,7 @@ +# Working Documents + +Working documents are documents that reflect the fact that Garage is a software that evolves quickly. +They are a way to communicate our ideas, our changes, and so on. + +Ideally, while the feature/patch has been merged, the working document should serve as a source to +update the rest of the documentation. diff --git a/doc/book/src/working_documents/load_balancing.md b/doc/book/src/working_documents/load_balancing.md new file mode 100644 index 00000000..f0f1a4d4 --- /dev/null +++ b/doc/book/src/working_documents/load_balancing.md @@ -0,0 +1,197 @@ +## Load Balancing Data + +I have conducted a quick study of different methods to load-balance data over different Garage nodes using consistent hashing. + +### Requirements + +- *good balancing*: two nodes that have the same announced capacity should receive close to the same number of items + +- *multi-datacenter*: the replicas of a partition should be distributed over as many datacenters as possible + +- *minimal disruption*: when adding or removing a node, as few partitions as possible should have to move around + +- *order-agnostic*: the same set of nodes (each associated with a datacenter name + and a capacity) should always return the same distribution of partition + replicas, independently of the order in which nodes were added/removed (this + is to keep the implementation simple) + +### Methods + +#### Naive multi-DC ring walking strategy + +This strategy can be used with any ring-like algorithm to make it aware of the *multi-datacenter* requirement: + +In this method, the ring is a list of positions, each associated with a single node in the cluster. +Partitions contain all the keys between two consecutive items of the ring. +To find the nodes that store replicas of a given partition: + +- select the node for the position of the partition's lower bound +- go clockwise on the ring, skipping nodes that: + - we halve already selected + - are in a datacenter of a node we have selected, except if we already have nodes from all possible datacenters + +In this way the selected nodes will always be distributed over +`min(n_datacenters, n_replicas)` different datacenters, which is the best we +can do. + +This method was implemented in the first version of Garage, with the basic +ring construction from Dynamo DB that consists in associating `n_token` random positions to +each node (I know it's not optimal, the Dynamo paper already studies this). + +#### Better rings + +The ring construction that selects `n_token` random positions for each nodes gives a ring of positions that +is not well-balanced: the space between the tokens varies a lot, and some partitions are thus bigger than others. +This problem was demonstrated in the original Dynamo DB paper. + +To solve this, we want to apply a better second method for partitionning our dataset: + +1. fix an initially large number of partitions (say 1024) with evenly-spaced delimiters, + +2. attribute each partition randomly to a node, with a probability + proportionnal to its capacity (which `n_tokens` represented in the first + method) + +For now we continue using the multi-DC ring walking described above. + +I have studied two ways to do the attribution of partitions to nodes, in a way that is deterministic: + +- Min-hash: for each partition, select node that minimizes `hash(node, partition_number)` +- MagLev: see [here](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/) + +MagLev provided significantly better balancing, as it guarantees that the exact +same number of partitions is attributed to all nodes that have the same +capacity (and that this number is proportionnal to the node's capacity, except +for large values), however in both cases: + +- the distribution is still bad, because we use the naive multi-DC ring walking + that behaves strangely due to interactions between consecutive positions on + the ring + +- the disruption in case of adding/removing a node is not as low as it can be, + as we show with the following method. + +A quick description of MagLev (backend = node, lookup table = ring): + +> The basic idea of Maglev hashing is to assign a preference list of all the +> lookup table positions to each backend. Then all the backends take turns +> filling their most-preferred table positions that are still empty, until the +> lookup table is completely filled in. Hence, Maglev hashing gives an almost +> equal share of the lookup table to each of the backends. Heterogeneous +> backend weights can be achieved by altering the relative frequency of the +> backends’ turns… + +Here are some stats (run `scripts/simulate_ring.py` to reproduce): + +``` +##### Custom-ring (min-hash) ##### + +#partitions per node (capacity in parenthesis): +- datura (8) : 227 +- digitale (8) : 351 +- drosera (8) : 259 +- geant (16) : 476 +- gipsie (16) : 410 +- io (16) : 495 +- isou (8) : 231 +- mini (4) : 149 +- mixi (4) : 188 +- modi (4) : 127 +- moxi (4) : 159 + +Variance of load distribution for load normalized to intra-class mean +(a class being the set of nodes with the same announced capacity): 2.18% <-- REALLY BAD + +Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): +removing atuin digitale : 63.09% 30.18% 6.64% 0.10% +removing atuin drosera : 72.36% 23.44% 4.10% 0.10% +removing atuin datura : 73.24% 21.48% 5.18% 0.10% +removing jupiter io : 48.34% 38.48% 12.30% 0.88% +removing jupiter isou : 74.12% 19.73% 6.05% 0.10% +removing grog mini : 84.47% 12.40% 2.93% 0.20% +removing grog mixi : 80.76% 16.60% 2.64% 0.00% +removing grog moxi : 83.59% 14.06% 2.34% 0.00% +removing grog modi : 87.01% 11.43% 1.46% 0.10% +removing grisou geant : 48.24% 37.40% 13.67% 0.68% +removing grisou gipsie : 53.03% 33.59% 13.09% 0.29% +on average: 69.84% 23.53% 6.40% 0.23% <-- COULD BE BETTER + +-------- + +##### MagLev ##### + +#partitions per node: +- datura (8) : 273 +- digitale (8) : 256 +- drosera (8) : 267 +- geant (16) : 452 +- gipsie (16) : 427 +- io (16) : 483 +- isou (8) : 272 +- mini (4) : 184 +- mixi (4) : 160 +- modi (4) : 144 +- moxi (4) : 154 + +Variance of load distribution: 0.37% <-- Already much better, but not optimal + +Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): +removing atuin digitale : 62.60% 29.20% 7.91% 0.29% +removing atuin drosera : 65.92% 26.56% 7.23% 0.29% +removing atuin datura : 63.96% 27.83% 7.71% 0.49% +removing jupiter io : 44.63% 40.33% 14.06% 0.98% +removing jupiter isou : 63.38% 27.25% 8.98% 0.39% +removing grog mini : 72.46% 21.00% 6.35% 0.20% +removing grog mixi : 72.95% 22.46% 4.39% 0.20% +removing grog moxi : 74.22% 20.61% 4.98% 0.20% +removing grog modi : 75.98% 18.36% 5.27% 0.39% +removing grisou geant : 46.97% 36.62% 15.04% 1.37% +removing grisou gipsie : 49.22% 36.52% 12.79% 1.46% +on average: 62.94% 27.89% 8.61% 0.57% <-- WORSE THAN PREVIOUSLY +``` + +#### The magical solution: multi-DC aware MagLev + +Suppose we want to select three replicas for each partition (this is what we do in our simulation and in most Garage deployments). +We apply MagLev three times consecutively, one for each replica selection. +The first time is pretty much the same as normal MagLev, but for the following times, when a node runs through its preference +list to select a partition to replicate, we skip partitions for which adding this node would not bring datacenter-diversity. +More precisely, we skip a partition in the preference list if: + +- the node already replicates the partition (from one of the previous rounds of MagLev) +- the node is in a datacenter where a node already replicates the partition and there are other datacenters available + +Refer to `method4` in the simulation script for a formal definition. + +``` +##### Multi-DC aware MagLev ##### + +#partitions per node: +- datura (8) : 268 <-- NODES WITH THE SAME CAPACITY +- digitale (8) : 267 HAVE THE SAME NUM OF PARTITIONS +- drosera (8) : 267 (+- 1) +- geant (16) : 470 +- gipsie (16) : 472 +- io (16) : 516 +- isou (8) : 268 +- mini (4) : 136 +- mixi (4) : 136 +- modi (4) : 136 +- moxi (4) : 136 + +Variance of load distribution: 0.06% <-- CAN'T DO BETTER THAN THIS + +Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): +removing atuin digitale : 65.72% 33.01% 1.27% 0.00% +removing atuin drosera : 64.65% 33.89% 1.37% 0.10% +removing atuin datura : 66.11% 32.62% 1.27% 0.00% +removing jupiter io : 42.97% 53.42% 3.61% 0.00% +removing jupiter isou : 66.11% 32.32% 1.56% 0.00% +removing grog mini : 80.47% 18.85% 0.68% 0.00% +removing grog mixi : 80.27% 18.85% 0.88% 0.00% +removing grog moxi : 80.18% 19.04% 0.78% 0.00% +removing grog modi : 79.69% 19.92% 0.39% 0.00% +removing grisou geant : 44.63% 52.15% 3.22% 0.00% +removing grisou gipsie : 43.55% 52.54% 3.91% 0.00% +on average: 64.94% 33.33% 1.72% 0.01% <-- VERY GOOD (VERY LOW VALUES FOR 2 AND 3 NODES) +``` -- cgit v1.2.3