From 0afc701a698c4891ea0f09fae668cb06b16757d7 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 14:44:14 +0100 Subject: Doc skeleton + intro --- doc/Compatibility.md | 84 ----------------- doc/Internals.md | 158 -------------------------------- doc/Related Work.md | 38 -------- doc/book/.gitignore | 1 + doc/book/book.toml | 6 ++ doc/book/src/S3_Compatibility.md | 84 +++++++++++++++++ doc/book/src/SUMMARY.md | 26 ++++++ doc/book/src/compatibility.md | 1 + doc/book/src/devenv.md | 17 ++++ doc/book/src/getting_started.md | 5 + doc/book/src/getting_started/bucket.md | 1 + doc/book/src/getting_started/cluster.md | 1 + doc/book/src/getting_started/files.md | 1 + doc/book/src/getting_started/install.md | 3 + doc/book/src/img/logo.svg | 44 +++++++++ doc/book/src/internals.md | 158 ++++++++++++++++++++++++++++++++ doc/book/src/intro.md | 62 +++++++++++++ doc/book/src/quickstart_bucket.md | 71 ++++++++++++++ doc/book/src/related_work.md | 38 ++++++++ doc/book/src/website.md | 1 + 20 files changed, 520 insertions(+), 280 deletions(-) delete mode 100644 doc/Compatibility.md delete mode 100644 doc/Internals.md delete mode 100644 doc/Related Work.md create mode 100644 doc/book/.gitignore create mode 100644 doc/book/book.toml create mode 100644 doc/book/src/S3_Compatibility.md create mode 100644 doc/book/src/SUMMARY.md create mode 100644 doc/book/src/compatibility.md create mode 100644 doc/book/src/devenv.md create mode 100644 doc/book/src/getting_started.md create mode 100644 doc/book/src/getting_started/bucket.md create mode 100644 doc/book/src/getting_started/cluster.md create mode 100644 doc/book/src/getting_started/files.md create mode 100644 doc/book/src/getting_started/install.md create mode 100644 doc/book/src/img/logo.svg create mode 100644 doc/book/src/internals.md create mode 100644 doc/book/src/intro.md create mode 100644 doc/book/src/quickstart_bucket.md create mode 100644 doc/book/src/related_work.md create mode 100644 doc/book/src/website.md (limited to 'doc') diff --git a/doc/Compatibility.md b/doc/Compatibility.md deleted file mode 100644 index c0fc2863..00000000 --- a/doc/Compatibility.md +++ /dev/null @@ -1,84 +0,0 @@ -## S3 Compatibility status - -### Global S3 features - -Implemented: - -- path-style URLs (`garage.tld/bucket/key`) -- putting and getting objects in buckets -- multipart uploads -- listing objects -- access control on a per-key-per-bucket basis - -Not implemented: - -- vhost-style URLs (`bucket.garage.tld/key`) -- object-level ACL -- encryption -- most `x-amz-` headers - - -### Endpoint implementation - -All APIs that are not mentionned are not implemented and will return a 400 bad request. - -#### AbortMultipartUpload - -Implemented. - -#### CompleteMultipartUpload - -Implemented badly. Garage will not check that all the parts stored correspond to the list given by the client in the request body. This means that the multipart upload might be completed with an invalid size. This is a bug and will be fixed. - -#### CopyObject - -Implemented. - -#### CreateBucket - -Garage does not accept creating buckets or giving access using API calls, it has to be done using the CLI tools. CreateBucket will return a 200 if the bucket exists and user has write access, and a 403 Forbidden in all other cases. - -#### CreateMultipartUpload - -Implemented. - -#### DeleteBucket - -Garage does not accept deleting buckets using API calls, it has to be done using the CLI tools. This request will return a 403 Forbidden. - -#### DeleteObject - -Implemented. - -#### DeleteObjects - -Implemented. - -#### GetObject - -Implemented. - -#### HeadBucket - -Implemented. - -#### HeadObject - -Implemented. - -#### ListObjects - -Implemented, but there isn't a very good specification of what `encoding-type=url` covers so there might be some encoding bugs. In our implementation the url-encoded fields are in the same in ListObjects as they are in ListObjectsV2. - -#### ListObjectsV2 - -Implemented. - -#### PutObject - -Implemented. - -#### UploadPart - -Implemented. - diff --git a/doc/Internals.md b/doc/Internals.md deleted file mode 100644 index e712ae07..00000000 --- a/doc/Internals.md +++ /dev/null @@ -1,158 +0,0 @@ -**WARNING: this documentation is more a "design draft", which was written before Garage's actual implementation. The general principle is similar but details have not yet been updated.** - -#### Modules - -- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate -- `metadata/`: metadata management -- `blocks/`: block management, writing, GC and rebalancing -- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc) -- `api/`: S3 API -- `web/`: web management interface - -#### Metadata tables - -**Objects:** - -- *Hash key:* Bucket name (string) -- *Sort key:* Object key (string) -- *Sort key:* Version timestamp (int) -- *Sort key:* Version UUID (string) -- Complete: bool -- Inline: bool, true for objects < threshold (say 1024) -- Object size (int) -- Mime type (string) -- Data for inlined objects (blob) -- Hash of first block otherwise (string) - -*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...* - -**Blocks:** - -- *Hash key:* Version UUID (string) -- *Sort key:* Offset of block in total file (int) -- Hash of data block (string) - -A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. -We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. -We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. -Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table. - -Thus, the workflow for reading an object is as follows: - -1. Check permissions (LDAP) -2. Read entry in object table. If data is inline, we have its data, stop here. - -> if several versions, take newest one and launch deletion of old ones in background -3. Read first block from cluster. If size <= 1 block, stop here. -4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks -5. Read subsequent blocks from cluster - -Workflow for PUT: - -1. Check write permission (LDAP) -2. Select a new version UUID -3. Write a preliminary entry for the new version in the objects table with complete = false -4. Send blocks to cluster and write entries in the blocks table -5. Update the version with complete = true and all of the accurate information (size, etc) -6. Return success to the user -7. Launch a background job to check and delete older versions - -Workflow for DELETE: - -1. Check write permission (LDAP) -2. Get current version (or versions) in object table -3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME -4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table - -To delete a version: - -1. List the blocks from Cassandra -2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC. -3. Delete all of the blocks from the blocks table -4. Finally, delete the version from the objects table - -Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read. - -("Soit P un problème, on s'en fout est une solution à ce problème") - -#### Block storage on disk - -**Blocks themselves:** - -- file path = /blobs/(first 3 hex digits of hash)/(rest of hash) - -**Reverse index for GC & other block-level metadata:** - -- file path = /meta/(first 3 hex digits of hash)/(rest of hash) -- map block hash -> set of version UUIDs where it is referenced - -Usefull metadata: - -- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist -- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm - -Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content). - -Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes. - -**Internal API:** - -- get(block hash) -> ok+data/not found/corrupted -- put(block hash & data, version uuid + offset) -> ok/error -- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error -- delete(block hash, version uuid + offset) -> ok/error - -GC: when last ref is deleted, delete block. -Long GC procedure: check in Cassandra that version UUIDs still exist and references this block. - -Rebalancing: takes as argument the list of newly added nodes. - -- List all blocks that we have. For each block: -- If it hits a newly introduced node, send it to them. - Use put with no data first to check if it has to be sent to them already or not. - Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time). -- If it doesn't hit us anymore, delete it and its reference list. - -Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters. - -#### Membership management - -Two sets of nodes: - -- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% - (eviction from this set after say 30 seconds without ping) -- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), - is a CRDT using a version number for the value of the whole set - -Thus, three states for nodes: - -- healthy: in both sets -- missing: not pingable but part of desired cluster -- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks - -Membership messages between nodes: - -- ping with current state + hash of current membership info -> reply with same info -- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) -- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) -- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) - -Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. -The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). -Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) - - -#### Constants - -- Block size: around 1MB ? --> Exoscale use 16MB chunks -- Number of tokens in the hash ring: one every 10Gb of allocated storage -- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) -- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds -- Ping interval: 10 seconds -- ?? - -#### Links - -- CDC: -- Erasure coding: -- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html) -- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) diff --git a/doc/Related Work.md b/doc/Related Work.md deleted file mode 100644 index c1a4eed4..00000000 --- a/doc/Related Work.md +++ /dev/null @@ -1,38 +0,0 @@ -## Context - -Data storage is critical: it can lead to data loss if done badly and/or on hardware failure. -Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline. -Moreover, it put a hard limit on scalability. Often this limit can be pushed back far away by buying expensive machines. -But here we consider non specialized off the shelf machines that can be as low powered and subject to failures as a raspberry pi. - -Distributed storage may help to solve both availability and scalability problems on these machines. -Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide. - -## Related work - -Block storage is the most low level one, it's like exposing your raw hard drive over the network. -It requires very low latencies and stable network, that are often dedicated. -However it provides disk devices that can be manipulated by the operating system with the less constraints: it can be partitioned with any filesystem, meaning that it supports even the most exotic features. -We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel). -Openstack Cinder proxy previous solution to provide an uniform API. - -File storage provides a higher abstraction, they are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. -Often, they relax some POSIX constraints while many applications will still be compatible without any modification. -As an example, we are able to run MariaDB (very slowly) over GlusterFS... -We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc. -OpenStack Manila proxy previous solutions to provide an uniform API. - -Finally object storages provide the highest level abstraction. -They are the testimony that the POSIX filesystem API is not adapted to distributed filesystems. -Especially, the strong concistency has been dropped in favor of eventual consistency which is way more convenient and powerful in presence of high latencies and unreliability. -We often read about S3 that pioneered the concept that it's a filesystem for the WAN. -Applications must be adapted to work for the desired object storage service. -Today, the S3 HTTP REST API acts as a standard in the industry. -However, Amazon S3 source code is not open but alternatives were proposed. -We identified Minio, Pithos, Swift and Ceph. -Minio/Ceph enforces a total order, so properties similar to a (relaxed) filesystem. -Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring. -However Pithos is not maintained anymore. More precisely the company that published Pithos version 1 has developped a second version 2 but has not open sourced it. -Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution. - -There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet. diff --git a/doc/book/.gitignore b/doc/book/.gitignore new file mode 100644 index 00000000..7585238e --- /dev/null +++ b/doc/book/.gitignore @@ -0,0 +1 @@ +book diff --git a/doc/book/book.toml b/doc/book/book.toml new file mode 100644 index 00000000..3e163990 --- /dev/null +++ b/doc/book/book.toml @@ -0,0 +1,6 @@ +[book] +authors = ["Quentin Dufour"] +language = "en" +multilingual = false +src = "src" +title = "Garage Documentation" diff --git a/doc/book/src/S3_Compatibility.md b/doc/book/src/S3_Compatibility.md new file mode 100644 index 00000000..c0fc2863 --- /dev/null +++ b/doc/book/src/S3_Compatibility.md @@ -0,0 +1,84 @@ +## S3 Compatibility status + +### Global S3 features + +Implemented: + +- path-style URLs (`garage.tld/bucket/key`) +- putting and getting objects in buckets +- multipart uploads +- listing objects +- access control on a per-key-per-bucket basis + +Not implemented: + +- vhost-style URLs (`bucket.garage.tld/key`) +- object-level ACL +- encryption +- most `x-amz-` headers + + +### Endpoint implementation + +All APIs that are not mentionned are not implemented and will return a 400 bad request. + +#### AbortMultipartUpload + +Implemented. + +#### CompleteMultipartUpload + +Implemented badly. Garage will not check that all the parts stored correspond to the list given by the client in the request body. This means that the multipart upload might be completed with an invalid size. This is a bug and will be fixed. + +#### CopyObject + +Implemented. + +#### CreateBucket + +Garage does not accept creating buckets or giving access using API calls, it has to be done using the CLI tools. CreateBucket will return a 200 if the bucket exists and user has write access, and a 403 Forbidden in all other cases. + +#### CreateMultipartUpload + +Implemented. + +#### DeleteBucket + +Garage does not accept deleting buckets using API calls, it has to be done using the CLI tools. This request will return a 403 Forbidden. + +#### DeleteObject + +Implemented. + +#### DeleteObjects + +Implemented. + +#### GetObject + +Implemented. + +#### HeadBucket + +Implemented. + +#### HeadObject + +Implemented. + +#### ListObjects + +Implemented, but there isn't a very good specification of what `encoding-type=url` covers so there might be some encoding bugs. In our implementation the url-encoded fields are in the same in ListObjects as they are in ListObjectsV2. + +#### ListObjectsV2 + +Implemented. + +#### PutObject + +Implemented. + +#### UploadPart + +Implemented. + diff --git a/doc/book/src/SUMMARY.md b/doc/book/src/SUMMARY.md new file mode 100644 index 00000000..3e4618a1 --- /dev/null +++ b/doc/book/src/SUMMARY.md @@ -0,0 +1,26 @@ +# Summary + +[The Garage Data Store](./intro.md) + +- [Getting Started](./getting_started.md) + - [Installation](./getting_started/install.md) + - [Configure a cluster](./getting_started/cluster.md) + - [Create buckets and keys](./getting_started/bucket.md) + - [Handle files](./getting_started/files.md) + +# Cookbooks + +- [Host a website](./website.md) + +# Reference Manual + +- [S3 API](./compatibility.md) + +# Design + +- [Related Work](./related_work.md) +- [Internals](./internals.md) + +# Development + +- [Setup your environment](./devenv.md) diff --git a/doc/book/src/compatibility.md b/doc/book/src/compatibility.md new file mode 100644 index 00000000..acf9968b --- /dev/null +++ b/doc/book/src/compatibility.md @@ -0,0 +1 @@ +# S3 API diff --git a/doc/book/src/devenv.md b/doc/book/src/devenv.md new file mode 100644 index 00000000..6cb7c554 --- /dev/null +++ b/doc/book/src/devenv.md @@ -0,0 +1,17 @@ +# Setup your development environment + +We propose the following quickstart to setup a full dev. environment as quickly as possible: + + 1. Setup a rust/cargo environment. eg. `dnf install rust cargo` + 2. Install awscli v2 by following the guide [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html). + 3. Run `cargo build` to build the project + 4. Run `./script/dev-cluster.sh` to launch a test cluster (feel free to read the script) + 5. Run `./script/dev-configure.sh` to configure your test cluster with default values (same datacenter, 100 tokens) + 6. Run `./script/dev-bucket.sh` to create a bucket named `eprouvette` and an API key that will be stored in `/tmp/garage.s3` + 7. Run `source ./script/dev-env-aws.sh` to configure your CLI environment + 8. You can use `garage` to manage the cluster. Try `garage --help`. + 9. You can use the `awsgrg` alias to add, remove, and delete files. Try `awsgrg help`, `awsgrg cp /proc/cpuinfo s3://eprouvette/cpuinfo.txt`, or `awsgrg ls s3://eprouvette`. `awsgrg` is a wrapper on the `aws s3` command pre-configured with the previously generated API key (the one in `/tmp/garage.s3`) and localhost as the endpoint. + +Now you should be ready to start hacking on garage! + + diff --git a/doc/book/src/getting_started.md b/doc/book/src/getting_started.md new file mode 100644 index 00000000..7d3efd8b --- /dev/null +++ b/doc/book/src/getting_started.md @@ -0,0 +1,5 @@ +# Gettin Started + +Let's start your Garage journey! +In this chapter, we explain how to deploy a simple garage cluster and start interacting with it. +Our goal is to introduce you to Garage's workflows. diff --git a/doc/book/src/getting_started/bucket.md b/doc/book/src/getting_started/bucket.md new file mode 100644 index 00000000..c3b39d8d --- /dev/null +++ b/doc/book/src/getting_started/bucket.md @@ -0,0 +1 @@ +# Create buckets and keys diff --git a/doc/book/src/getting_started/cluster.md b/doc/book/src/getting_started/cluster.md new file mode 100644 index 00000000..868379b9 --- /dev/null +++ b/doc/book/src/getting_started/cluster.md @@ -0,0 +1 @@ +# Configure a cluster diff --git a/doc/book/src/getting_started/files.md b/doc/book/src/getting_started/files.md new file mode 100644 index 00000000..c8042dd3 --- /dev/null +++ b/doc/book/src/getting_started/files.md @@ -0,0 +1 @@ +# Handle files diff --git a/doc/book/src/getting_started/install.md b/doc/book/src/getting_started/install.md new file mode 100644 index 00000000..39557f02 --- /dev/null +++ b/doc/book/src/getting_started/install.md @@ -0,0 +1,3 @@ +# Installation + + diff --git a/doc/book/src/img/logo.svg b/doc/book/src/img/logo.svg new file mode 100644 index 00000000..fb02c40b --- /dev/null +++ b/doc/book/src/img/logo.svg @@ -0,0 +1,44 @@ + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/doc/book/src/internals.md b/doc/book/src/internals.md new file mode 100644 index 00000000..e712ae07 --- /dev/null +++ b/doc/book/src/internals.md @@ -0,0 +1,158 @@ +**WARNING: this documentation is more a "design draft", which was written before Garage's actual implementation. The general principle is similar but details have not yet been updated.** + +#### Modules + +- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate +- `metadata/`: metadata management +- `blocks/`: block management, writing, GC and rebalancing +- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc) +- `api/`: S3 API +- `web/`: web management interface + +#### Metadata tables + +**Objects:** + +- *Hash key:* Bucket name (string) +- *Sort key:* Object key (string) +- *Sort key:* Version timestamp (int) +- *Sort key:* Version UUID (string) +- Complete: bool +- Inline: bool, true for objects < threshold (say 1024) +- Object size (int) +- Mime type (string) +- Data for inlined objects (blob) +- Hash of first block otherwise (string) + +*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...* + +**Blocks:** + +- *Hash key:* Version UUID (string) +- *Sort key:* Offset of block in total file (int) +- Hash of data block (string) + +A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. +We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. +We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. +Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table. + +Thus, the workflow for reading an object is as follows: + +1. Check permissions (LDAP) +2. Read entry in object table. If data is inline, we have its data, stop here. + -> if several versions, take newest one and launch deletion of old ones in background +3. Read first block from cluster. If size <= 1 block, stop here. +4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks +5. Read subsequent blocks from cluster + +Workflow for PUT: + +1. Check write permission (LDAP) +2. Select a new version UUID +3. Write a preliminary entry for the new version in the objects table with complete = false +4. Send blocks to cluster and write entries in the blocks table +5. Update the version with complete = true and all of the accurate information (size, etc) +6. Return success to the user +7. Launch a background job to check and delete older versions + +Workflow for DELETE: + +1. Check write permission (LDAP) +2. Get current version (or versions) in object table +3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME +4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table + +To delete a version: + +1. List the blocks from Cassandra +2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC. +3. Delete all of the blocks from the blocks table +4. Finally, delete the version from the objects table + +Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read. + +("Soit P un problème, on s'en fout est une solution à ce problème") + +#### Block storage on disk + +**Blocks themselves:** + +- file path = /blobs/(first 3 hex digits of hash)/(rest of hash) + +**Reverse index for GC & other block-level metadata:** + +- file path = /meta/(first 3 hex digits of hash)/(rest of hash) +- map block hash -> set of version UUIDs where it is referenced + +Usefull metadata: + +- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist +- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm + +Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content). + +Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes. + +**Internal API:** + +- get(block hash) -> ok+data/not found/corrupted +- put(block hash & data, version uuid + offset) -> ok/error +- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error +- delete(block hash, version uuid + offset) -> ok/error + +GC: when last ref is deleted, delete block. +Long GC procedure: check in Cassandra that version UUIDs still exist and references this block. + +Rebalancing: takes as argument the list of newly added nodes. + +- List all blocks that we have. For each block: +- If it hits a newly introduced node, send it to them. + Use put with no data first to check if it has to be sent to them already or not. + Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time). +- If it doesn't hit us anymore, delete it and its reference list. + +Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters. + +#### Membership management + +Two sets of nodes: + +- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% + (eviction from this set after say 30 seconds without ping) +- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), + is a CRDT using a version number for the value of the whole set + +Thus, three states for nodes: + +- healthy: in both sets +- missing: not pingable but part of desired cluster +- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks + +Membership messages between nodes: + +- ping with current state + hash of current membership info -> reply with same info +- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) +- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) +- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) + +Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. +The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). +Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) + + +#### Constants + +- Block size: around 1MB ? --> Exoscale use 16MB chunks +- Number of tokens in the hash ring: one every 10Gb of allocated storage +- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) +- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds +- Ping interval: 10 seconds +- ?? + +#### Links + +- CDC: +- Erasure coding: +- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html) +- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) diff --git a/doc/book/src/intro.md b/doc/book/src/intro.md new file mode 100644 index 00000000..5455ae71 --- /dev/null +++ b/doc/book/src/intro.md @@ -0,0 +1,62 @@ +![Garage's Logo](img/logo.svg) + +# The Garage Geo-Distributed Data Store + +Garage is a lightweight geo-distributed data store. +It comes from the observation that despite numerous object stores +many people have broken data management policies (backup/replication on a single site or none at all). +To promote better data management policies, with focused on the following desirable properties: + + - **Self-contained & lightweight**: works everywhere and integrates well in existing environments to target hyperconverged infrastructures + - **Highly resilient**: highly resilient to network failures, network latency, disk failures, sysadmin failures + - **Simple**: simple to understand, simple to operate, simple to debug + - **Internet enabled**: Made for multi-sites (eg. datacenter, offices, etc.) interconnected through a regular internet connection. + +We also noted that the pursuit of some other goals are detrimental to our initial goals. +The following have been identified has non-goals, if it matters to you, you should not use Garage: + + - **Extreme performances**: high performances constrain a lot the design and the deployment. We always prioritize + - **Feature extensiveness**: Complete implementation of the S3 API + - **Storage optimizations**: Erasure coding (our replication model is simply to copy the data as is on several nodes, in different datacenters if possible) + - **POSIX/Filesystem compatibility**: We do not aim at being POSIX compatible or to emulate any kind of filesystem. Indeed, in a distributed environment, such syncronizations are translated in network messages that impose severe constraints on the deployment. + +## Integration in environments + +Garage speaks (or will speak) the following protocols: + + - [S3](https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html) - *SUPPORTED* - Enable applications to store large blobs such as pictures, video, images, documents, etc. S3 is versatile enough to also be used to publish a static website. + - [IMAP](https://github.com/go-pluto/pluto) - *PLANNED* - email storage is quite complex to get good oerformances. +To keep performances optimals, most imap servers only support on-disk storage. +We plan to add logic to Garage to make it a viable solution for email storage. + - *More to come* + +## Use Cases + +**[Deuxfleurs](https://deuxfleurs.fr) :** Garage is used by Deuxfleurs which is a non-profit hosting organization. +Especially, it is used to host their main website, this documentation and some of its members's blogs. Additionally, +Garage is used as a [backend for Nextcloud](https://docs.nextcloud.com/server/20/admin_manual/configuration_files/primary_storage.html). Deuxfleurs also plans to use Garage as their [Matrix's media backend](https://github.com/matrix-org/synapse-s3-storage-provider) and has the backend of [OCIS](https://github.com/owncloud/ocis). + +*Are you using Garage? Open a pull request to add your organization here!* + +## Comparisons to existing software + +**[Minio](https://min.io/) :** Minio shares our *self-contained & lightweight* goal but selected two of our non-goals: *storage optimizations* through erasure coding and *POSIX/Filesystem compatibility* through strong consistency. +However, by pursuing these two non-goals, minio do not reach our desirable properties. +First, it fails on the *simple* property: due to the erasure coding, minio has severe limitations on how drives can be added or deleted from a cluster. +Second, it fails on the *interned enabled* property: due to its strong consistency, minio is latency sensitive. +Furthermore, minio has no knowledge of "sites" and thus can not distribute data to minimize the failure of a given site. + +**[Openstack Swift](https://docs.openstack.org/swift/latest/)** +OpenStack Swift at least fails on the *self-contained & lightweight* goal. +Starting it requires around 8Gb of RAM, which is too much especially in an hyperconverged infrastructure. +It seems also to be far from *Simple*. + +**[Pithos](https://github.com/exoscale/pithos)** +Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. +Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). +From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. +They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. +We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility. + +**[IPFS](https://ipfs.io/)** +*Not written yet* diff --git a/doc/book/src/quickstart_bucket.md b/doc/book/src/quickstart_bucket.md new file mode 100644 index 00000000..bd574e14 --- /dev/null +++ b/doc/book/src/quickstart_bucket.md @@ -0,0 +1,71 @@ +# Configuring a cluster + +First, chances are that your garage deployment is secured by TLS. +All your commands must be prefixed with their certificates. +I will define an alias once and for all to ease future commands. +Please adapt the path of the binary and certificates to your installation! + +``` +alias grg="/garage/garage --ca-cert /secrets/garage-ca.crt --client-cert /secrets/garage.crt --client-key /secrets/garage.key" +``` + +Now we can check that everything is going well by checking our cluster status: + +``` +grg status +``` + +Don't forget that `help` command and `--help` subcommands can help you anywhere, the CLI tool is self-documented! Two examples: + +``` +grg help +grg bucket allow --help +``` + +Fine, now let's create a bucket (we imagine that you want to deploy nextcloud): + +``` +grg bucket create nextcloud-bucket +``` + +Check that everything went well: + +``` +grg bucket list +grg bucket info nextcloud-bucket +``` + +Now we will generate an API key to access this bucket. +Note that API keys are independent of buckets: one key can access multiple buckets, multiple keys can access one bucket. + +Now, let's start by creating a key only for our PHP application: + +``` +grg key new --name nextcloud-app-key +``` + +You will have the following output (this one is fake, `key_id` and `secret_key` were generated with the openssl CLI tool): + +``` +Key { key_id: "GK3515373e4c851ebaad366558", secret_key: "7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34", name: "nextcloud-app-key", name_timestamp: 1603280506694, deleted: false, authorized_buckets: [] } +``` + +Check that everything works as intended (be careful, info works only with your key identifier and not with its friendly name!): + +``` +grg key list +grg key info GK3515373e4c851ebaad366558 +``` + +Now that we have a bucket and a key, we need to give permissions to the key on the bucket! + +``` +grg bucket allow --read --write nextcloud-bucket --key GK3515373e4c851ebaad366558 +``` + +You can check at any times allowed keys on your bucket with: + +``` +grg bucket info nextcloud-bucket +``` + diff --git a/doc/book/src/related_work.md b/doc/book/src/related_work.md new file mode 100644 index 00000000..c1a4eed4 --- /dev/null +++ b/doc/book/src/related_work.md @@ -0,0 +1,38 @@ +## Context + +Data storage is critical: it can lead to data loss if done badly and/or on hardware failure. +Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline. +Moreover, it put a hard limit on scalability. Often this limit can be pushed back far away by buying expensive machines. +But here we consider non specialized off the shelf machines that can be as low powered and subject to failures as a raspberry pi. + +Distributed storage may help to solve both availability and scalability problems on these machines. +Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide. + +## Related work + +Block storage is the most low level one, it's like exposing your raw hard drive over the network. +It requires very low latencies and stable network, that are often dedicated. +However it provides disk devices that can be manipulated by the operating system with the less constraints: it can be partitioned with any filesystem, meaning that it supports even the most exotic features. +We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel). +Openstack Cinder proxy previous solution to provide an uniform API. + +File storage provides a higher abstraction, they are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. +Often, they relax some POSIX constraints while many applications will still be compatible without any modification. +As an example, we are able to run MariaDB (very slowly) over GlusterFS... +We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc. +OpenStack Manila proxy previous solutions to provide an uniform API. + +Finally object storages provide the highest level abstraction. +They are the testimony that the POSIX filesystem API is not adapted to distributed filesystems. +Especially, the strong concistency has been dropped in favor of eventual consistency which is way more convenient and powerful in presence of high latencies and unreliability. +We often read about S3 that pioneered the concept that it's a filesystem for the WAN. +Applications must be adapted to work for the desired object storage service. +Today, the S3 HTTP REST API acts as a standard in the industry. +However, Amazon S3 source code is not open but alternatives were proposed. +We identified Minio, Pithos, Swift and Ceph. +Minio/Ceph enforces a total order, so properties similar to a (relaxed) filesystem. +Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring. +However Pithos is not maintained anymore. More precisely the company that published Pithos version 1 has developped a second version 2 but has not open sourced it. +Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution. + +There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet. diff --git a/doc/book/src/website.md b/doc/book/src/website.md new file mode 100644 index 00000000..2ea82a9a --- /dev/null +++ b/doc/book/src/website.md @@ -0,0 +1 @@ +# Host a website -- cgit v1.2.3 From c50113acf3fd61dcb77bc01bd6e9f226f813bf76 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 15:44:29 +0100 Subject: Work on structure + Getting started is reworked --- doc/book/src/SUMMARY.md | 32 +++++++-------- doc/book/src/getting_started.md | 2 +- doc/book/src/getting_started/bucket.md | 70 ++++++++++++++++++++++++++++++++ doc/book/src/getting_started/install.md | 22 ++++++++++ doc/book/src/intro.md | 43 ++++++++++++-------- doc/book/src/quickstart_bucket.md | 71 --------------------------------- doc/book/src/related_work.md | 20 +++++++++- 7 files changed, 154 insertions(+), 106 deletions(-) delete mode 100644 doc/book/src/quickstart_bucket.md (limited to 'doc') diff --git a/doc/book/src/SUMMARY.md b/doc/book/src/SUMMARY.md index 3e4618a1..5d01dbee 100644 --- a/doc/book/src/SUMMARY.md +++ b/doc/book/src/SUMMARY.md @@ -8,19 +8,19 @@ - [Create buckets and keys](./getting_started/bucket.md) - [Handle files](./getting_started/files.md) -# Cookbooks - -- [Host a website](./website.md) - -# Reference Manual - -- [S3 API](./compatibility.md) - -# Design - -- [Related Work](./related_work.md) -- [Internals](./internals.md) - -# Development - -- [Setup your environment](./devenv.md) +- [Cookbooks]() + - [Host a website](./website.md) + - [Integrate as a media backend]() + - [Operate a cluster]() + +- [Reference Manual]() + - [Garage CLI]() + - [S3 API](./compatibility.md) + +- [Design]() + - [Related Work](./related_work.md) + - [Internals](./internals.md) + +- [Development]() + - [Setup your environment](./devenv.md) + - [Your first contribution]() diff --git a/doc/book/src/getting_started.md b/doc/book/src/getting_started.md index 7d3efd8b..282f5034 100644 --- a/doc/book/src/getting_started.md +++ b/doc/book/src/getting_started.md @@ -1,4 +1,4 @@ -# Gettin Started +# Getting Started Let's start your Garage journey! In this chapter, we explain how to deploy a simple garage cluster and start interacting with it. diff --git a/doc/book/src/getting_started/bucket.md b/doc/book/src/getting_started/bucket.md index c3b39d8d..8b05ee23 100644 --- a/doc/book/src/getting_started/bucket.md +++ b/doc/book/src/getting_started/bucket.md @@ -1 +1,71 @@ # Create buckets and keys + +First, chances are that your garage deployment is secured by TLS. +All your commands must be prefixed with their certificates. +I will define an alias once and for all to ease future commands. +Please adapt the path of the binary and certificates to your installation! + +``` +alias grg="/garage/garage --ca-cert /secrets/garage-ca.crt --client-cert /secrets/garage.crt --client-key /secrets/garage.key" +``` + +Now we can check that everything is going well by checking our cluster status: + +``` +grg status +``` + +Don't forget that `help` command and `--help` subcommands can help you anywhere, the CLI tool is self-documented! Two examples: + +``` +grg help +grg bucket allow --help +``` + +Fine, now let's create a bucket (we imagine that you want to deploy nextcloud): + +``` +grg bucket create nextcloud-bucket +``` + +Check that everything went well: + +``` +grg bucket list +grg bucket info nextcloud-bucket +``` + +Now we will generate an API key to access this bucket. +Note that API keys are independent of buckets: one key can access multiple buckets, multiple keys can access one bucket. + +Now, let's start by creating a key only for our PHP application: + +``` +grg key new --name nextcloud-app-key +``` + +You will have the following output (this one is fake, `key_id` and `secret_key` were generated with the openssl CLI tool): + +``` +Key { key_id: "GK3515373e4c851ebaad366558", secret_key: "7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34", name: "nextcloud-app-key", name_timestamp: 1603280506694, deleted: false, authorized_buckets: [] } +``` + +Check that everything works as intended (be careful, info works only with your key identifier and not with its friendly name!): + +``` +grg key list +grg key info GK3515373e4c851ebaad366558 +``` + +Now that we have a bucket and a key, we need to give permissions to the key on the bucket! + +``` +grg bucket allow --read --write nextcloud-bucket --key GK3515373e4c851ebaad366558 +``` + +You can check at any times allowed keys on your bucket with: + +``` +grg bucket info nextcloud-bucket +``` + diff --git a/doc/book/src/getting_started/install.md b/doc/book/src/getting_started/install.md index 39557f02..98af1283 100644 --- a/doc/book/src/getting_started/install.md +++ b/doc/book/src/getting_started/install.md @@ -1,3 +1,25 @@ # Installation +Currently, only two installations procedures are supported for Garage: from Docker (x86\_64 for Linux) and from source. +In the future, we plan to add a third one, by publishing a compiled binary (x86\_64 for Linux). +We did not test other architecture/operating system but, as long as your architecture/operating system is supported by Rust, you should be able to run Garage (feel free to report your tests!). +## From Docker + +Garage is a software that can be run only in a cluster and requires at least 3 instances. +If you plan to run the 3 instances on your machine for test purposes, we recommend a **docker-compose** deployment. +If you have 3 independent machines (or 3 VM on independent machines) that can communite together, a **simple docker** deployment is enough. +In any case, you first need to pick a Docker image version. + +Our docker image is currently named `lxpz/garage_amd64` and is stored on the [Docker Hub](https://hub.docker.com/r/lxpz/garage_amd64/tags?page=1&ordering=last_updated). +We encourage you to use a fixed tag (eg. `v0.1.1d`) and not the `latest` tag. +For this example, we will use the latest published version at the time of the writing which is `v0.1.1d` but it's up to you +to check [the most recent versions on the Docker Hub](https://hub.docker.com/r/lxpz/garage_amd64/tags?page=1&ordering=last_updated). + +### Single machine deployment with docker-compose + + + +### Multiple machine deployment with docker + +## From source diff --git a/doc/book/src/intro.md b/doc/book/src/intro.md index 5455ae71..ec77036f 100644 --- a/doc/book/src/intro.md +++ b/doc/book/src/intro.md @@ -10,17 +10,17 @@ To promote better data management policies, with focused on the following desira - **Self-contained & lightweight**: works everywhere and integrates well in existing environments to target hyperconverged infrastructures - **Highly resilient**: highly resilient to network failures, network latency, disk failures, sysadmin failures - **Simple**: simple to understand, simple to operate, simple to debug - - **Internet enabled**: Made for multi-sites (eg. datacenter, offices, etc.) interconnected through a regular internet connection. + - **Internet enabled**: made for multi-sites (eg. datacenter, offices, etc.) interconnected through a regular internet connection. We also noted that the pursuit of some other goals are detrimental to our initial goals. The following have been identified has non-goals, if it matters to you, you should not use Garage: - - **Extreme performances**: high performances constrain a lot the design and the deployment. We always prioritize - - **Feature extensiveness**: Complete implementation of the S3 API - - **Storage optimizations**: Erasure coding (our replication model is simply to copy the data as is on several nodes, in different datacenters if possible) - - **POSIX/Filesystem compatibility**: We do not aim at being POSIX compatible or to emulate any kind of filesystem. Indeed, in a distributed environment, such syncronizations are translated in network messages that impose severe constraints on the deployment. + - **Extreme performances**: high performances constrain a lot the design and the infrastructure; we seek performances through minimalism only. + - **Feature extensiveness**: complete implementation of the S3 API or any other API to make garage a drop-in replacement is not targeted as it could lead to decisions impacting our desirable properties. + - **Storage optimizations**: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication. + - **POSIX/Filesystem compatibility**: we do not aim at being POSIX compatible or to emulate any kind of filesystem. Indeed, in a distributed environment, such syncronizations are translated in network messages that impose severe constraints on the deployment. -## Integration in environments +## Supported and planned protocols Garage speaks (or will speak) the following protocols: @@ -36,9 +36,9 @@ We plan to add logic to Garage to make it a viable solution for email storage. Especially, it is used to host their main website, this documentation and some of its members's blogs. Additionally, Garage is used as a [backend for Nextcloud](https://docs.nextcloud.com/server/20/admin_manual/configuration_files/primary_storage.html). Deuxfleurs also plans to use Garage as their [Matrix's media backend](https://github.com/matrix-org/synapse-s3-storage-provider) and has the backend of [OCIS](https://github.com/owncloud/ocis). -*Are you using Garage? Open a pull request to add your organization here!* +*Are you using Garage? [Open a pull request](https://git.deuxfleurs.fr/Deuxfleurs/garage/) to add your organization here!* -## Comparisons to existing software +## Comparison to existing software **[Minio](https://min.io/) :** Minio shares our *self-contained & lightweight* goal but selected two of our non-goals: *storage optimizations* through erasure coding and *POSIX/Filesystem compatibility* through strong consistency. However, by pursuing these two non-goals, minio do not reach our desirable properties. @@ -46,17 +46,26 @@ First, it fails on the *simple* property: due to the erasure coding, minio has s Second, it fails on the *interned enabled* property: due to its strong consistency, minio is latency sensitive. Furthermore, minio has no knowledge of "sites" and thus can not distribute data to minimize the failure of a given site. -**[Openstack Swift](https://docs.openstack.org/swift/latest/)** +**[Openstack Swift](https://docs.openstack.org/swift/latest/) :** OpenStack Swift at least fails on the *self-contained & lightweight* goal. Starting it requires around 8Gb of RAM, which is too much especially in an hyperconverged infrastructure. It seems also to be far from *Simple*. -**[Pithos](https://github.com/exoscale/pithos)** -Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. -Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). -From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. -They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. -We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility. +**[Ceph](https://ceph.io/ceph-storage/object-storage/) :** +This review holds for the whole Ceph stack, including the RADOS paper, Ceph Object Storage module, the RADOS Gateway, etc. +At is core, Ceph has been designed to provide *POSIX/Filesystem compatibility* which requires strong consistency, which in turn +makes Ceph latency sensitive and fails our *Internet enabled* goal. +Due to its industry oriented design, Ceph is also far from being *Simple* to operate and from being *self-contained & lightweight* which makes it hard to integrate it in an hyperconverged infrastructure. +In a certain way, Ceph and Minio are closer togethers than they are from Garage or OpenStack Swift. -**[IPFS](https://ipfs.io/)** -*Not written yet* +*More comparisons are available in our [Related Work](design/related_work.md) chapter.* + +## Community + +If you want to discuss with us, you can join our Matrix channel at [#garage:deuxfleurs.fr](https://matrix.to/#/#garage:deuxfleurs.fr). +Our code and our issue tracker, which is the place where you should report bugs, are managed on [Deuxfleurs' Gitea](https://git.deuxfleurs.fr/Deuxfleurs/garage). + +## License + +Garage, all the source code, is released under the [AGPL v3 License](https://www.gnu.org/licenses/agpl-3.0.en.html). +Please note that if you patch Garage and then use it to provide any service over a network, you must share your code! diff --git a/doc/book/src/quickstart_bucket.md b/doc/book/src/quickstart_bucket.md deleted file mode 100644 index bd574e14..00000000 --- a/doc/book/src/quickstart_bucket.md +++ /dev/null @@ -1,71 +0,0 @@ -# Configuring a cluster - -First, chances are that your garage deployment is secured by TLS. -All your commands must be prefixed with their certificates. -I will define an alias once and for all to ease future commands. -Please adapt the path of the binary and certificates to your installation! - -``` -alias grg="/garage/garage --ca-cert /secrets/garage-ca.crt --client-cert /secrets/garage.crt --client-key /secrets/garage.key" -``` - -Now we can check that everything is going well by checking our cluster status: - -``` -grg status -``` - -Don't forget that `help` command and `--help` subcommands can help you anywhere, the CLI tool is self-documented! Two examples: - -``` -grg help -grg bucket allow --help -``` - -Fine, now let's create a bucket (we imagine that you want to deploy nextcloud): - -``` -grg bucket create nextcloud-bucket -``` - -Check that everything went well: - -``` -grg bucket list -grg bucket info nextcloud-bucket -``` - -Now we will generate an API key to access this bucket. -Note that API keys are independent of buckets: one key can access multiple buckets, multiple keys can access one bucket. - -Now, let's start by creating a key only for our PHP application: - -``` -grg key new --name nextcloud-app-key -``` - -You will have the following output (this one is fake, `key_id` and `secret_key` were generated with the openssl CLI tool): - -``` -Key { key_id: "GK3515373e4c851ebaad366558", secret_key: "7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34", name: "nextcloud-app-key", name_timestamp: 1603280506694, deleted: false, authorized_buckets: [] } -``` - -Check that everything works as intended (be careful, info works only with your key identifier and not with its friendly name!): - -``` -grg key list -grg key info GK3515373e4c851ebaad366558 -``` - -Now that we have a bucket and a key, we need to give permissions to the key on the bucket! - -``` -grg bucket allow --read --write nextcloud-bucket --key GK3515373e4c851ebaad366558 -``` - -You can check at any times allowed keys on your bucket with: - -``` -grg bucket info nextcloud-bucket -``` - diff --git a/doc/book/src/related_work.md b/doc/book/src/related_work.md index c1a4eed4..bae4691c 100644 --- a/doc/book/src/related_work.md +++ b/doc/book/src/related_work.md @@ -1,3 +1,5 @@ +# Related Work + ## Context Data storage is critical: it can lead to data loss if done badly and/or on hardware failure. @@ -8,7 +10,7 @@ But here we consider non specialized off the shelf machines that can be as low p Distributed storage may help to solve both availability and scalability problems on these machines. Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide. -## Related work +## Overview Block storage is the most low level one, it's like exposing your raw hard drive over the network. It requires very low latencies and stable network, that are often dedicated. @@ -36,3 +38,19 @@ However Pithos is not maintained anymore. More precisely the company that publis Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution. There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet. + +## Existing software + +**[Pithos](https://github.com/exoscale/pithos) :** +Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. +Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). +From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. +They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. +We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility. + +**[IPFS](https://ipfs.io/) :** +*Not written yet* + +## Specific research papers + +*Not yet written* -- cgit v1.2.3 From 002538f92c1d9f95f2d699337f7d891c6aa0c9a4 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 16:15:18 +0100 Subject: Refactor file organization --- doc/Load_Balancing.md | 195 --------------------- doc/book/src/S3_Compatibility.md | 84 --------- doc/book/src/SUMMARY.md | 23 +-- doc/book/src/compatibility.md | 1 - doc/book/src/cookbook/index.md | 1 + doc/book/src/cookbook/website.md | 1 + doc/book/src/design/index.md | 1 + doc/book/src/design/internals.md | 158 +++++++++++++++++ doc/book/src/design/related_work.md | 56 ++++++ doc/book/src/development/devenv.md | 17 ++ doc/book/src/development/index.md | 1 + doc/book/src/devenv.md | 17 -- doc/book/src/getting_started.md | 5 - doc/book/src/getting_started/index.md | 5 + doc/book/src/internals.md | 158 ----------------- doc/book/src/intro.md | 24 +++ doc/book/src/reference_manual/index.md | 1 + doc/book/src/reference_manual/s3_compatibility.md | 84 +++++++++ doc/book/src/related_work.md | 56 ------ doc/book/src/website.md | 1 - doc/book/src/working_documents/index.md | 7 + doc/book/src/working_documents/load_balancing.md | 197 ++++++++++++++++++++++ 22 files changed, 566 insertions(+), 527 deletions(-) delete mode 100644 doc/Load_Balancing.md delete mode 100644 doc/book/src/S3_Compatibility.md delete mode 100644 doc/book/src/compatibility.md create mode 100644 doc/book/src/cookbook/index.md create mode 100644 doc/book/src/cookbook/website.md create mode 100644 doc/book/src/design/index.md create mode 100644 doc/book/src/design/internals.md create mode 100644 doc/book/src/design/related_work.md create mode 100644 doc/book/src/development/devenv.md create mode 100644 doc/book/src/development/index.md delete mode 100644 doc/book/src/devenv.md delete mode 100644 doc/book/src/getting_started.md create mode 100644 doc/book/src/getting_started/index.md delete mode 100644 doc/book/src/internals.md create mode 100644 doc/book/src/reference_manual/index.md create mode 100644 doc/book/src/reference_manual/s3_compatibility.md delete mode 100644 doc/book/src/related_work.md delete mode 100644 doc/book/src/website.md create mode 100644 doc/book/src/working_documents/index.md create mode 100644 doc/book/src/working_documents/load_balancing.md (limited to 'doc') diff --git a/doc/Load_Balancing.md b/doc/Load_Balancing.md deleted file mode 100644 index a348ebc4..00000000 --- a/doc/Load_Balancing.md +++ /dev/null @@ -1,195 +0,0 @@ -I have conducted a quick study of different methods to load-balance data over different Garage nodes using consistent hashing. - -### Requirements - -- *good balancing*: two nodes that have the same announced capacity should receive close to the same number of items - -- *multi-datacenter*: the replicas of a partition should be distributed over as many datacenters as possible - -- *minimal disruption*: when adding or removing a node, as few partitions as possible should have to move around - -- *order-agnostic*: the same set of nodes (each associated with a datacenter name - and a capacity) should always return the same distribution of partition - replicas, independently of the order in which nodes were added/removed (this - is to keep the implementation simple) - -### Methods - -#### Naive multi-DC ring walking strategy - -This strategy can be used with any ring-like algorithm to make it aware of the *multi-datacenter* requirement: - -In this method, the ring is a list of positions, each associated with a single node in the cluster. -Partitions contain all the keys between two consecutive items of the ring. -To find the nodes that store replicas of a given partition: - -- select the node for the position of the partition's lower bound -- go clockwise on the ring, skipping nodes that: - - we halve already selected - - are in a datacenter of a node we have selected, except if we already have nodes from all possible datacenters - -In this way the selected nodes will always be distributed over -`min(n_datacenters, n_replicas)` different datacenters, which is the best we -can do. - -This method was implemented in the first version of Garage, with the basic -ring construction from Dynamo DB that consists in associating `n_token` random positions to -each node (I know it's not optimal, the Dynamo paper already studies this). - -#### Better rings - -The ring construction that selects `n_token` random positions for each nodes gives a ring of positions that -is not well-balanced: the space between the tokens varies a lot, and some partitions are thus bigger than others. -This problem was demonstrated in the original Dynamo DB paper. - -To solve this, we want to apply a better second method for partitionning our dataset: - -1. fix an initially large number of partitions (say 1024) with evenly-spaced delimiters, - -2. attribute each partition randomly to a node, with a probability - proportionnal to its capacity (which `n_tokens` represented in the first - method) - -For now we continue using the multi-DC ring walking described above. - -I have studied two ways to do the attribution of partitions to nodes, in a way that is deterministic: - -- Min-hash: for each partition, select node that minimizes `hash(node, partition_number)` -- MagLev: see [here](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/) - -MagLev provided significantly better balancing, as it guarantees that the exact -same number of partitions is attributed to all nodes that have the same -capacity (and that this number is proportionnal to the node's capacity, except -for large values), however in both cases: - -- the distribution is still bad, because we use the naive multi-DC ring walking - that behaves strangely due to interactions between consecutive positions on - the ring - -- the disruption in case of adding/removing a node is not as low as it can be, - as we show with the following method. - -A quick description of MagLev (backend = node, lookup table = ring): - -> The basic idea of Maglev hashing is to assign a preference list of all the -> lookup table positions to each backend. Then all the backends take turns -> filling their most-preferred table positions that are still empty, until the -> lookup table is completely filled in. Hence, Maglev hashing gives an almost -> equal share of the lookup table to each of the backends. Heterogeneous -> backend weights can be achieved by altering the relative frequency of the -> backends’ turns… - -Here are some stats (run `scripts/simulate_ring.py` to reproduce): - -``` -##### Custom-ring (min-hash) ##### - -#partitions per node (capacity in parenthesis): -- datura (8) : 227 -- digitale (8) : 351 -- drosera (8) : 259 -- geant (16) : 476 -- gipsie (16) : 410 -- io (16) : 495 -- isou (8) : 231 -- mini (4) : 149 -- mixi (4) : 188 -- modi (4) : 127 -- moxi (4) : 159 - -Variance of load distribution for load normalized to intra-class mean -(a class being the set of nodes with the same announced capacity): 2.18% <-- REALLY BAD - -Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): -removing atuin digitale : 63.09% 30.18% 6.64% 0.10% -removing atuin drosera : 72.36% 23.44% 4.10% 0.10% -removing atuin datura : 73.24% 21.48% 5.18% 0.10% -removing jupiter io : 48.34% 38.48% 12.30% 0.88% -removing jupiter isou : 74.12% 19.73% 6.05% 0.10% -removing grog mini : 84.47% 12.40% 2.93% 0.20% -removing grog mixi : 80.76% 16.60% 2.64% 0.00% -removing grog moxi : 83.59% 14.06% 2.34% 0.00% -removing grog modi : 87.01% 11.43% 1.46% 0.10% -removing grisou geant : 48.24% 37.40% 13.67% 0.68% -removing grisou gipsie : 53.03% 33.59% 13.09% 0.29% -on average: 69.84% 23.53% 6.40% 0.23% <-- COULD BE BETTER - --------- - -##### MagLev ##### - -#partitions per node: -- datura (8) : 273 -- digitale (8) : 256 -- drosera (8) : 267 -- geant (16) : 452 -- gipsie (16) : 427 -- io (16) : 483 -- isou (8) : 272 -- mini (4) : 184 -- mixi (4) : 160 -- modi (4) : 144 -- moxi (4) : 154 - -Variance of load distribution: 0.37% <-- Already much better, but not optimal - -Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): -removing atuin digitale : 62.60% 29.20% 7.91% 0.29% -removing atuin drosera : 65.92% 26.56% 7.23% 0.29% -removing atuin datura : 63.96% 27.83% 7.71% 0.49% -removing jupiter io : 44.63% 40.33% 14.06% 0.98% -removing jupiter isou : 63.38% 27.25% 8.98% 0.39% -removing grog mini : 72.46% 21.00% 6.35% 0.20% -removing grog mixi : 72.95% 22.46% 4.39% 0.20% -removing grog moxi : 74.22% 20.61% 4.98% 0.20% -removing grog modi : 75.98% 18.36% 5.27% 0.39% -removing grisou geant : 46.97% 36.62% 15.04% 1.37% -removing grisou gipsie : 49.22% 36.52% 12.79% 1.46% -on average: 62.94% 27.89% 8.61% 0.57% <-- WORSE THAN PREVIOUSLY -``` - -#### The magical solution: multi-DC aware MagLev - -Suppose we want to select three replicas for each partition (this is what we do in our simulation and in most Garage deployments). -We apply MagLev three times consecutively, one for each replica selection. -The first time is pretty much the same as normal MagLev, but for the following times, when a node runs through its preference -list to select a partition to replicate, we skip partitions for which adding this node would not bring datacenter-diversity. -More precisely, we skip a partition in the preference list if: - -- the node already replicates the partition (from one of the previous rounds of MagLev) -- the node is in a datacenter where a node already replicates the partition and there are other datacenters available - -Refer to `method4` in the simulation script for a formal definition. - -``` -##### Multi-DC aware MagLev ##### - -#partitions per node: -- datura (8) : 268 <-- NODES WITH THE SAME CAPACITY -- digitale (8) : 267 HAVE THE SAME NUM OF PARTITIONS -- drosera (8) : 267 (+- 1) -- geant (16) : 470 -- gipsie (16) : 472 -- io (16) : 516 -- isou (8) : 268 -- mini (4) : 136 -- mixi (4) : 136 -- modi (4) : 136 -- moxi (4) : 136 - -Variance of load distribution: 0.06% <-- CAN'T DO BETTER THAN THIS - -Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): -removing atuin digitale : 65.72% 33.01% 1.27% 0.00% -removing atuin drosera : 64.65% 33.89% 1.37% 0.10% -removing atuin datura : 66.11% 32.62% 1.27% 0.00% -removing jupiter io : 42.97% 53.42% 3.61% 0.00% -removing jupiter isou : 66.11% 32.32% 1.56% 0.00% -removing grog mini : 80.47% 18.85% 0.68% 0.00% -removing grog mixi : 80.27% 18.85% 0.88% 0.00% -removing grog moxi : 80.18% 19.04% 0.78% 0.00% -removing grog modi : 79.69% 19.92% 0.39% 0.00% -removing grisou geant : 44.63% 52.15% 3.22% 0.00% -removing grisou gipsie : 43.55% 52.54% 3.91% 0.00% -on average: 64.94% 33.33% 1.72% 0.01% <-- VERY GOOD (VERY LOW VALUES FOR 2 AND 3 NODES) -``` diff --git a/doc/book/src/S3_Compatibility.md b/doc/book/src/S3_Compatibility.md deleted file mode 100644 index c0fc2863..00000000 --- a/doc/book/src/S3_Compatibility.md +++ /dev/null @@ -1,84 +0,0 @@ -## S3 Compatibility status - -### Global S3 features - -Implemented: - -- path-style URLs (`garage.tld/bucket/key`) -- putting and getting objects in buckets -- multipart uploads -- listing objects -- access control on a per-key-per-bucket basis - -Not implemented: - -- vhost-style URLs (`bucket.garage.tld/key`) -- object-level ACL -- encryption -- most `x-amz-` headers - - -### Endpoint implementation - -All APIs that are not mentionned are not implemented and will return a 400 bad request. - -#### AbortMultipartUpload - -Implemented. - -#### CompleteMultipartUpload - -Implemented badly. Garage will not check that all the parts stored correspond to the list given by the client in the request body. This means that the multipart upload might be completed with an invalid size. This is a bug and will be fixed. - -#### CopyObject - -Implemented. - -#### CreateBucket - -Garage does not accept creating buckets or giving access using API calls, it has to be done using the CLI tools. CreateBucket will return a 200 if the bucket exists and user has write access, and a 403 Forbidden in all other cases. - -#### CreateMultipartUpload - -Implemented. - -#### DeleteBucket - -Garage does not accept deleting buckets using API calls, it has to be done using the CLI tools. This request will return a 403 Forbidden. - -#### DeleteObject - -Implemented. - -#### DeleteObjects - -Implemented. - -#### GetObject - -Implemented. - -#### HeadBucket - -Implemented. - -#### HeadObject - -Implemented. - -#### ListObjects - -Implemented, but there isn't a very good specification of what `encoding-type=url` covers so there might be some encoding bugs. In our implementation the url-encoded fields are in the same in ListObjects as they are in ListObjectsV2. - -#### ListObjectsV2 - -Implemented. - -#### PutObject - -Implemented. - -#### UploadPart - -Implemented. - diff --git a/doc/book/src/SUMMARY.md b/doc/book/src/SUMMARY.md index 5d01dbee..7697948b 100644 --- a/doc/book/src/SUMMARY.md +++ b/doc/book/src/SUMMARY.md @@ -2,25 +2,28 @@ [The Garage Data Store](./intro.md) -- [Getting Started](./getting_started.md) +- [Getting Started](./getting_started/index.md) - [Installation](./getting_started/install.md) - [Configure a cluster](./getting_started/cluster.md) - [Create buckets and keys](./getting_started/bucket.md) - [Handle files](./getting_started/files.md) -- [Cookbooks]() - - [Host a website](./website.md) +- [Cookbook](./cookbook/index.md) + - [Host a website](./cookbook/website.md) - [Integrate as a media backend]() - [Operate a cluster]() -- [Reference Manual]() +- [Reference Manual](./reference_manual/index.md) - [Garage CLI]() - - [S3 API](./compatibility.md) + - [S3 API](./reference_manual/s3_compatibility.md) -- [Design]() - - [Related Work](./related_work.md) - - [Internals](./internals.md) +- [Design](./design/index.md) + - [Related Work](./design/related_work.md) + - [Internals](./design/internals.md) -- [Development]() - - [Setup your environment](./devenv.md) +- [Development](./development/index.md) + - [Setup your environment](./development/devenv.md) - [Your first contribution]() + +- [Working Documents](./working_documents/index.md) + - [Load Balancing Data](./working_documents/load_balancing.md) diff --git a/doc/book/src/compatibility.md b/doc/book/src/compatibility.md deleted file mode 100644 index acf9968b..00000000 --- a/doc/book/src/compatibility.md +++ /dev/null @@ -1 +0,0 @@ -# S3 API diff --git a/doc/book/src/cookbook/index.md b/doc/book/src/cookbook/index.md new file mode 100644 index 00000000..741ecbe7 --- /dev/null +++ b/doc/book/src/cookbook/index.md @@ -0,0 +1 @@ +# Cookbook diff --git a/doc/book/src/cookbook/website.md b/doc/book/src/cookbook/website.md new file mode 100644 index 00000000..2ea82a9a --- /dev/null +++ b/doc/book/src/cookbook/website.md @@ -0,0 +1 @@ +# Host a website diff --git a/doc/book/src/design/index.md b/doc/book/src/design/index.md new file mode 100644 index 00000000..3d14cb7c --- /dev/null +++ b/doc/book/src/design/index.md @@ -0,0 +1 @@ +# Design diff --git a/doc/book/src/design/internals.md b/doc/book/src/design/internals.md new file mode 100644 index 00000000..e712ae07 --- /dev/null +++ b/doc/book/src/design/internals.md @@ -0,0 +1,158 @@ +**WARNING: this documentation is more a "design draft", which was written before Garage's actual implementation. The general principle is similar but details have not yet been updated.** + +#### Modules + +- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate +- `metadata/`: metadata management +- `blocks/`: block management, writing, GC and rebalancing +- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc) +- `api/`: S3 API +- `web/`: web management interface + +#### Metadata tables + +**Objects:** + +- *Hash key:* Bucket name (string) +- *Sort key:* Object key (string) +- *Sort key:* Version timestamp (int) +- *Sort key:* Version UUID (string) +- Complete: bool +- Inline: bool, true for objects < threshold (say 1024) +- Object size (int) +- Mime type (string) +- Data for inlined objects (blob) +- Hash of first block otherwise (string) + +*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...* + +**Blocks:** + +- *Hash key:* Version UUID (string) +- *Sort key:* Offset of block in total file (int) +- Hash of data block (string) + +A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. +We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. +We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. +Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table. + +Thus, the workflow for reading an object is as follows: + +1. Check permissions (LDAP) +2. Read entry in object table. If data is inline, we have its data, stop here. + -> if several versions, take newest one and launch deletion of old ones in background +3. Read first block from cluster. If size <= 1 block, stop here. +4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks +5. Read subsequent blocks from cluster + +Workflow for PUT: + +1. Check write permission (LDAP) +2. Select a new version UUID +3. Write a preliminary entry for the new version in the objects table with complete = false +4. Send blocks to cluster and write entries in the blocks table +5. Update the version with complete = true and all of the accurate information (size, etc) +6. Return success to the user +7. Launch a background job to check and delete older versions + +Workflow for DELETE: + +1. Check write permission (LDAP) +2. Get current version (or versions) in object table +3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME +4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table + +To delete a version: + +1. List the blocks from Cassandra +2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC. +3. Delete all of the blocks from the blocks table +4. Finally, delete the version from the objects table + +Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read. + +("Soit P un problème, on s'en fout est une solution à ce problème") + +#### Block storage on disk + +**Blocks themselves:** + +- file path = /blobs/(first 3 hex digits of hash)/(rest of hash) + +**Reverse index for GC & other block-level metadata:** + +- file path = /meta/(first 3 hex digits of hash)/(rest of hash) +- map block hash -> set of version UUIDs where it is referenced + +Usefull metadata: + +- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist +- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm + +Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content). + +Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes. + +**Internal API:** + +- get(block hash) -> ok+data/not found/corrupted +- put(block hash & data, version uuid + offset) -> ok/error +- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error +- delete(block hash, version uuid + offset) -> ok/error + +GC: when last ref is deleted, delete block. +Long GC procedure: check in Cassandra that version UUIDs still exist and references this block. + +Rebalancing: takes as argument the list of newly added nodes. + +- List all blocks that we have. For each block: +- If it hits a newly introduced node, send it to them. + Use put with no data first to check if it has to be sent to them already or not. + Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time). +- If it doesn't hit us anymore, delete it and its reference list. + +Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters. + +#### Membership management + +Two sets of nodes: + +- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% + (eviction from this set after say 30 seconds without ping) +- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), + is a CRDT using a version number for the value of the whole set + +Thus, three states for nodes: + +- healthy: in both sets +- missing: not pingable but part of desired cluster +- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks + +Membership messages between nodes: + +- ping with current state + hash of current membership info -> reply with same info +- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) +- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) +- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) + +Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. +The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). +Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) + + +#### Constants + +- Block size: around 1MB ? --> Exoscale use 16MB chunks +- Number of tokens in the hash ring: one every 10Gb of allocated storage +- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) +- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds +- Ping interval: 10 seconds +- ?? + +#### Links + +- CDC: +- Erasure coding: +- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html) +- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) diff --git a/doc/book/src/design/related_work.md b/doc/book/src/design/related_work.md new file mode 100644 index 00000000..bae4691c --- /dev/null +++ b/doc/book/src/design/related_work.md @@ -0,0 +1,56 @@ +# Related Work + +## Context + +Data storage is critical: it can lead to data loss if done badly and/or on hardware failure. +Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline. +Moreover, it put a hard limit on scalability. Often this limit can be pushed back far away by buying expensive machines. +But here we consider non specialized off the shelf machines that can be as low powered and subject to failures as a raspberry pi. + +Distributed storage may help to solve both availability and scalability problems on these machines. +Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide. + +## Overview + +Block storage is the most low level one, it's like exposing your raw hard drive over the network. +It requires very low latencies and stable network, that are often dedicated. +However it provides disk devices that can be manipulated by the operating system with the less constraints: it can be partitioned with any filesystem, meaning that it supports even the most exotic features. +We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel). +Openstack Cinder proxy previous solution to provide an uniform API. + +File storage provides a higher abstraction, they are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. +Often, they relax some POSIX constraints while many applications will still be compatible without any modification. +As an example, we are able to run MariaDB (very slowly) over GlusterFS... +We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc. +OpenStack Manila proxy previous solutions to provide an uniform API. + +Finally object storages provide the highest level abstraction. +They are the testimony that the POSIX filesystem API is not adapted to distributed filesystems. +Especially, the strong concistency has been dropped in favor of eventual consistency which is way more convenient and powerful in presence of high latencies and unreliability. +We often read about S3 that pioneered the concept that it's a filesystem for the WAN. +Applications must be adapted to work for the desired object storage service. +Today, the S3 HTTP REST API acts as a standard in the industry. +However, Amazon S3 source code is not open but alternatives were proposed. +We identified Minio, Pithos, Swift and Ceph. +Minio/Ceph enforces a total order, so properties similar to a (relaxed) filesystem. +Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring. +However Pithos is not maintained anymore. More precisely the company that published Pithos version 1 has developped a second version 2 but has not open sourced it. +Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution. + +There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet. + +## Existing software + +**[Pithos](https://github.com/exoscale/pithos) :** +Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. +Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). +From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. +They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. +We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility. + +**[IPFS](https://ipfs.io/) :** +*Not written yet* + +## Specific research papers + +*Not yet written* diff --git a/doc/book/src/development/devenv.md b/doc/book/src/development/devenv.md new file mode 100644 index 00000000..6cb7c554 --- /dev/null +++ b/doc/book/src/development/devenv.md @@ -0,0 +1,17 @@ +# Setup your development environment + +We propose the following quickstart to setup a full dev. environment as quickly as possible: + + 1. Setup a rust/cargo environment. eg. `dnf install rust cargo` + 2. Install awscli v2 by following the guide [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html). + 3. Run `cargo build` to build the project + 4. Run `./script/dev-cluster.sh` to launch a test cluster (feel free to read the script) + 5. Run `./script/dev-configure.sh` to configure your test cluster with default values (same datacenter, 100 tokens) + 6. Run `./script/dev-bucket.sh` to create a bucket named `eprouvette` and an API key that will be stored in `/tmp/garage.s3` + 7. Run `source ./script/dev-env-aws.sh` to configure your CLI environment + 8. You can use `garage` to manage the cluster. Try `garage --help`. + 9. You can use the `awsgrg` alias to add, remove, and delete files. Try `awsgrg help`, `awsgrg cp /proc/cpuinfo s3://eprouvette/cpuinfo.txt`, or `awsgrg ls s3://eprouvette`. `awsgrg` is a wrapper on the `aws s3` command pre-configured with the previously generated API key (the one in `/tmp/garage.s3`) and localhost as the endpoint. + +Now you should be ready to start hacking on garage! + + diff --git a/doc/book/src/development/index.md b/doc/book/src/development/index.md new file mode 100644 index 00000000..459110d3 --- /dev/null +++ b/doc/book/src/development/index.md @@ -0,0 +1 @@ +# Development diff --git a/doc/book/src/devenv.md b/doc/book/src/devenv.md deleted file mode 100644 index 6cb7c554..00000000 --- a/doc/book/src/devenv.md +++ /dev/null @@ -1,17 +0,0 @@ -# Setup your development environment - -We propose the following quickstart to setup a full dev. environment as quickly as possible: - - 1. Setup a rust/cargo environment. eg. `dnf install rust cargo` - 2. Install awscli v2 by following the guide [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html). - 3. Run `cargo build` to build the project - 4. Run `./script/dev-cluster.sh` to launch a test cluster (feel free to read the script) - 5. Run `./script/dev-configure.sh` to configure your test cluster with default values (same datacenter, 100 tokens) - 6. Run `./script/dev-bucket.sh` to create a bucket named `eprouvette` and an API key that will be stored in `/tmp/garage.s3` - 7. Run `source ./script/dev-env-aws.sh` to configure your CLI environment - 8. You can use `garage` to manage the cluster. Try `garage --help`. - 9. You can use the `awsgrg` alias to add, remove, and delete files. Try `awsgrg help`, `awsgrg cp /proc/cpuinfo s3://eprouvette/cpuinfo.txt`, or `awsgrg ls s3://eprouvette`. `awsgrg` is a wrapper on the `aws s3` command pre-configured with the previously generated API key (the one in `/tmp/garage.s3`) and localhost as the endpoint. - -Now you should be ready to start hacking on garage! - - diff --git a/doc/book/src/getting_started.md b/doc/book/src/getting_started.md deleted file mode 100644 index 282f5034..00000000 --- a/doc/book/src/getting_started.md +++ /dev/null @@ -1,5 +0,0 @@ -# Getting Started - -Let's start your Garage journey! -In this chapter, we explain how to deploy a simple garage cluster and start interacting with it. -Our goal is to introduce you to Garage's workflows. diff --git a/doc/book/src/getting_started/index.md b/doc/book/src/getting_started/index.md new file mode 100644 index 00000000..282f5034 --- /dev/null +++ b/doc/book/src/getting_started/index.md @@ -0,0 +1,5 @@ +# Getting Started + +Let's start your Garage journey! +In this chapter, we explain how to deploy a simple garage cluster and start interacting with it. +Our goal is to introduce you to Garage's workflows. diff --git a/doc/book/src/internals.md b/doc/book/src/internals.md deleted file mode 100644 index e712ae07..00000000 --- a/doc/book/src/internals.md +++ /dev/null @@ -1,158 +0,0 @@ -**WARNING: this documentation is more a "design draft", which was written before Garage's actual implementation. The general principle is similar but details have not yet been updated.** - -#### Modules - -- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate -- `metadata/`: metadata management -- `blocks/`: block management, writing, GC and rebalancing -- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc) -- `api/`: S3 API -- `web/`: web management interface - -#### Metadata tables - -**Objects:** - -- *Hash key:* Bucket name (string) -- *Sort key:* Object key (string) -- *Sort key:* Version timestamp (int) -- *Sort key:* Version UUID (string) -- Complete: bool -- Inline: bool, true for objects < threshold (say 1024) -- Object size (int) -- Mime type (string) -- Data for inlined objects (blob) -- Hash of first block otherwise (string) - -*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...* - -**Blocks:** - -- *Hash key:* Version UUID (string) -- *Sort key:* Offset of block in total file (int) -- Hash of data block (string) - -A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. -We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. -We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. -Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table. - -Thus, the workflow for reading an object is as follows: - -1. Check permissions (LDAP) -2. Read entry in object table. If data is inline, we have its data, stop here. - -> if several versions, take newest one and launch deletion of old ones in background -3. Read first block from cluster. If size <= 1 block, stop here. -4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks -5. Read subsequent blocks from cluster - -Workflow for PUT: - -1. Check write permission (LDAP) -2. Select a new version UUID -3. Write a preliminary entry for the new version in the objects table with complete = false -4. Send blocks to cluster and write entries in the blocks table -5. Update the version with complete = true and all of the accurate information (size, etc) -6. Return success to the user -7. Launch a background job to check and delete older versions - -Workflow for DELETE: - -1. Check write permission (LDAP) -2. Get current version (or versions) in object table -3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME -4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table - -To delete a version: - -1. List the blocks from Cassandra -2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC. -3. Delete all of the blocks from the blocks table -4. Finally, delete the version from the objects table - -Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read. - -("Soit P un problème, on s'en fout est une solution à ce problème") - -#### Block storage on disk - -**Blocks themselves:** - -- file path = /blobs/(first 3 hex digits of hash)/(rest of hash) - -**Reverse index for GC & other block-level metadata:** - -- file path = /meta/(first 3 hex digits of hash)/(rest of hash) -- map block hash -> set of version UUIDs where it is referenced - -Usefull metadata: - -- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist -- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm - -Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content). - -Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes. - -**Internal API:** - -- get(block hash) -> ok+data/not found/corrupted -- put(block hash & data, version uuid + offset) -> ok/error -- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error -- delete(block hash, version uuid + offset) -> ok/error - -GC: when last ref is deleted, delete block. -Long GC procedure: check in Cassandra that version UUIDs still exist and references this block. - -Rebalancing: takes as argument the list of newly added nodes. - -- List all blocks that we have. For each block: -- If it hits a newly introduced node, send it to them. - Use put with no data first to check if it has to be sent to them already or not. - Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time). -- If it doesn't hit us anymore, delete it and its reference list. - -Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters. - -#### Membership management - -Two sets of nodes: - -- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% - (eviction from this set after say 30 seconds without ping) -- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), - is a CRDT using a version number for the value of the whole set - -Thus, three states for nodes: - -- healthy: in both sets -- missing: not pingable but part of desired cluster -- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks - -Membership messages between nodes: - -- ping with current state + hash of current membership info -> reply with same info -- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip) -- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast) -- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast) - -Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable. -The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). -Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface) - - -#### Constants - -- Block size: around 1MB ? --> Exoscale use 16MB chunks -- Number of tokens in the hash ring: one every 10Gb of allocated storage -- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?) -- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds -- Ping interval: 10 seconds -- ?? - -#### Links - -- CDC: -- Erasure coding: -- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html) -- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) diff --git a/doc/book/src/intro.md b/doc/book/src/intro.md index ec77036f..02920f83 100644 --- a/doc/book/src/intro.md +++ b/doc/book/src/intro.md @@ -60,6 +60,30 @@ In a certain way, Ceph and Minio are closer togethers than they are from Garage *More comparisons are available in our [Related Work](design/related_work.md) chapter.* +## Other Resources + +This website is not the only source of information about Garage! +We reference here other places on the Internet where you can learn more about Garage. + +### Rust API (docs.rs) + +If you encounter a specific bug in Garage or plan to patch it, you may jump directly to the source code documentation! + + - [garage\_api](https://docs.rs/garage_api/latest/garage_api/) - contains the S3 standard API endpoint + - [garage\_model](https://docs.rs/garage_model/latest/garage_model/) - contains Garage's model built on the table abstraction + - [garage\_rpc](https://docs.rs/garage_rpc/latest/garage_rpc/) - contains Garage's federation protocol + - [garage\_table](https://docs.rs/garage_table/latest/garage_table/) - contains core Garage's CRDT datatypes + - [garage\_util](https://docs.rs/garage_util/latest/garage_util/) - contains garage entrypoints (daemon, cli) + - [garage\_web](https://docs.rs/garage_web/latest/garage_web/) - contains the S3 website endpoint + +### Talks + +We love to talk and hear about Garage, that's why we keep a log here: + + - [(fr, 2020-12-02) Garage : jouer dans la cour des grands quand on est un hébergeur associatif](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/master/doc/20201202_talk/talk.pdf) + +*Did you write or talk about Garage? [Open a pull request](https://git.deuxfleurs.fr/Deuxfleurs/garage/) to add a link here!* + ## Community If you want to discuss with us, you can join our Matrix channel at [#garage:deuxfleurs.fr](https://matrix.to/#/#garage:deuxfleurs.fr). diff --git a/doc/book/src/reference_manual/index.md b/doc/book/src/reference_manual/index.md new file mode 100644 index 00000000..a2f380f6 --- /dev/null +++ b/doc/book/src/reference_manual/index.md @@ -0,0 +1 @@ +# Reference Manual diff --git a/doc/book/src/reference_manual/s3_compatibility.md b/doc/book/src/reference_manual/s3_compatibility.md new file mode 100644 index 00000000..c0fc2863 --- /dev/null +++ b/doc/book/src/reference_manual/s3_compatibility.md @@ -0,0 +1,84 @@ +## S3 Compatibility status + +### Global S3 features + +Implemented: + +- path-style URLs (`garage.tld/bucket/key`) +- putting and getting objects in buckets +- multipart uploads +- listing objects +- access control on a per-key-per-bucket basis + +Not implemented: + +- vhost-style URLs (`bucket.garage.tld/key`) +- object-level ACL +- encryption +- most `x-amz-` headers + + +### Endpoint implementation + +All APIs that are not mentionned are not implemented and will return a 400 bad request. + +#### AbortMultipartUpload + +Implemented. + +#### CompleteMultipartUpload + +Implemented badly. Garage will not check that all the parts stored correspond to the list given by the client in the request body. This means that the multipart upload might be completed with an invalid size. This is a bug and will be fixed. + +#### CopyObject + +Implemented. + +#### CreateBucket + +Garage does not accept creating buckets or giving access using API calls, it has to be done using the CLI tools. CreateBucket will return a 200 if the bucket exists and user has write access, and a 403 Forbidden in all other cases. + +#### CreateMultipartUpload + +Implemented. + +#### DeleteBucket + +Garage does not accept deleting buckets using API calls, it has to be done using the CLI tools. This request will return a 403 Forbidden. + +#### DeleteObject + +Implemented. + +#### DeleteObjects + +Implemented. + +#### GetObject + +Implemented. + +#### HeadBucket + +Implemented. + +#### HeadObject + +Implemented. + +#### ListObjects + +Implemented, but there isn't a very good specification of what `encoding-type=url` covers so there might be some encoding bugs. In our implementation the url-encoded fields are in the same in ListObjects as they are in ListObjectsV2. + +#### ListObjectsV2 + +Implemented. + +#### PutObject + +Implemented. + +#### UploadPart + +Implemented. + diff --git a/doc/book/src/related_work.md b/doc/book/src/related_work.md deleted file mode 100644 index bae4691c..00000000 --- a/doc/book/src/related_work.md +++ /dev/null @@ -1,56 +0,0 @@ -# Related Work - -## Context - -Data storage is critical: it can lead to data loss if done badly and/or on hardware failure. -Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline. -Moreover, it put a hard limit on scalability. Often this limit can be pushed back far away by buying expensive machines. -But here we consider non specialized off the shelf machines that can be as low powered and subject to failures as a raspberry pi. - -Distributed storage may help to solve both availability and scalability problems on these machines. -Many solutions were proposed, they can be categorized as block storage, file storage and object storage depending on the abstraction they provide. - -## Overview - -Block storage is the most low level one, it's like exposing your raw hard drive over the network. -It requires very low latencies and stable network, that are often dedicated. -However it provides disk devices that can be manipulated by the operating system with the less constraints: it can be partitioned with any filesystem, meaning that it supports even the most exotic features. -We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel). -Openstack Cinder proxy previous solution to provide an uniform API. - -File storage provides a higher abstraction, they are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. -Often, they relax some POSIX constraints while many applications will still be compatible without any modification. -As an example, we are able to run MariaDB (very slowly) over GlusterFS... -We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc. -OpenStack Manila proxy previous solutions to provide an uniform API. - -Finally object storages provide the highest level abstraction. -They are the testimony that the POSIX filesystem API is not adapted to distributed filesystems. -Especially, the strong concistency has been dropped in favor of eventual consistency which is way more convenient and powerful in presence of high latencies and unreliability. -We often read about S3 that pioneered the concept that it's a filesystem for the WAN. -Applications must be adapted to work for the desired object storage service. -Today, the S3 HTTP REST API acts as a standard in the industry. -However, Amazon S3 source code is not open but alternatives were proposed. -We identified Minio, Pithos, Swift and Ceph. -Minio/Ceph enforces a total order, so properties similar to a (relaxed) filesystem. -Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring. -However Pithos is not maintained anymore. More precisely the company that published Pithos version 1 has developped a second version 2 but has not open sourced it. -Some tests conducted by the [ACIDES project](https://acides.org/) have shown that Openstack Swift consumes way more resources (CPU+RAM) that we can afford. Furthermore, people developing Swift have not designed their software for geo-distribution. - -There were many attempts in research too. I am only thinking to [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) that was used as a basis for Seafile. But none of them have been effectively implemented yet. - -## Existing software - -**[Pithos](https://github.com/exoscale/pithos) :** -Pithos has been abandonned and should probably not used yet, in the following we explain why we did not pick their design. -Pithos was relying as a S3 proxy in front of Cassandra (and was working with Scylla DB too). -From its designers' mouth, storing data in Cassandra has shown its limitations justifying the project abandonment. -They built a closed-source version 2 that does not store blobs in the database (only metadata) but did not communicate further on it. -We considered there v2's design but concluded that it does not fit both our *Self-contained & lightweight* and *Simple* properties. It makes the development, the deployment and the operations more complicated while reducing the flexibility. - -**[IPFS](https://ipfs.io/) :** -*Not written yet* - -## Specific research papers - -*Not yet written* diff --git a/doc/book/src/website.md b/doc/book/src/website.md deleted file mode 100644 index 2ea82a9a..00000000 --- a/doc/book/src/website.md +++ /dev/null @@ -1 +0,0 @@ -# Host a website diff --git a/doc/book/src/working_documents/index.md b/doc/book/src/working_documents/index.md new file mode 100644 index 00000000..6eb7eccb --- /dev/null +++ b/doc/book/src/working_documents/index.md @@ -0,0 +1,7 @@ +# Working Documents + +Working documents are documents that reflect the fact that Garage is a software that evolves quickly. +They are a way to communicate our ideas, our changes, and so on. + +Ideally, while the feature/patch has been merged, the working document should serve as a source to +update the rest of the documentation. diff --git a/doc/book/src/working_documents/load_balancing.md b/doc/book/src/working_documents/load_balancing.md new file mode 100644 index 00000000..f0f1a4d4 --- /dev/null +++ b/doc/book/src/working_documents/load_balancing.md @@ -0,0 +1,197 @@ +## Load Balancing Data + +I have conducted a quick study of different methods to load-balance data over different Garage nodes using consistent hashing. + +### Requirements + +- *good balancing*: two nodes that have the same announced capacity should receive close to the same number of items + +- *multi-datacenter*: the replicas of a partition should be distributed over as many datacenters as possible + +- *minimal disruption*: when adding or removing a node, as few partitions as possible should have to move around + +- *order-agnostic*: the same set of nodes (each associated with a datacenter name + and a capacity) should always return the same distribution of partition + replicas, independently of the order in which nodes were added/removed (this + is to keep the implementation simple) + +### Methods + +#### Naive multi-DC ring walking strategy + +This strategy can be used with any ring-like algorithm to make it aware of the *multi-datacenter* requirement: + +In this method, the ring is a list of positions, each associated with a single node in the cluster. +Partitions contain all the keys between two consecutive items of the ring. +To find the nodes that store replicas of a given partition: + +- select the node for the position of the partition's lower bound +- go clockwise on the ring, skipping nodes that: + - we halve already selected + - are in a datacenter of a node we have selected, except if we already have nodes from all possible datacenters + +In this way the selected nodes will always be distributed over +`min(n_datacenters, n_replicas)` different datacenters, which is the best we +can do. + +This method was implemented in the first version of Garage, with the basic +ring construction from Dynamo DB that consists in associating `n_token` random positions to +each node (I know it's not optimal, the Dynamo paper already studies this). + +#### Better rings + +The ring construction that selects `n_token` random positions for each nodes gives a ring of positions that +is not well-balanced: the space between the tokens varies a lot, and some partitions are thus bigger than others. +This problem was demonstrated in the original Dynamo DB paper. + +To solve this, we want to apply a better second method for partitionning our dataset: + +1. fix an initially large number of partitions (say 1024) with evenly-spaced delimiters, + +2. attribute each partition randomly to a node, with a probability + proportionnal to its capacity (which `n_tokens` represented in the first + method) + +For now we continue using the multi-DC ring walking described above. + +I have studied two ways to do the attribution of partitions to nodes, in a way that is deterministic: + +- Min-hash: for each partition, select node that minimizes `hash(node, partition_number)` +- MagLev: see [here](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/) + +MagLev provided significantly better balancing, as it guarantees that the exact +same number of partitions is attributed to all nodes that have the same +capacity (and that this number is proportionnal to the node's capacity, except +for large values), however in both cases: + +- the distribution is still bad, because we use the naive multi-DC ring walking + that behaves strangely due to interactions between consecutive positions on + the ring + +- the disruption in case of adding/removing a node is not as low as it can be, + as we show with the following method. + +A quick description of MagLev (backend = node, lookup table = ring): + +> The basic idea of Maglev hashing is to assign a preference list of all the +> lookup table positions to each backend. Then all the backends take turns +> filling their most-preferred table positions that are still empty, until the +> lookup table is completely filled in. Hence, Maglev hashing gives an almost +> equal share of the lookup table to each of the backends. Heterogeneous +> backend weights can be achieved by altering the relative frequency of the +> backends’ turns… + +Here are some stats (run `scripts/simulate_ring.py` to reproduce): + +``` +##### Custom-ring (min-hash) ##### + +#partitions per node (capacity in parenthesis): +- datura (8) : 227 +- digitale (8) : 351 +- drosera (8) : 259 +- geant (16) : 476 +- gipsie (16) : 410 +- io (16) : 495 +- isou (8) : 231 +- mini (4) : 149 +- mixi (4) : 188 +- modi (4) : 127 +- moxi (4) : 159 + +Variance of load distribution for load normalized to intra-class mean +(a class being the set of nodes with the same announced capacity): 2.18% <-- REALLY BAD + +Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): +removing atuin digitale : 63.09% 30.18% 6.64% 0.10% +removing atuin drosera : 72.36% 23.44% 4.10% 0.10% +removing atuin datura : 73.24% 21.48% 5.18% 0.10% +removing jupiter io : 48.34% 38.48% 12.30% 0.88% +removing jupiter isou : 74.12% 19.73% 6.05% 0.10% +removing grog mini : 84.47% 12.40% 2.93% 0.20% +removing grog mixi : 80.76% 16.60% 2.64% 0.00% +removing grog moxi : 83.59% 14.06% 2.34% 0.00% +removing grog modi : 87.01% 11.43% 1.46% 0.10% +removing grisou geant : 48.24% 37.40% 13.67% 0.68% +removing grisou gipsie : 53.03% 33.59% 13.09% 0.29% +on average: 69.84% 23.53% 6.40% 0.23% <-- COULD BE BETTER + +-------- + +##### MagLev ##### + +#partitions per node: +- datura (8) : 273 +- digitale (8) : 256 +- drosera (8) : 267 +- geant (16) : 452 +- gipsie (16) : 427 +- io (16) : 483 +- isou (8) : 272 +- mini (4) : 184 +- mixi (4) : 160 +- modi (4) : 144 +- moxi (4) : 154 + +Variance of load distribution: 0.37% <-- Already much better, but not optimal + +Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): +removing atuin digitale : 62.60% 29.20% 7.91% 0.29% +removing atuin drosera : 65.92% 26.56% 7.23% 0.29% +removing atuin datura : 63.96% 27.83% 7.71% 0.49% +removing jupiter io : 44.63% 40.33% 14.06% 0.98% +removing jupiter isou : 63.38% 27.25% 8.98% 0.39% +removing grog mini : 72.46% 21.00% 6.35% 0.20% +removing grog mixi : 72.95% 22.46% 4.39% 0.20% +removing grog moxi : 74.22% 20.61% 4.98% 0.20% +removing grog modi : 75.98% 18.36% 5.27% 0.39% +removing grisou geant : 46.97% 36.62% 15.04% 1.37% +removing grisou gipsie : 49.22% 36.52% 12.79% 1.46% +on average: 62.94% 27.89% 8.61% 0.57% <-- WORSE THAN PREVIOUSLY +``` + +#### The magical solution: multi-DC aware MagLev + +Suppose we want to select three replicas for each partition (this is what we do in our simulation and in most Garage deployments). +We apply MagLev three times consecutively, one for each replica selection. +The first time is pretty much the same as normal MagLev, but for the following times, when a node runs through its preference +list to select a partition to replicate, we skip partitions for which adding this node would not bring datacenter-diversity. +More precisely, we skip a partition in the preference list if: + +- the node already replicates the partition (from one of the previous rounds of MagLev) +- the node is in a datacenter where a node already replicates the partition and there are other datacenters available + +Refer to `method4` in the simulation script for a formal definition. + +``` +##### Multi-DC aware MagLev ##### + +#partitions per node: +- datura (8) : 268 <-- NODES WITH THE SAME CAPACITY +- digitale (8) : 267 HAVE THE SAME NUM OF PARTITIONS +- drosera (8) : 267 (+- 1) +- geant (16) : 470 +- gipsie (16) : 472 +- io (16) : 516 +- isou (8) : 268 +- mini (4) : 136 +- mixi (4) : 136 +- modi (4) : 136 +- moxi (4) : 136 + +Variance of load distribution: 0.06% <-- CAN'T DO BETTER THAN THIS + +Disruption when removing nodes (partitions moved on 0/1/2/3 nodes): +removing atuin digitale : 65.72% 33.01% 1.27% 0.00% +removing atuin drosera : 64.65% 33.89% 1.37% 0.10% +removing atuin datura : 66.11% 32.62% 1.27% 0.00% +removing jupiter io : 42.97% 53.42% 3.61% 0.00% +removing jupiter isou : 66.11% 32.32% 1.56% 0.00% +removing grog mini : 80.47% 18.85% 0.68% 0.00% +removing grog mixi : 80.27% 18.85% 0.88% 0.00% +removing grog moxi : 80.18% 19.04% 0.78% 0.00% +removing grog modi : 79.69% 19.92% 0.39% 0.00% +removing grisou geant : 44.63% 52.15% 3.22% 0.00% +removing grisou gipsie : 43.55% 52.54% 3.91% 0.00% +on average: 64.94% 33.33% 1.72% 0.01% <-- VERY GOOD (VERY LOW VALUES FOR 2 AND 3 NODES) +``` -- cgit v1.2.3 From 60f994a118afdde241d9a2296e2e8e090e08674d Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 17:24:11 +0100 Subject: Working on the getting started guide --- doc/book/src/SUMMARY.md | 3 +- doc/book/src/cookbook/index.md | 4 + doc/book/src/design/index.md | 4 + doc/book/src/development/index.md | 3 + doc/book/src/getting_started/binary.md | 22 ++++ doc/book/src/getting_started/daemon.md | 148 +++++++++++++++++++++++ doc/book/src/getting_started/install.md | 25 ---- doc/book/src/reference_manual/index.md | 4 + doc/book/src/working_documents/index.md | 7 +- doc/book/src/working_documents/load_balancing.md | 2 +- 10 files changed, 192 insertions(+), 30 deletions(-) create mode 100644 doc/book/src/getting_started/binary.md create mode 100644 doc/book/src/getting_started/daemon.md delete mode 100644 doc/book/src/getting_started/install.md (limited to 'doc') diff --git a/doc/book/src/SUMMARY.md b/doc/book/src/SUMMARY.md index 7697948b..48abf741 100644 --- a/doc/book/src/SUMMARY.md +++ b/doc/book/src/SUMMARY.md @@ -3,7 +3,8 @@ [The Garage Data Store](./intro.md) - [Getting Started](./getting_started/index.md) - - [Installation](./getting_started/install.md) + - [Get a binary](./getting_started/binary.md) + - [Configure the daemon](./getting_started/daemon.md) - [Configure a cluster](./getting_started/cluster.md) - [Create buckets and keys](./getting_started/bucket.md) - [Handle files](./getting_started/files.md) diff --git a/doc/book/src/cookbook/index.md b/doc/book/src/cookbook/index.md index 741ecbe7..d7a51065 100644 --- a/doc/book/src/cookbook/index.md +++ b/doc/book/src/cookbook/index.md @@ -1 +1,5 @@ # Cookbook + +A cookbook, when you cook, is a collection of recipes. +Similarly, Garage's cookbook contains a collection of recipes that are known to works well! +This chapter could also be referred as "Tutorials" or "Best practises". diff --git a/doc/book/src/design/index.md b/doc/book/src/design/index.md index 3d14cb7c..d09a6008 100644 --- a/doc/book/src/design/index.md +++ b/doc/book/src/design/index.md @@ -1 +1,5 @@ # Design + +The design section helps you to see Garage from a "big picture" perspective. +It will allow you to understand if Garage is a good fit for you, +how to better use it, how to contribute to it, what can Garage could and could not do, etc. diff --git a/doc/book/src/development/index.md b/doc/book/src/development/index.md index 459110d3..d6b5e38b 100644 --- a/doc/book/src/development/index.md +++ b/doc/book/src/development/index.md @@ -1 +1,4 @@ # Development + +Now that you are a Garage expert, you want to enhance it, you are in the right place! +We discuss here how to hack on Garage, how we manage its development, etc. diff --git a/doc/book/src/getting_started/binary.md b/doc/book/src/getting_started/binary.md new file mode 100644 index 00000000..ba8838b6 --- /dev/null +++ b/doc/book/src/getting_started/binary.md @@ -0,0 +1,22 @@ +# Get a binary + +Currently, only two installations procedures are supported for Garage: from Docker (x86\_64 for Linux) and from source. +In the future, we plan to add a third one, by publishing a compiled binary (x86\_64 for Linux). +We did not test other architecture/operating system but, as long as your architecture/operating system is supported by Rust, you should be able to run Garage (feel free to report your tests!). + +## From Docker + +Our docker image is currently named `lxpz/garage_amd64` and is stored on the [Docker Hub](https://hub.docker.com/r/lxpz/garage_amd64/tags?page=1&ordering=last_updated). +We encourage you to use a fixed tag (eg. `v0.1.1d`) and not the `latest` tag. +For this example, we will use the latest published version at the time of the writing which is `v0.1.1d` but it's up to you +to check [the most recent versions on the Docker Hub](https://hub.docker.com/r/lxpz/garage_amd64/tags?page=1&ordering=last_updated). + +For example: + +``` +sudo docker pull lxpz/garage_amd64:v0.1.1d +``` + +## From source + + diff --git a/doc/book/src/getting_started/daemon.md b/doc/book/src/getting_started/daemon.md new file mode 100644 index 00000000..d704ad0d --- /dev/null +++ b/doc/book/src/getting_started/daemon.md @@ -0,0 +1,148 @@ +# Configure the daemon + +Garage is a software that can be run only in a cluster and requires at least 3 instances. +In our getting started guide, we document two deployment types: + - [Single machine deployment](#single-machine-deployment) though `docker-compose` + - [Multiple machine deployment](#multiple-machine-deployment) through `docker` or `systemd` + +In any case, you first need to generate TLS certificates, as traffic is encrypted between Garage's nodes. + +## Generating a TLS Certificate + +Next, to generate your TLS certificates, run on your machine: + +``` +wget https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/branch/master/genkeys.sh +chmod +x genkeys.sh +./genkeys.sh +``` + +It will creates a folder named `pki` containing the keys that you will used for the cluster. + +### Single machine deployment + +Single machine deployment is only described through docker compose. + +```yml +version: '3.4' + +networks: { virtnet: { ipam: { config: [ subnet: 172.20.0.0/24 ]}}} + +services: + g1: + image: lxpz/garage_amd64:v0.1.1d + networks: { virtnet: { ipv4_address: 172.20.0.101 }} + volumes: + - "./pki:/pki" + - "./config.toml:/garage/config.toml" + + g2: + image: lxpz/garage_amd64:v0.1.1d + networks: { virtnet: { ipv4_address: 172.20.0.102 }} + volumes: + - "./pki:/pki" + - "./config.toml:/garage/config.toml" + + g3: + image: lxpz/garage_amd64:v0.1.1d + networks: { virtnet: { ipv4_address: 172.20.0.103 }} + volumes: + - "./pki:/pki" + - "./config.toml:/garage/config.toml" +``` + +*We define a static network here which is not considered as a best practise on Docker. +The rational is that Garage only supports IP address and not domain names in its configuration, so we need to know the IP address in advance.* + +and then create the `config.toml` file as follow: + +```toml +metadata_dir = "/garage/meta" +data_dir = "/garage/data" +rpc_bind_addr = "[::]:3901" +bootstrap_peers = [ + "172.20.0.101:3901", + "172.20.0.102:3901", + "172.20.0.103:3901", +] + +[rpc_tls] +ca_cert = "/pki/garage-ca.crt" +node_cert = "/pki/garage.crt" +node_key = "/pki/garage.key" + +[s3_api] +s3_region = "garage" +api_bind_addr = "[::]:3900" + +[s3_web] +bind_addr = "[::]:3902" +root_domain = ".web.garage" +index = "index.html" +``` + +*Please note that we have not mounted `/garage/meta` or `/garage/data` on the host: data will be lost when the container will be destroyed.* + +And that's all, you are ready to launch your cluster! + +``` +sudo docker-compose up +``` + +While your daemons are up, your cluster is still not configured yet. +However, you can check that your services are still listening as expected by querying them from your host: + +```bash +curl http://172.20.0.{101,102,103}:3902 +``` + +which should give you: + +``` +Not found +Not found +Not found +``` + +### Multiple machine deployment + +Before deploying garage on your infrastructure, you must inventory your machines. +For our example, we will suppose the following infrastructure: + +| Location | Name | IP Address | Disk Space | +|----------|---------|------------|------------| +| Paris | Mercury | fc00:1::1 | 1 To | +| Paris | Venus | fc00:1::2 | 2 To | +| London | Earth | fc00:1::2 | 2 To | +| Brussels | Mars | fc00:B::1 | 1.5 To | + +First, you need to setup your machines/VMs by copying on them the `pki` folder in `/etc/garage/pki`. +All your machines will also share the same configuration file, stored in `/etc/garage/config.toml`: + +```toml +metadata_dir = "/var/lib/garage/meta" +data_dir = "/var/lib/garage/data" +rpc_bind_addr = "[::]:3901" +bootstrap_peers = [ + "[fc00:1::1]:3901", + "[fc00:1::2]:3901", + "[fc00:B::1]:3901", + "[fc00:F::1]:3901", +] + +[rpc_tls] +ca_cert = "/pki/garage-ca.crt" +node_cert = "/pki/garage.crt" +node_key = "/pki/garage.key" + +[s3_api] +s3_region = "garage" +api_bind_addr = "[::]:3900" + +[s3_web] +bind_addr = "[::]:3902" +root_domain = ".web.garage" +index = "index.html" +``` + + diff --git a/doc/book/src/getting_started/install.md b/doc/book/src/getting_started/install.md deleted file mode 100644 index 98af1283..00000000 --- a/doc/book/src/getting_started/install.md +++ /dev/null @@ -1,25 +0,0 @@ -# Installation - -Currently, only two installations procedures are supported for Garage: from Docker (x86\_64 for Linux) and from source. -In the future, we plan to add a third one, by publishing a compiled binary (x86\_64 for Linux). -We did not test other architecture/operating system but, as long as your architecture/operating system is supported by Rust, you should be able to run Garage (feel free to report your tests!). - -## From Docker - -Garage is a software that can be run only in a cluster and requires at least 3 instances. -If you plan to run the 3 instances on your machine for test purposes, we recommend a **docker-compose** deployment. -If you have 3 independent machines (or 3 VM on independent machines) that can communite together, a **simple docker** deployment is enough. -In any case, you first need to pick a Docker image version. - -Our docker image is currently named `lxpz/garage_amd64` and is stored on the [Docker Hub](https://hub.docker.com/r/lxpz/garage_amd64/tags?page=1&ordering=last_updated). -We encourage you to use a fixed tag (eg. `v0.1.1d`) and not the `latest` tag. -For this example, we will use the latest published version at the time of the writing which is `v0.1.1d` but it's up to you -to check [the most recent versions on the Docker Hub](https://hub.docker.com/r/lxpz/garage_amd64/tags?page=1&ordering=last_updated). - -### Single machine deployment with docker-compose - - - -### Multiple machine deployment with docker - -## From source diff --git a/doc/book/src/reference_manual/index.md b/doc/book/src/reference_manual/index.md index a2f380f6..0d4bd6f3 100644 --- a/doc/book/src/reference_manual/index.md +++ b/doc/book/src/reference_manual/index.md @@ -1 +1,5 @@ # Reference Manual + +A reference manual contains some extensive descriptions about the features and the behaviour of the software. +Reading of this chapter is recommended once you have a good knowledge/understanding of Garage. +It will be useful if you want to tune it or to use it in some exotic conditions. diff --git a/doc/book/src/working_documents/index.md b/doc/book/src/working_documents/index.md index 6eb7eccb..a9e7f899 100644 --- a/doc/book/src/working_documents/index.md +++ b/doc/book/src/working_documents/index.md @@ -1,7 +1,8 @@ # Working Documents Working documents are documents that reflect the fact that Garage is a software that evolves quickly. -They are a way to communicate our ideas, our changes, and so on. +They are a way to communicate our ideas, our changes, and so on before or while we are implementing them in Garage. +If you like to live on the edge, it could also serve as a documentation of our next features to be released. -Ideally, while the feature/patch has been merged, the working document should serve as a source to -update the rest of the documentation. +Ideally, once the feature/patch has been merged, the working document should serve as a source to +update the rest of the documentation and then be removed. diff --git a/doc/book/src/working_documents/load_balancing.md b/doc/book/src/working_documents/load_balancing.md index f0f1a4d4..583b6086 100644 --- a/doc/book/src/working_documents/load_balancing.md +++ b/doc/book/src/working_documents/load_balancing.md @@ -1,4 +1,4 @@ -## Load Balancing Data +## Load Balancing Data (planned for version 0.2) I have conducted a quick study of different methods to load-balance data over different Garage nodes using consistent hashing. -- cgit v1.2.3 From 468e45ed7f47ce41f196274519c6b59f3bf46f0e Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 18:01:06 +0100 Subject: WIP doc --- doc/book/src/getting_started/binary.md | 22 ++++++++++++ doc/book/src/getting_started/daemon.md | 64 +++++++++++++++++++++++++++------- 2 files changed, 74 insertions(+), 12 deletions(-) (limited to 'doc') diff --git a/doc/book/src/getting_started/binary.md b/doc/book/src/getting_started/binary.md index ba8838b6..9a18babc 100644 --- a/doc/book/src/getting_started/binary.md +++ b/doc/book/src/getting_started/binary.md @@ -19,4 +19,26 @@ sudo docker pull lxpz/garage_amd64:v0.1.1d ## From source +Garage is a standard Rust project. +First, you need `rust` and `cargo`. +On Debian: + +```bash +sudo apt-get update +sudo apt-get install -y rustc cargo +``` + +Then, you can ask cargo to install the binary for you: + +```bash +cargo install garage +``` + +That's all, `garage` should be in `$HOME/.cargo/bin`. +You can add this folder to your `$PATH` or copy the binary somewhere else on your system. +For the following, we will assume you copied it in `/usr/local/bin/garage`: + +```bash +sudo cp $HOME/.cargo/bin/garage /usr/local/bin/garage +``` diff --git a/doc/book/src/getting_started/daemon.md b/doc/book/src/getting_started/daemon.md index d704ad0d..0f0199fe 100644 --- a/doc/book/src/getting_started/daemon.md +++ b/doc/book/src/getting_started/daemon.md @@ -2,14 +2,14 @@ Garage is a software that can be run only in a cluster and requires at least 3 instances. In our getting started guide, we document two deployment types: - - [Single machine deployment](#single-machine-deployment) though `docker-compose` - - [Multiple machine deployment](#multiple-machine-deployment) through `docker` or `systemd` + - [Test deployment](#test-deployment) though `docker-compose` + - [Real-world deployment](#real-world-deployment) through `docker` or `systemd` In any case, you first need to generate TLS certificates, as traffic is encrypted between Garage's nodes. ## Generating a TLS Certificate -Next, to generate your TLS certificates, run on your machine: +To generate your TLS certificates, run on your machine: ``` wget https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/branch/master/genkeys.sh @@ -19,9 +19,18 @@ chmod +x genkeys.sh It will creates a folder named `pki` containing the keys that you will used for the cluster. -### Single machine deployment +## Test deployment -Single machine deployment is only described through docker compose. +Single machine deployment is only described through `docker-compose`. + +Before starting, we recommend you create a folder for our deployment: + +```bash +mkdir garage-single +cd garage-single +``` + +We start by creating a file named `docker-compose.yml` describing our network and our containers: ```yml version: '3.4' @@ -54,7 +63,7 @@ services: *We define a static network here which is not considered as a best practise on Docker. The rational is that Garage only supports IP address and not domain names in its configuration, so we need to know the IP address in advance.* -and then create the `config.toml` file as follow: +and then create the `config.toml` file next to it as follow: ```toml metadata_dir = "/garage/meta" @@ -104,7 +113,9 @@ Not found Not found ``` -### Multiple machine deployment +That's all, you are ready to [configure your cluster!](./cluster.md). + +## Real-world deployment Before deploying garage on your infrastructure, you must inventory your machines. For our example, we will suppose the following infrastructure: @@ -116,8 +127,14 @@ For our example, we will suppose the following infrastructure: | London | Earth | fc00:1::2 | 2 To | | Brussels | Mars | fc00:B::1 | 1.5 To | -First, you need to setup your machines/VMs by copying on them the `pki` folder in `/etc/garage/pki`. -All your machines will also share the same configuration file, stored in `/etc/garage/config.toml`: +On each machine, we will have a similar setup, especially you must consider the following folders/files: + - `/etc/garage/pki`: Garage certificates, must be generated on your computer and copied on the servers + - `/etc/garage/config.toml`: Garage daemon's configuration (defined below) + - `/etc/systemd/system/garage.service`: Service file to start garage at boot automatically (defined below, not required if you use docker) + - `/var/lib/garage/meta`: Contains Garage's metadata, put this folder on a SSD if possible + - `/var/lib/garage/data`: Contains Garage's data, this folder will grows and must be on a large storage, possibly big HDDs. + +A valid `/etc/garage/config.toml` for our cluster would be: ```toml metadata_dir = "/var/lib/garage/meta" @@ -131,9 +148,9 @@ bootstrap_peers = [ ] [rpc_tls] -ca_cert = "/pki/garage-ca.crt" -node_cert = "/pki/garage.crt" -node_key = "/pki/garage.key" +ca_cert = "/etc/garage/pki/garage-ca.crt" +node_cert = "/etc/garage/pki/garage.crt" +node_key = "/etc/garage/pki/garage.key" [s3_api] s3_region = "garage" @@ -145,4 +162,27 @@ root_domain = ".web.garage" index = "index.html" ``` +Please make sure to change `bootstrap_peers` to **your** IP addresses! + +### For docker users + +On each machine, you can run the daemon with: + +```bash +docker run \ + -d \ + --restart always \ + --network host \ + -v /etc/garage/pki:/etc/garage/pki \ + -v /etc/garage/config.toml:/garage/config.toml \ + -v /var/lib/garage/meta:/var/lib/garage/meta \ + -v /var/lib/garage/data:/var/lib/garage/data \ + lxpz/garage_amd64:v0.1.1d +``` + +It should be restart automatically at each reboot. +Please note that we use host networking as otherwise Docker containers can no communicate with IPv6. + +To upgrade, simply stop and remove this container and start again the command with a new version of garage. +### For systemd/raw binary users -- cgit v1.2.3 From 44d0815ff98b72cd4f23b41ab6c87a867a392815 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 20:04:27 +0100 Subject: Wrote daemon --- doc/book/src/getting_started/daemon.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) (limited to 'doc') diff --git a/doc/book/src/getting_started/daemon.md b/doc/book/src/getting_started/daemon.md index 0f0199fe..7c303c6a 100644 --- a/doc/book/src/getting_started/daemon.md +++ b/doc/book/src/getting_started/daemon.md @@ -186,3 +186,36 @@ Please note that we use host networking as otherwise Docker containers can no co To upgrade, simply stop and remove this container and start again the command with a new version of garage. ### For systemd/raw binary users + +Create a file named `/etc/systemd/system/garage.service`: + +```toml +[Unit] +Description=Garage Data Store +After=network-online.target +Wants=network-online.target + +[Service] +Environment='RUST_LOG=garage=info' 'RUST_BACKTRACE=1' +ExecStart=/usr/local/bin/garage server -c /etc/garage/config.toml + +[Install] +WantedBy=multi-user.target +``` + +To start the service then automatically enable it at boot: + +```bash +sudo systemctl start garage +sudo systemctl enable garage +``` + +To see if the service is running and to browse its logs: + +```bash +sudo systemctl status garage +sudo journalctl -u garage +``` + +If you want to modify the service file, do not forget to run `systemctl daemon-reload` +to inform `systemd` of your modifications. -- cgit v1.2.3 From b82a61fba2f115bd0263ff863562fbe63254058f Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 20:58:30 +0100 Subject: Simplify our README --- doc/Quickstart.md | 140 -------------------------------- doc/book/src/SUMMARY.md | 1 + doc/book/src/getting_started/cluster.md | 13 +++ 3 files changed, 14 insertions(+), 140 deletions(-) delete mode 100644 doc/Quickstart.md (limited to 'doc') diff --git a/doc/Quickstart.md b/doc/Quickstart.md deleted file mode 100644 index 6d0993a4..00000000 --- a/doc/Quickstart.md +++ /dev/null @@ -1,140 +0,0 @@ -# Quickstart on an existing deployment - -First, chances are that your garage deployment is secured by TLS. -All your commands must be prefixed with their certificates. -I will define an alias once and for all to ease future commands. -Please adapt the path of the binary and certificates to your installation! - -``` -alias grg="/garage/garage --ca-cert /secrets/garage-ca.crt --client-cert /secrets/garage.crt --client-key /secrets/garage.key" -``` - -Now we can check that everything is going well by checking our cluster status: - -``` -grg status -``` - -Don't forget that `help` command and `--help` subcommands can help you anywhere, the CLI tool is self-documented! Two examples: - -``` -grg help -grg bucket allow --help -``` - -Fine, now let's create a bucket (we imagine that you want to deploy nextcloud): - -``` -grg bucket create nextcloud-bucket -``` - -Check that everything went well: - -``` -grg bucket list -grg bucket info nextcloud-bucket -``` - -Now we will generate an API key to access this bucket. -Note that API keys are independent of buckets: one key can access multiple buckets, multiple keys can access one bucket. - -Now, let's start by creating a key only for our PHP application: - -``` -grg key new --name nextcloud-app-key -``` - -You will have the following output (this one is fake, `key_id` and `secret_key` were generated with the openssl CLI tool): - -``` -Key { key_id: "GK3515373e4c851ebaad366558", secret_key: "7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34", name: "nextcloud-app-key", name_timestamp: 1603280506694, deleted: false, authorized_buckets: [] } -``` - -Check that everything works as intended (be careful, info works only with your key identifier and not with its friendly name!): - -``` -grg key list -grg key info GK3515373e4c851ebaad366558 -``` - -Now that we have a bucket and a key, we need to give permissions to the key on the bucket! - -``` -grg bucket allow --read --write nextcloud-bucket --key GK3515373e4c851ebaad366558 -``` - -You can check at any times allowed keys on your bucket with: - -``` -grg bucket info nextcloud-bucket -``` - -Now, let's move to the S3 API! -We will use the `s3cmd` CLI tool. -You can install it via your favorite package manager. -Otherwise, check [their website](https://s3tools.org/s3cmd) - -We will configure `s3cmd` with its interactive configuration tool, be careful not all endpoints are implemented! -Especially, the test run at the end does not work (yet). - -``` -$ s3cmd --configure - -Enter new values or accept defaults in brackets with Enter. -Refer to user manual for detailed description of all options. - -Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables. -Access Key: GK3515373e4c851ebaad366558 -Secret Key: 7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34 -Default Region [US]: garage - -Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3. -S3 Endpoint [s3.amazonaws.com]: garage.deuxfleurs.fr - -Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used -if the target S3 system supports dns based buckets. -DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: garage.deuxfleurs.fr - -Encryption password is used to protect your files from reading -by unauthorized persons while in transfer to S3 -Encryption password: -Path to GPG program [/usr/bin/gpg]: - -When using secure HTTPS protocol all communication with Amazon S3 -servers is protected from 3rd party eavesdropping. This method is -slower than plain HTTP, and can only be proxied with Python 2.7 or newer -Use HTTPS protocol [Yes]: - -On some networks all internet access must go through a HTTP proxy. -Try setting it here if you can't connect to S3 directly -HTTP Proxy server name: - -New settings: - Access Key: GK3515373e4c851ebaad366558 - Secret Key: 7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34 - Default Region: garage - S3 Endpoint: garage.deuxfleurs.fr - DNS-style bucket+hostname:port template for accessing a bucket: garage.deuxfleurs.fr - Encryption password: - Path to GPG program: /usr/bin/gpg - Use HTTPS protocol: True - HTTP Proxy server name: - HTTP Proxy server port: 0 - -Test access with supplied credentials? [Y/n] n - -Save settings? [y/N] y -Configuration saved to '/home/quentin/.s3cfg' -``` - -Now, if everything works, the following commands should work: - -``` -echo hello world > hello.txt -s3cmd put hello.txt s3://nextcloud-bucket -s3cmd ls s3://nextcloud-bucket -s3cmd rm s3://nextcloud-bucket/hello.txt -``` - -That's all for now! - diff --git a/doc/book/src/SUMMARY.md b/doc/book/src/SUMMARY.md index 48abf741..7c435f23 100644 --- a/doc/book/src/SUMMARY.md +++ b/doc/book/src/SUMMARY.md @@ -5,6 +5,7 @@ - [Getting Started](./getting_started/index.md) - [Get a binary](./getting_started/binary.md) - [Configure the daemon](./getting_started/daemon.md) + - [Control the daemon](./getting_started/control.md) - [Configure a cluster](./getting_started/cluster.md) - [Create buckets and keys](./getting_started/bucket.md) - [Handle files](./getting_started/files.md) diff --git a/doc/book/src/getting_started/cluster.md b/doc/book/src/getting_started/cluster.md index 868379b9..a17bf14e 100644 --- a/doc/book/src/getting_started/cluster.md +++ b/doc/book/src/getting_started/cluster.md @@ -1 +1,14 @@ # Configure a cluster + +## Test cluster + +## Real-world cluster + +For our example, we will suppose we have the following infrastructure: + +| Location | Name | IP Address | Disk Space | +|----------|---------|------------|------------| +| Paris | Mercury | fc00:1::1 | 1 To | +| Paris | Venus | fc00:1::2 | 2 To | +| London | Earth | fc00:1::2 | 2 To | +| Brussels | Mars | fc00:B::1 | 1.5 To | -- cgit v1.2.3 From 1a5af9d1fc4ab4727f9747b0947d1a0c7e0002e7 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 22:06:37 +0100 Subject: WIP getting started --- doc/book/src/getting_started/bucket.md | 57 +++++++++++++----------- doc/book/src/getting_started/cluster.md | 72 +++++++++++++++++++++++++++--- doc/book/src/getting_started/control.md | 77 +++++++++++++++++++++++++++++++++ doc/book/src/getting_started/daemon.md | 1 + 4 files changed, 175 insertions(+), 32 deletions(-) create mode 100644 doc/book/src/getting_started/control.md (limited to 'doc') diff --git a/doc/book/src/getting_started/bucket.md b/doc/book/src/getting_started/bucket.md index 8b05ee23..b22ce788 100644 --- a/doc/book/src/getting_started/bucket.md +++ b/doc/book/src/getting_started/bucket.md @@ -1,71 +1,78 @@ # Create buckets and keys -First, chances are that your garage deployment is secured by TLS. -All your commands must be prefixed with their certificates. -I will define an alias once and for all to ease future commands. -Please adapt the path of the binary and certificates to your installation! +*We use a command named `garagectl` which is in fact an alias you must define as explained in the [Control the daemon](./daemon.md) section.* -``` -alias grg="/garage/garage --ca-cert /secrets/garage-ca.crt --client-cert /secrets/garage.crt --client-key /secrets/garage.key" -``` - -Now we can check that everything is going well by checking our cluster status: - -``` -grg status -``` +In this section, we will suppose that we want to create a bucket named `nextcloud-bucket` +that will be accessed through a key named `nextcloud-app-key`. Don't forget that `help` command and `--help` subcommands can help you anywhere, the CLI tool is self-documented! Two examples: ``` -grg help -grg bucket allow --help +garagectl help +garagectl bucket allow --help ``` +## Create a bucket + Fine, now let's create a bucket (we imagine that you want to deploy nextcloud): ``` -grg bucket create nextcloud-bucket +garagectl bucket create nextcloud-bucket ``` Check that everything went well: ``` -grg bucket list -grg bucket info nextcloud-bucket +garagectl bucket list +garagectl bucket info nextcloud-bucket ``` +## Create an API key + Now we will generate an API key to access this bucket. Note that API keys are independent of buckets: one key can access multiple buckets, multiple keys can access one bucket. Now, let's start by creating a key only for our PHP application: ``` -grg key new --name nextcloud-app-key +garagectl key new --name nextcloud-app-key ``` You will have the following output (this one is fake, `key_id` and `secret_key` were generated with the openssl CLI tool): -``` -Key { key_id: "GK3515373e4c851ebaad366558", secret_key: "7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34", name: "nextcloud-app-key", name_timestamp: 1603280506694, deleted: false, authorized_buckets: [] } +```javascript +Key { + key_id: "GK3515373e4c851ebaad366558", + secret_key: "7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34", + name: "nextcloud-app-key", + name_timestamp: 1603280506694, + deleted: false, + authorized_buckets: [] +} ``` Check that everything works as intended (be careful, info works only with your key identifier and not with its friendly name!): ``` -grg key list -grg key info GK3515373e4c851ebaad366558 +garagectl key list +garagectl key info GK3515373e4c851ebaad366558 ``` +## Allow a key to access a bucket + Now that we have a bucket and a key, we need to give permissions to the key on the bucket! ``` -grg bucket allow --read --write nextcloud-bucket --key GK3515373e4c851ebaad366558 +garagectl bucket allow \ + --read \ + --write + nextcloud-bucket \ + --key GK3515373e4c851ebaad366558 ``` You can check at any times allowed keys on your bucket with: ``` -grg bucket info nextcloud-bucket +garagectl bucket info nextcloud-bucket ``` diff --git a/doc/book/src/getting_started/cluster.md b/doc/book/src/getting_started/cluster.md index a17bf14e..af6e8f10 100644 --- a/doc/book/src/getting_started/cluster.md +++ b/doc/book/src/getting_started/cluster.md @@ -1,14 +1,72 @@ # Configure a cluster +*We use a command named `garagectl` which is in fact an alias you must define as explained in the [Control the daemon](./daemon.md) section.* + +In this section, we will inform garage of the disk space available on each node of the cluster +as well as the site (think datacenter) of each machine. + ## Test cluster +As this part is not relevant for a test cluster, you can use this one-liner to create a basic topology: + +```bash +garagectl status | grep UNCONFIGURED | grep -Po '^[0-9a-f]+' | while read id; do + garagectl node configure -d dc1 -n 10 $id +done +``` + ## Real-world cluster -For our example, we will suppose we have the following infrastructure: +For our example, we will suppose we have the following infrastructure (Tokens, Identifier and Datacenter are specific values to garage described in the following): + +| Location | Name | Disk Space | `Tokens` | `Identifier` | `Datacenter` | +|----------|---------|------------|----------|--------------|--------------| +| Paris | Mercury | 1 To | `100` | `8781c5` | `par1` | +| Paris | Venus | 2 To | `200` | `2a638e` | `par1` | +| London | Earth | 2 To | `200` | `68143d` | `lon1` | +| Brussels | Mars | 1.5 To | `150` | `212f75` | `bru1` | + +### Identifier + +After its first launch, garage generates a random and unique identifier for each nodes, such as: + +``` +8781c50c410a41b363167e9d49cc468b6b9e4449b6577b64f15a249a149bdcbc +``` + +Often a shorter form can be used, containing only the beginning of the identifier, like `8781c5`, +which identifies the server "Mercury" located in "Paris" according to our previous table. + +The most simple way to match an identifier to a node is to run: + +``` +garagectl status +``` + +It will display the IP address associated with each node; from the IP address you will be able to recognize the node. + +### Tokens + +Garage reasons on an arbitrary metric about disk storage that is named "tokens". +The number of tokens must be proportional to the disk space dedicated to the node. +Additionaly, ideally the number of tokens must be in the order of magnitude of 100 +to provide a good trade-off between data load balancing and performances (*this sentence must be verified, it may be wrong*). + +Here we chose 1 token = 10 Go but you are free to select the value that best fit your needs. + +### Datacenter + +Datacenter are simply a user-chosen identifier that identify a group of server that are located in the same place. +It is up to the system administrator deploying garage to identify what does "the same place" means. +Behind the scene, garage will try to store the same data on different sites to provide high availability despite a data center failure. + +### Inject the topology + +Given the information above, we will configure our cluster as follow: -| Location | Name | IP Address | Disk Space | -|----------|---------|------------|------------| -| Paris | Mercury | fc00:1::1 | 1 To | -| Paris | Venus | fc00:1::2 | 2 To | -| London | Earth | fc00:1::2 | 2 To | -| Brussels | Mars | fc00:B::1 | 1.5 To | +``` +garagectl node configure --datacenter par1 -n 100 -t mercury 8781c5 +garagectl node configure --datacenter par1 -n 200 -t venus 2a638e +garagectl node configure --datacenter lon1 -n 200 -t earth 68143d +garagectl node configure --datacenter bru1 -n 150 -t mars 212f75 +``` diff --git a/doc/book/src/getting_started/control.md b/doc/book/src/getting_started/control.md new file mode 100644 index 00000000..9a66a0dc --- /dev/null +++ b/doc/book/src/getting_started/control.md @@ -0,0 +1,77 @@ +# Control the daemon + +The `garage` binary has two purposes: + - it acts as a daemon when launched with `garage server ...` + - it acts as a control tool for the daemon when launched with any other command + +In this section, we will see how to use the `garage` binary as a control tool for the daemon we just started. +You first need to get a shell having access to this binary, which depends of your configuration: + - with `docker-compose`, run `sudo docker-compose exec g1 bash` then `/garage/garage` + - with `docker`, run `sudo docker exec -ti garaged bash` then `/garage/garage` + - with `systemd`, simply run `/usr/local/bin/garage` if you followed previous instructions + +*You can also install the binary on your machine to remotely control the cluster.* + +## Talk to the daemon and create an alias + +`garage` requires 4 options to talk with the daemon: + +``` +--ca-cert +--client-cert +--client-key +-h, --rpc-host +``` + +The 3 first ones are certificates and keys needed by TLS, the last one is simply the address of garage's RPC endpoint. +Because we configure garage directly from the server, we do not need to set `--rpc-host`. +To avoid typing the 3 first options each time we want to run a command, we will create an alias. + +### `docker-compose` alias + +```bash +alias garagectl='/garage/garage \ + --ca-cert /pki/garage-ca.crt \ + --client-cert /pki/garage.crt \ + --client-key /pki/garage.key' +``` + +### `docker` alias + +```bash +alias garagectl='/garage/garage \ + --ca-cert /etc/garage/pki/garage-ca.crt \ + --client-cert /etc/garage/pki/garage.crt \ + --client-key /etc/garage/pki/garage.key' +``` + + +### raw binary alias + +```bash +alias garagectl='/usr/local/bin/garage \ + --ca-cert /etc/garage/pki/garage-ca.crt \ + --client-cert /etc/garage/pki/garage.crt \ + --client-key /etc/garage/pki/garage.key' +``` + +Of course, if your deployment does not match exactly one of this alias, feel free to adapt it to your needs! + +## Test the alias + +You can test your alias by running a simple command such as: + +``` +garagectl status +``` + +You should get something like that as result: + +``` +Healthy nodes: +2a638ed6c775b69a… 37f0ba978d27 [::ffff:172.20.0.101]:3901 UNCONFIGURED/REMOVED +68143d720f20c89d… 9795a2f7abb5 [::ffff:172.20.0.103]:3901 UNCONFIGURED/REMOVED +8781c50c410a41b3… 758338dde686 [::ffff:172.20.0.102]:3901 UNCONFIGURED/REMOVED +``` + +...which means that you are ready to configure your cluster! diff --git a/doc/book/src/getting_started/daemon.md b/doc/book/src/getting_started/daemon.md index 7c303c6a..2f2b71b0 100644 --- a/doc/book/src/getting_started/daemon.md +++ b/doc/book/src/getting_started/daemon.md @@ -171,6 +171,7 @@ On each machine, you can run the daemon with: ```bash docker run \ -d \ + --name garaged \ --restart always \ --network host \ -v /etc/garage/pki:/etc/garage/pki \ -- cgit v1.2.3 From ea21c544343afeb37e96678089bcd535e64982a7 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 17 Mar 2021 22:44:35 +0100 Subject: Add handle files section to the doc --- doc/book/src/getting_started/files.md | 41 +++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) (limited to 'doc') diff --git a/doc/book/src/getting_started/files.md b/doc/book/src/getting_started/files.md index c8042dd3..0e3939ce 100644 --- a/doc/book/src/getting_started/files.md +++ b/doc/book/src/getting_started/files.md @@ -1 +1,42 @@ # Handle files + +We recommend the use of MinIO Client to interact with Garage files (`mc`). +Instructions to install it and use it are provided on the [MinIO website](https://docs.min.io/docs/minio-client-quickstart-guide.html). +Before reading the following, you need a working `mc` command on your path. + +## Configure `mc` + +You need your access key and secret key created in the [previous section](bucket.md). +You also need to set the endpoint: it must match the IP address of one of the node of the cluster and the API port (3900 by default). +For this whole configuration, you must set an alias name: we chose `my-garage`, that you will used for all commands. + +Adapt the following command accordingly and run it: + +```bash +mc alias set \ + my-garage \ + http://172.20.0.101:3900 \ + \ + \ + --api S3v4 +``` + +You must also add an environment variable to your configuration to inform MinIO of our region (`garage` by default). +The best way is to add the following snippet to your `$HOME/.bash_profile` or `$HOME/.bashrc` file: + +```bash +export MC_REGION=garage +``` + +## Use `mc` + +You can not list buckets from `mc` currently. + +But the following commands and many more should work: + +```bash +mc cp image.png my-garage/nextcloud-bucket +mc cp my-garage/nextcloud-bucket/image.png . +mc ls my-garage/nextcloud-bucket +mc mirror localdir/ my-garage/another-bucket +``` -- cgit v1.2.3