diff options
author | Alex <alex@adnab.me> | 2024-04-10 15:23:12 +0000 |
---|---|---|
committer | Alex <alex@adnab.me> | 2024-04-10 15:23:12 +0000 |
commit | 1779fd40c0fe676bedda0d40f647d7fe8b0f1e7e (patch) | |
tree | 47e42c4e6ae47590fbb5c8f94e90a23bf04c1674 /doc | |
parent | b47706809cc9d28d1328bafdf9756e96388cca24 (diff) | |
parent | ff093ddbb8485409f389abe7b5e569cb38d222d2 (diff) | |
download | garage-1779fd40c0fe676bedda0d40f647d7fe8b0f1e7e.tar.gz garage-1779fd40c0fe676bedda0d40f647d7fe8b0f1e7e.zip |
Merge pull request 'Garage v1.0' (#683) from next-0.10 into main
Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/683
Diffstat (limited to 'doc')
-rw-r--r-- | doc/api/garage-admin-v1.yml | 1 | ||||
-rw-r--r-- | doc/book/connect/apps/index.md | 49 | ||||
-rw-r--r-- | doc/book/cookbook/encryption.md | 51 | ||||
-rw-r--r-- | doc/book/cookbook/from-source.md | 1 | ||||
-rw-r--r-- | doc/book/cookbook/real-world.md | 17 | ||||
-rw-r--r-- | doc/book/design/internals.md | 2 | ||||
-rw-r--r-- | doc/book/operations/durability-repairs.md | 5 | ||||
-rw-r--r-- | doc/book/operations/layout.md | 2 | ||||
-rw-r--r-- | doc/book/quick-start/_index.md | 2 | ||||
-rw-r--r-- | doc/book/reference-manual/configuration.md | 175 | ||||
-rw-r--r-- | doc/book/reference-manual/features.md | 6 | ||||
-rw-r--r-- | doc/book/reference-manual/s3-compatibility.md | 1 | ||||
-rw-r--r-- | doc/book/working-documents/migration-1.md | 77 | ||||
-rw-r--r-- | doc/drafts/admin-api.md | 140 | ||||
-rw-r--r-- | doc/drafts/k2v-spec.md | 2 |
15 files changed, 344 insertions, 187 deletions
diff --git a/doc/api/garage-admin-v1.yml b/doc/api/garage-admin-v1.yml index fd78feb1..1ea77b2e 100644 --- a/doc/api/garage-admin-v1.yml +++ b/doc/api/garage-admin-v1.yml @@ -98,7 +98,6 @@ paths: type: string example: - "k2v" - - "sled" - "lmdb" - "sqlite" - "consul-discovery" diff --git a/doc/book/connect/apps/index.md b/doc/book/connect/apps/index.md index c8571fac..9a678275 100644 --- a/doc/book/connect/apps/index.md +++ b/doc/book/connect/apps/index.md @@ -80,6 +80,53 @@ To test your new configuration, just reload your Nextcloud webpage and start sen *External link:* [Nextcloud Documentation > Primary Storage](https://docs.nextcloud.com/server/latest/admin_manual/configuration_files/primary_storage.html) +#### SSE-C encryption (since Garage v1.0) + +Since version 1.0, Garage supports server-side encryption with customer keys +(SSE-C). In this mode, Garage is responsible for encrypting and decrypting +objects, but it does not store the encryption key itself. The encryption key +should be provided by Nextcloud upon each request. This mode of operation is +supported by Nextcloud and it has successfully been tested together with +Garage. + +To enable SSE-C encryption: + +1. Make sure your Garage server is accessible via SSL through a reverse proxy + such as Nginx, and that it is using a valid public certificate (Nextcloud + might be able to connect to an S3 server that is using a self-signed + certificate, but you will lose many hours while trying, so don't). + Configure values for `use_ssl` and `port` accordingly in your `config.php` + file. + +2. Generate an encryption key using the following command: + + ``` + openssl rand -base64 32 + ``` + + Make sure to keep this key **secret**! + +3. Add the encryption key in your `config.php` file as follows: + + + ```php + <?php + $CONFIG = array( + 'objectstore' => [ + 'class' => '\\OC\\Files\\ObjectStore\\S3', + 'arguments' => [ + ... + 'sse_c_key' => 'exampleencryptionkeyLbU+5fKYQcVoqnn+RaIOXgo=', + ... + ], + ], + ``` + +Nextcloud will now make Garage encrypt files at rest in the storage bucket. +These files will not be readable by an S3 client that has credentials to the +bucket but doesn't also know the secret encryption key. + + ### External Storage **From the GUI.** Activate the "External storage support" app from the "Applications" page (click on your account icon on the top right corner of your screen to display the menu). Go to your parameters page (also located below your account icon). Click on external storage (or the corresponding translation in your language). @@ -245,7 +292,7 @@ with average object size ranging from 50 KB to 150 KB. As such, your Garage cluster should be configured appropriately for good performance: - use Garage v0.8.0 or higher with the [LMDB database engine](@documentation/reference-manual/configuration.md#db-engine-since-v0-8-0). - With the default Sled database engine, your database could quickly end up taking tens of GB of disk space. + Older versions of Garage used the Sled database engine which had issues, such as databases quickly ending up taking tens of GB of disk space. - the Garage database should be stored on a SSD ### Creating your bucket diff --git a/doc/book/cookbook/encryption.md b/doc/book/cookbook/encryption.md index 21a5cbc6..bfbea0ec 100644 --- a/doc/book/cookbook/encryption.md +++ b/doc/book/cookbook/encryption.md @@ -53,20 +53,43 @@ and that's also why your nodes have super long identifiers. Adding TLS support built into Garage is not currently planned. -## Garage stores data in plain text on the filesystem - -Garage does not handle data encryption at rest by itself, and instead delegates -to the user to add encryption, either at the storage layer (LUKS, etc) or on -the client side (or both). There are no current plans to add data encryption -directly in Garage. - -Implementing data encryption directly in Garage might make things simpler for -end users, but also raises many more questions, especially around key -management: for encryption of data, where could Garage get the encryption keys -from ? If we encrypt data but keep the keys in a plaintext file next to them, -it's useless. We probably don't want to have to manage secrets in garage as it -would be very hard to do in a secure way. Maybe integrate with an external -system such as Hashicorp Vault? +## Garage stores data in plain text on the filesystem or encrypted using customer keys (SSE-C) + +For standard S3 API requests, Garage does not encrypt data at rest by itself. +For the most generic at rest encryption of data, we recommend setting up your +storage partitions on encrypted LUKS devices. + +If you are developping your own client software that makes use of S3 storage, +we recommend implementing data encryption directly on the client side and never +transmitting plaintext data to Garage. This makes it easy to use an external +untrusted storage provider if necessary. + +Garage does support [SSE-C +encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html), +an encryption mode of Amazon S3 where data is encrypted at rest using +encryption keys given by the client. The encryption keys are passed to the +server in a header in each request, to encrypt or decrypt data at the moment of +reading or writing. The server discards the key as soon as it has finished +using it for the request. This mode allows the data to be encrypted at rest by +Garage itself, but it requires support in the client software. It is also not +adapted to a model where the server is not trusted or assumed to be +compromised, as the server can easily know the encryption keys. Note however +that when using SSE-C encryption, the only Garage node that knows the +encryption key passed in a given request is the node to which the request is +directed (which can be a gateway node), so it is easy to have untrusted nodes +in the cluster as long as S3 API requests containing SSE-C encryption keys are +not directed to them. + +Implementing automatic data encryption directly in Garage without client-side +management of keys (something like +[SSE-S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html)) +could make things simpler for end users that don't want to setup LUKS, but also +raises many more questions, especially around key management: for encryption of +data, where could Garage get the encryption keys from? If we encrypt data but +keep the keys in a plaintext file next to them, it's useless. We probably don't +want to have to manage secrets in Garage as it would be very hard to do in a +secure way. At the time of speaking, there are no plans to implement this in +Garage. # Adding data encryption using external tools diff --git a/doc/book/cookbook/from-source.md b/doc/book/cookbook/from-source.md index 0d7d8e77..7105c999 100644 --- a/doc/book/cookbook/from-source.md +++ b/doc/book/cookbook/from-source.md @@ -91,6 +91,5 @@ The following feature flags are available in v0.8.0: | `metrics` | *by default* | Enable collection of metrics in Prometheus format on the admin API | | `telemetry-otlp` | optional | Enable collection of execution traces using OpenTelemetry | | `syslog` | optional | Enable logging to Syslog | -| `sled` | *by default* | Enable using Sled to store Garage's metadata | | `lmdb` | *by default* | Enable using LMDB to store Garage's metadata | | `sqlite` | *by default* | Enable using Sqlite3 to store Garage's metadata | diff --git a/doc/book/cookbook/real-world.md b/doc/book/cookbook/real-world.md index 894f3db4..7dba784d 100644 --- a/doc/book/cookbook/real-world.md +++ b/doc/book/cookbook/real-world.md @@ -89,20 +89,21 @@ to store 2 TB of data in total. - If you only have an HDD and no SSD, it's fine to put your metadata alongside the data on the same drive, but then consider your filesystem choice wisely - (see above). Having lots of RAM for your kernel to cache the metadata will - help a lot with performance. + (see above). Having lots of RAM for your kernel to cache the metadata will + help a lot with performance. The default LMDB database engine is the most + tested and has good performance. ## Get a Docker image Our docker image is currently named `dxflrs/garage` and is stored on the [Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated). -We encourage you to use a fixed tag (eg. `v0.9.4`) and not the `latest` tag. -For this example, we will use the latest published version at the time of the writing which is `v0.9.4` but it's up to you +We encourage you to use a fixed tag (eg. `v1.0.0`) and not the `latest` tag. +For this example, we will use the latest published version at the time of the writing which is `v1.0.0` but it's up to you to check [the most recent versions on the Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated). For example: ``` -sudo docker pull dxflrs/garage:v0.9.4 +sudo docker pull dxflrs/garage:v1.0.0 ``` ## Deploying and configuring Garage @@ -127,7 +128,7 @@ data_dir = "/var/lib/garage/data" db_engine = "lmdb" metadata_auto_snapshot_interval = "6h" -replication_mode = "3" +replication_factor = 3 compression_level = 2 @@ -168,7 +169,7 @@ docker run \ -v /etc/garage.toml:/etc/garage.toml \ -v /var/lib/garage/meta:/var/lib/garage/meta \ -v /var/lib/garage/data:/var/lib/garage/data \ - dxflrs/garage:v0.9.4 + dxflrs/garage:v1.0.0 ``` With this command line, Garage should be started automatically at each boot. @@ -182,7 +183,7 @@ If you want to use `docker-compose`, you may use the following `docker-compose.y version: "3" services: garage: - image: dxflrs/garage:v0.9.4 + image: dxflrs/garage:v1.0.0 network_mode: "host" restart: unless-stopped volumes: diff --git a/doc/book/design/internals.md b/doc/book/design/internals.md index cefb7acc..8e3c214e 100644 --- a/doc/book/design/internals.md +++ b/doc/book/design/internals.md @@ -97,7 +97,7 @@ delete a tombstone, the following condition has to be met: superseeded by the tombstone. This ensures that deleting the tombstone is safe and that no deleted value will come back in the system. -Garage makes use of Sled's atomic operations (such as compare-and-swap and +Garage uses atomic database operations (such as compare-and-swap and transactions) to ensure that only tombstones that have been correctly propagated to other nodes are ever deleted from the local entry tree. diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md index c76dc39e..fdf163e2 100644 --- a/doc/book/operations/durability-repairs.md +++ b/doc/book/operations/durability-repairs.md @@ -141,4 +141,7 @@ blocks may still be held by Garage. If you suspect that such corruption has occu in your cluster, you can run one of the following repair procedures: - `garage repair versions`: checks that all versions belong to a non-deleted object, and purges any orphan version -- `garage repair block_refs`: checks that all block references belong to a non-deleted object version, and purges any orphan block reference (this will then allow the blocks to be garbage-collected) + +- `garage repair block-refs`: checks that all block references belong to a non-deleted object version, and purges any orphan block reference (this will then allow the blocks to be garbage-collected) + +- `garage repair block-rc`: checks that the reference counters for blocks are in sync with the actual number of non-deleted entries in the block reference table diff --git a/doc/book/operations/layout.md b/doc/book/operations/layout.md index cf1372b0..667e89d2 100644 --- a/doc/book/operations/layout.md +++ b/doc/book/operations/layout.md @@ -12,7 +12,7 @@ An introduction to building cluster layouts can be found in the [production depl In Garage, all of the data that can be stored in a given cluster is divided into slices which we call *partitions*. Each partition is stored by one or several nodes in the cluster -(see [`replication_mode`](@/documentation/reference-manual/configuration.md#replication_mode)). +(see [`replication_factor`](@/documentation/reference-manual/configuration.md#replication_factor)). The layout determines the correspondence between these partitions, which exist on a logical level, and actual storage nodes. diff --git a/doc/book/quick-start/_index.md b/doc/book/quick-start/_index.md index 870bf9e9..9619f388 100644 --- a/doc/book/quick-start/_index.md +++ b/doc/book/quick-start/_index.md @@ -59,7 +59,7 @@ metadata_dir = "/tmp/meta" data_dir = "/tmp/data" db_engine = "sqlite" -replication_mode = "none" +replication_factor = 1 rpc_bind_addr = "[::]:3901" rpc_public_addr = "127.0.0.1:3901" diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md index 739f5e0e..423795fe 100644 --- a/doc/book/reference-manual/configuration.md +++ b/doc/book/reference-manual/configuration.md @@ -8,7 +8,8 @@ weight = 20 Here is an example `garage.toml` configuration file that illustrates all of the possible options: ```toml -replication_mode = "3" +replication_factor = 3 +consistency_mode = "consistent" metadata_dir = "/var/lib/garage/meta" data_dir = "/var/lib/garage/data" @@ -22,8 +23,6 @@ db_engine = "lmdb" block_size = "1M" block_ram_buffer_max = "256MiB" -sled_cache_capacity = "128MiB" -sled_flush_every_ms = 2000 lmdb_map_size = "1T" compression_level = 1 @@ -101,13 +100,12 @@ Top-level configuration options: [`metadata_auto_snapshot_interval`](#metadata_auto_snapshot_interval), [`metadata_dir`](#metadata_dir), [`metadata_fsync`](#metadata_fsync), -[`replication_mode`](#replication_mode), +[`replication_factor`](#replication_factor), +[`consistency_mode`](#consistency_mode), [`rpc_bind_addr`](#rpc_bind_addr), [`rpc_bind_outgoing`](#rpc_bind_outgoing), [`rpc_public_addr`](#rpc_public_addr), -[`rpc_secret`/`rpc_secret_file`](#rpc_secret), -[`sled_cache_capacity`](#sled_cache_capacity), -[`sled_flush_every_ms`](#sled_flush_every_ms). +[`rpc_secret`/`rpc_secret_file`](#rpc_secret). The `[consul_discovery]` section: [`api`](#consul_api), @@ -161,11 +159,12 @@ values in the configuration file: ### Top-level configuration options -#### `replication_mode` {#replication_mode} +#### `replication_factor` {#replication_factor} -Garage supports the following replication modes: +The replication factor can be any positive integer smaller or equal the node count in your cluster. +The chosen replication factor has a big impact on the cluster's failure tolerancy and performance characteristics. -- `none` or `1`: data stored on Garage is stored on a single node. There is no +- `1`: data stored on Garage is stored on a single node. There is no redundancy, and data will be unavailable as soon as one node fails or its network is disconnected. Do not use this for anything else than test deployments. @@ -176,17 +175,6 @@ Garage supports the following replication modes: before losing data. Data remains available in read-only mode when one node is down, but write operations will fail. - - `2-dangerous`: a variant of mode `2`, where written objects are written to - the second replica asynchronously. This means that Garage will return `200 - OK` to a PutObject request before the second copy is fully written (or even - before it even starts being written). This means that data can more easily - be lost if the node crashes before a second copy can be completed. This - also means that written objects might not be visible immediately in read - operations. In other words, this mode severely breaks the consistency and - durability guarantees of standard Garage cluster operation. Benefits of - this mode: you can still write to your cluster when one node is - unavailable. - - `3`: data stored on Garage will be stored on three different nodes, if possible each in a different zones. Garage tolerates two node failure, or several node failures but in no more than two zones (in a deployment with at @@ -194,55 +182,84 @@ Garage supports the following replication modes: or node failures are only in a single zone, reading and writing data to Garage can continue normally. - - `3-degraded`: a variant of replication mode `3`, that lowers the read +- `5`, `7`, ...: When setting the replication factor above 3, it is most useful to + choose an uneven value, since for every two copies added, one more node can fail + before losing the ability to write and read to the cluster. + +Note that in modes `2` and `3`, +if at least the same number of zones are available, an arbitrary number of failures in +any given zone is tolerated as copies of data will be spread over several zones. + +**Make sure `replication_factor` is the same in the configuration files of all nodes. +Never run a Garage cluster where that is not the case.** + +It is technically possible to change the replication factor although it's a +dangerous operation that is not officially supported. This requires you to +delete the existing cluster layout and create a new layout from scratch, +meaning that a full rebalancing of your cluster's data will be needed. To do +it, shut down your cluster entirely, delete the `custer_layout` files in the +meta directories of all your nodes, update all your configuration files with +the new `replication_factor` parameter, restart your cluster, and then create a +new layout with all the nodes you want to keep. Rebalancing data will take +some time, and data might temporarily appear unavailable to your users. +It is recommended to shut down public access to the cluster while rebalancing +is in progress. In theory, no data should be lost as rebalancing is a +routine operation for Garage, although we cannot guarantee you that everything + will go right in such an extreme scenario. + +#### `consistency_mode` {#consistency_mode} + +The consistency mode setting determines the read and write behaviour of your cluster. + + - `consistent`: The default setting. This is what the paragraph above describes. + The read and write quorum will be determined so that read-after-write consistency + is guaranteed. + - `degraded`: Lowers the read quorum to `1`, to allow you to read data from your cluster when several nodes (or nodes in several zones) are unavailable. In this mode, Garage - does not provide read-after-write consistency anymore. The write quorum is - still 2, ensuring that data successfully written to Garage is stored on at - least two nodes. - - - `3-dangerous`: a variant of replication mode `3` that lowers both the read + does not provide read-after-write consistency anymore. + The write quorum stays the same as in the `consistent` mode, ensuring that + data successfully written to Garage is stored on multiple nodes (depending + the replication factor). + - `dangerous`: This mode lowers both the read and write quorums to `1`, to allow you to both read and write to your cluster when several nodes (or nodes in several zones) are unavailable. It is the least consistent mode of operation proposed by Garage, and also one that should probably never be used. -Note that in modes `2` and `3`, -if at least the same number of zones are available, an arbitrary number of failures in -any given zone is tolerated as copies of data will be spread over several zones. +Changing the `consistency_mode` between modes while leaving the `replication_factor` untouched +(e.g. setting your node's `consistency_mode` to `degraded` when it was previously unset, or from +`dangerous` to `consistent`), can be done easily by just changing the `consistency_mode` +parameter in your config files and restarting all your Garage nodes. -**Make sure `replication_mode` is the same in the configuration files of all nodes. -Never run a Garage cluster where that is not the case.** +The consistency mode can be used together with various replication factors, to achieve +a wide range of read and write characteristics. Some examples: + + - Replication factor `2`, consistency mode `degraded`: While this mode + technically exists, its properties are the same as with consistency mode `consistent`, + since the read quorum with replication factor `2`, consistency mode `consistent` is already 1. + + - Replication factor `2`, consistency mode `dangerous`: written objects are written to + the second replica asynchronously. This means that Garage will return `200 + OK` to a PutObject request before the second copy is fully written (or even + before it even starts being written). This means that data can more easily + be lost if the node crashes before a second copy can be completed. This + also means that written objects might not be visible immediately in read + operations. In other words, this configuration severely breaks the consistency and + durability guarantees of standard Garage cluster operation. Benefits of + this configuration: you can still write to your cluster when one node is + unavailable. The quorums associated with each replication mode are described below: -| `replication_mode` | Number of replicas | Write quorum | Read quorum | Read-after-write consistency? | -| ------------------ | ------------------ | ------------ | ----------- | ----------------------------- | -| `none` or `1` | 1 | 1 | 1 | yes | -| `2` | 2 | 2 | 1 | yes | -| `2-dangerous` | 2 | 1 | 1 | NO | -| `3` | 3 | 2 | 2 | yes | -| `3-degraded` | 3 | 2 | 1 | NO | -| `3-dangerous` | 3 | 1 | 1 | NO | - -Changing the `replication_mode` between modes with the same number of replicas -(e.g. from `3` to `3-degraded`, or from `2-dangerous` to `2`), can be done easily by -just changing the `replication_mode` parameter in your config files and restarting all your -Garage nodes. - -It is also technically possible to change the replication mode to a mode with a -different numbers of replicas, although it's a dangerous operation that is not -officially supported. This requires you to delete the existing cluster layout -and create a new layout from scratch, meaning that a full rebalancing of your -cluster's data will be needed. To do it, shut down your cluster entirely, -delete the `custer_layout` files in the meta directories of all your nodes, -update all your configuration files with the new `replication_mode` parameter, -restart your cluster, and then create a new layout with all the nodes you want -to keep. Rebalancing data will take some time, and data might temporarily -appear unavailable to your users. It is recommended to shut down public access -to the cluster while rebalancing is in progress. In theory, no data should be -lost as rebalancing is a routine operation for Garage, although we cannot -guarantee you that everything will go right in such an extreme scenario. +| `consistency_mode` | `replication_factor` | Write quorum | Read quorum | Read-after-write consistency? | +| ------------------ | -------------------- | ------------ | ----------- | ----------------------------- | +| `consistent` | 1 | 1 | 1 | yes | +| `consistent` | 2 | 2 | 1 | yes | +| `dangerous` | 2 | 1 | 1 | NO | +| `consistent` | 3 | 2 | 2 | yes | +| `degraded` | 3 | 2 | 1 | NO | +| `dangerous` | 3 | 1 | 1 | NO | #### `metadata_dir` {#metadata_dir} @@ -278,23 +295,18 @@ Since `v0.8.0`, Garage can use alternative storage backends as follows: | DB engine | `db_engine` value | Database path | | --------- | ----------------- | ------------- | -| [LMDB](https://www.lmdb.tech) (default since `v0.9.0`) | `"lmdb"` | `<metadata_dir>/db.lmdb/` | -| [Sled](https://sled.rs) (default up to `v0.8.0`) | `"sled"` | `<metadata_dir>/db/` | -| [Sqlite](https://sqlite.org) | `"sqlite"` | `<metadata_dir>/db.sqlite` | +| [LMDB](https://www.lmdb.tech) (since `v0.8.0`, default since `v0.9.0`) | `"lmdb"` | `<metadata_dir>/db.lmdb/` | +| [Sqlite](https://sqlite.org) (since `v0.8.0`) | `"sqlite"` | `<metadata_dir>/db.sqlite` | +| [Sled](https://sled.rs) (old default, removed since `v1.0`) | `"sled"` | `<metadata_dir>/db/` | -Sled was the only database engine up to Garage v0.7.0. Performance issues and -API limitations of Sled prompted the addition of alternative engines in v0.8.0. -Since v0.9.0, LMDB is the default engine instead of Sled, and Sled is -deprecated. We plan to remove Sled in Garage v1.0. +Sled was supported until Garage v0.9.x, and was removed in Garage v1.0. +You can still use an older binary of Garage (e.g. v0.9.4) to migrate +old Sled metadata databases to another engine. Performance characteristics of the different DB engines are as follows: -- Sled: tends to produce large data files and also has performance issues, - especially when the metadata folder is on a traditional HDD and not on SSD. - -- LMDB: the recommended database engine for high-performance distributed - clusters, much more space-efficient and significantly faster. LMDB works very - well, but is known to have the following limitations: +- LMDB: the recommended database engine for high-performance distributed clusters. +LMDB works very well, but is known to have the following limitations: - The data format of LMDB is not portable between architectures, so for instance the Garage database of an x86-64 node cannot be moved to an ARM64 @@ -310,6 +322,9 @@ Performance characteristics of the different DB engines are as follows: other nodes), or if you have saved regular snapshots at the filesystem level. + - Keys in LMDB are limited to 511 bytes. This limit translates to limits on + object keys in S3 and sort keys in K2V that are limted to 479 bytes. + - Sqlite: Garage supports Sqlite as an alternative storage backend for metadata, which does not have the issues listed above for LMDB. On versions 0.8.x and earlier, Sqlite should be avoided due to abysmal @@ -353,7 +368,6 @@ Here is how this option impacts the different database engines: | Database | `metadata_fsync = false` (default) | `metadata_fsync = true` | |----------|------------------------------------|-------------------------------| -| Sled | default options | *unsupported* | | Sqlite | `PRAGMA synchronous = OFF` | `PRAGMA synchronous = NORMAL` | | LMDB | `MDB_NOMETASYNC` + `MDB_NOSYNC` | `MDB_NOMETASYNC` | @@ -455,21 +469,6 @@ node. The default value is 256MiB. -#### `sled_cache_capacity` {#sled_cache_capacity} - -This parameter can be used to tune the capacity of the cache used by -[sled](https://sled.rs), the database Garage uses internally to store metadata. -Tune this to fit the RAM you wish to make available to your Garage instance. -This value has a conservative default (128MB) so that Garage doesn't use too much -RAM by default, but feel free to increase this for higher performance. - -#### `sled_flush_every_ms` {#sled_flush_every_ms} - -This parameters can be used to tune the flushing interval of sled. -Increase this if sled is thrashing your SSD, at the risk of losing more data in case -of a power outage (though this should not matter much as data is replicated on other -nodes). The default value, 2000ms, should be appropriate for most use cases. - #### `lmdb_map_size` {#lmdb_map_size} This parameters can be used to set the map size used by LMDB, diff --git a/doc/book/reference-manual/features.md b/doc/book/reference-manual/features.md index f7014b26..34f692cc 100644 --- a/doc/book/reference-manual/features.md +++ b/doc/book/reference-manual/features.md @@ -39,10 +39,10 @@ Read about cluster layout management [here](@/documentation/operations/layout.md ### Several replication modes -Garage supports a variety of replication modes, with 1 copy, 2 copies or 3 copies of your data, +Garage supports a variety of replication modes, with configurable replica count, and with various levels of consistency, in order to adapt to a variety of usage scenarios. -Read our reference page on [supported replication modes](@/documentation/reference-manual/configuration.md#replication_mode) -to select the replication mode best suited to your use case (hint: in most cases, `replication_mode = "3"` is what you want). +Read our reference page on [supported replication modes](@/documentation/reference-manual/configuration.md#replication_factor) +to select the replication mode best suited to your use case (hint: in most cases, `replication_factor = 3` is what you want). ### Compression and deduplication diff --git a/doc/book/reference-manual/s3-compatibility.md b/doc/book/reference-manual/s3-compatibility.md index 1bcfd123..d2c47f3e 100644 --- a/doc/book/reference-manual/s3-compatibility.md +++ b/doc/book/reference-manual/s3-compatibility.md @@ -33,6 +33,7 @@ Feel free to open a PR to suggest fixes this table. Minio is missing because the | [URL path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) (eg. `host.tld/bucket/key`) | ✅ Implemented | ✅ | ✅ | ❓| ✅ | | [URL vhost-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#virtual-hosted-style-access) URL (eg. `bucket.host.tld/key`) | ✅ Implemented | ❌| ✅| ✅ | ✅ | | [Presigned URLs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html) | ✅ Implemented | ❌| ✅ | ✅ | ✅(❓) | +| [SSE-C encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html) | ✅ Implemented | ❓ | ✅ | ❌ | ✅ | *Note:* OpenIO does not says if it supports presigned URLs. Because it is part of signature v4 and they claim they support it without additional precisions, diff --git a/doc/book/working-documents/migration-1.md b/doc/book/working-documents/migration-1.md new file mode 100644 index 00000000..2fb14ef9 --- /dev/null +++ b/doc/book/working-documents/migration-1.md @@ -0,0 +1,77 @@ ++++ +title = "Migrating from 0.9 to 1.0" +weight = 11 ++++ + +**This guide explains how to migrate to 1.0 if you have an existing 0.9 cluster. +We don't recommend trying to migrate to 1.0 directly from 0.8 or older.** + +This migration procedure has been tested on several clusters without issues. +However, it is still a *critical procedure* that might cause issues. +**Make sure to back up all your data before attempting it!** + +You might also want to read our [general documentation on upgrading Garage](@/documentation/operations/upgrading.md). + +## Changes introduced in v1.0 + +The following are **breaking changes** in Garage v1.0 that require your attention when migrating: + +- The Sled metadata db engine has been **removed**. If your cluster was still + using Sled, you will need to **use a Garage v0.9.x binary** to convert the + database using the `garage convert-db` subcommand. See + [here](@/documentation/reference-manual/configuration/#db_engine) for the + details of the procedure. + +The following syntax changes have been made to the configuration file: + +- The `replication_mode` parameter has been split into two parameters: + [`replication_factor`](@/documentation/reference-manual/configuration/#replication_factor) + and + [`consistency_mode`](@/documentation/reference-manual/configuration/#consistency_mode). + The old syntax using `replication_mode` is still supported for legacy + reasons and can still be used. + +- The parameters `sled_cache_capacity` and `sled_flush_every_ms` have been removed. + +## Migration procedure + +The migration to Garage v1.0 can be done with almost no downtime, +by restarting all nodes at once in the new version. + +The migration steps are as follows: + +1. Do a `garage repair --all-nodes --yes tables`, check the logs and check that + all data seems to be synced correctly between nodes. If you have time, do + additional `garage repair` procedures (`blocks`, `versions`, `block_refs`, + etc.) + +2. Ensure you have a snapshot of your Garage installation that you can restore + to in case the upgrade goes wrong: + + - If you are running Garage v0.9.4 or later, use the `garage meta snapshot + --all` to make a backup snapshot of the metadata directories of your nodes + for backup purposes, and save a copy of the following files in the + metadata directories of your nodes: `cluster_layout`, `data_layout`, + `node_key`, `node_key.pub`. + + - If you are running a filesystem such as ZFS or BTRFS that support + snapshotting, you can create a filesystem-level snapshot to be used as a + restoration point if needed. + + - In other cases, make a backup using the old procedure: turn off each node + individually; back up its metadata folder (for instance, use the following + command if your metadata directory is `/var/lib/garage/meta`: `cd + /var/lib/garage ; tar -acf meta-v0.9.tar.zst meta/`); turn it back on + again. This will allow you to take a backup of all nodes without + impacting global cluster availability. You can do all nodes of a single + zone at once as this does not impact the availability of Garage. + +3. Prepare your updated binaries and configuration files for Garage v1.0 + +4. Shut down all v0.9 nodes simultaneously, and restart them all simultaneously + in v1.0. Use your favorite deployment tool (Ansible, Kubernetes, Nomad) to + achieve this as fast as possible. Garage v1.0 should be in a working state + as soon as enough nodes have started. + +5. Monitor your cluster in the following hours to see if it works well under + your production load. diff --git a/doc/drafts/admin-api.md b/doc/drafts/admin-api.md index ce56d8e0..16338194 100644 --- a/doc/drafts/admin-api.md +++ b/doc/drafts/admin-api.md @@ -69,11 +69,10 @@ Example response body: ```json { - "node": "ec79480e0ce52ae26fd00c9da684e4fa56658d9c64cdcecb094e936de0bfe71f", - "garageVersion": "git:v0.9.0-dev", + "node": "b10c110e4e854e5aa3f4637681befac755154b20059ec163254ddbfae86b09df", + "garageVersion": "v1.0.0", "garageFeatures": [ "k2v", - "sled", "lmdb", "sqlite", "metrics", @@ -81,83 +80,92 @@ Example response body: ], "rustVersion": "1.68.0", "dbEngine": "LMDB (using Heed crate)", - "knownNodes": [ + "layoutVersion": 5, + "nodes": [ { - "id": "ec79480e0ce52ae26fd00c9da684e4fa56658d9c64cdcecb094e936de0bfe71f", - "addr": "10.0.0.11:3901", + "id": "62b218d848e86a64f7fe1909735f29a4350547b54c4b204f91246a14eb0a1a8c", + "role": { + "id": "62b218d848e86a64f7fe1909735f29a4350547b54c4b204f91246a14eb0a1a8c", + "zone": "dc1", + "capacity": 100000000000, + "tags": [] + }, + "addr": "10.0.0.3:3901", + "hostname": "node3", "isUp": true, - "lastSeenSecsAgo": 9, - "hostname": "node1" + "lastSeenSecsAgo": 12, + "draining": false, + "dataPartition": { + "available": 660270088192, + "total": 873862266880 + }, + "metadataPartition": { + "available": 660270088192, + "total": 873862266880 + } }, { - "id": "4a6ae5a1d0d33bf895f5bb4f0a418b7dc94c47c0dd2eb108d1158f3c8f60b0ff", - "addr": "10.0.0.12:3901", + "id": "a11c7cf18af297379eff8688360155fe68d9061654449ba0ce239252f5a7487f", + "role": null, + "addr": "10.0.0.2:3901", + "hostname": "node2", "isUp": true, - "lastSeenSecsAgo": 1, - "hostname": "node2" + "lastSeenSecsAgo": 11, + "draining": true, + "dataPartition": { + "available": 660270088192, + "total": 873862266880 + }, + "metadataPartition": { + "available": 660270088192, + "total": 873862266880 + } }, { - "id": "23ffd0cdd375ebff573b20cc5cef38996b51c1a7d6dbcf2c6e619876e507cf27", - "addr": "10.0.0.21:3901", + "id": "a235ac7695e0c54d7b403943025f57504d500fdcc5c3e42c71c5212faca040a2", + "role": { + "id": "a235ac7695e0c54d7b403943025f57504d500fdcc5c3e42c71c5212faca040a2", + "zone": "dc1", + "capacity": 100000000000, + "tags": [] + }, + "addr": "127.0.0.1:3904", + "hostname": "lindy", "isUp": true, - "lastSeenSecsAgo": 7, - "hostname": "node3" + "lastSeenSecsAgo": 2, + "draining": false, + "dataPartition": { + "available": 660270088192, + "total": 873862266880 + }, + "metadataPartition": { + "available": 660270088192, + "total": 873862266880 + } }, { - "id": "e2ee7984ee65b260682086ec70026165903c86e601a4a5a501c1900afe28d84b", - "addr": "10.0.0.22:3901", - "isUp": true, - "lastSeenSecsAgo": 1, - "hostname": "node4" - } - ], - "layout": { - "version": 12, - "roles": [ - { - "id": "ec79480e0ce52ae26fd00c9da684e4fa56658d9c64cdcecb094e936de0bfe71f", + "id": "b10c110e4e854e5aa3f4637681befac755154b20059ec163254ddbfae86b09df", + "role": { + "id": "b10c110e4e854e5aa3f4637681befac755154b20059ec163254ddbfae86b09df", "zone": "dc1", - "capacity": 10737418240, - "tags": [ - "node1" - ] + "capacity": 100000000000, + "tags": [] }, - { - "id": "4a6ae5a1d0d33bf895f5bb4f0a418b7dc94c47c0dd2eb108d1158f3c8f60b0ff", - "zone": "dc1", - "capacity": 10737418240, - "tags": [ - "node2" - ] + "addr": "10.0.0.1:3901", + "hostname": "node1", + "isUp": true, + "lastSeenSecsAgo": 3, + "draining": false, + "dataPartition": { + "available": 660270088192, + "total": 873862266880 }, - { - "id": "23ffd0cdd375ebff573b20cc5cef38996b51c1a7d6dbcf2c6e619876e507cf27", - "zone": "dc2", - "capacity": 10737418240, - "tags": [ - "node3" - ] + "metadataPartition": { + "available": 660270088192, + "total": 873862266880 } - ], - "stagedRoleChanges": [ - { - "id": "e2ee7984ee65b260682086ec70026165903c86e601a4a5a501c1900afe28d84b", - "remove": false, - "zone": "dc2", - "capacity": 10737418240, - "tags": [ - "node4" - ] - } - { - "id": "23ffd0cdd375ebff573b20cc5cef38996b51c1a7d6dbcf2c6e619876e507cf27", - "remove": true, - "zone": null, - "capacity": null, - "tags": null, - } - ] - } + } + ] } ``` diff --git a/doc/drafts/k2v-spec.md b/doc/drafts/k2v-spec.md index faa1a247..3956fa31 100644 --- a/doc/drafts/k2v-spec.md +++ b/doc/drafts/k2v-spec.md @@ -146,7 +146,7 @@ in a bucket, as the partition key becomes the sort key in the index. How indexing works: - Each node keeps a local count of how many items it stores for each partition, - in a local Sled tree that is updated atomically when an item is modified. + in a local database tree that is updated atomically when an item is modified. - These local counters are asynchronously stored in the index table which is a regular Garage table spread in the network. Counters are stored as LWW values, so basically the final table will have the following structure: |