author    | Alex Auvolat <alex@adnab.me> | 2024-03-18 20:17:54 +0100
committer | Alex Auvolat <alex@adnab.me> | 2024-03-18 20:19:30 +0100
commit    | 0038ca8a78f147b9c0ec07ef0121773aaf110dc9 (patch)
tree      | 43f39f30c63a6affa62eeea62cfec674f217c2b4 /doc/book
parent    | 81191d2d92e58ff82ace0f4d82b275c157673ade (diff)
parent    | 1a0bffae3491fae6af5a8d4defc5c6b84839e197 (diff)
download  | garage-0038ca8a78f147b9c0ec07ef0121773aaf110dc9.tar.gz, garage-0038ca8a78f147b9c0ec07ef0121773aaf110dc9.zip
Merge branch 'main' into next-0.10
Diffstat (limited to 'doc/book')
-rw-r--r-- | doc/book/cookbook/from-source.md          |  2
-rw-r--r-- | doc/book/cookbook/real-world.md            | 38
-rw-r--r-- | doc/book/operations/durability-repairs.md  | 20
-rw-r--r-- | doc/book/operations/recovering.md          | 54
-rw-r--r-- | doc/book/operations/upgrading.md           | 12
-rw-r--r-- | doc/book/quick-start/_index.md             |  2
-rw-r--r-- | doc/book/reference-manual/configuration.md | 76
7 files changed, 177 insertions, 27 deletions
diff --git a/doc/book/cookbook/from-source.md b/doc/book/cookbook/from-source.md
index f7fd17ce..f0e185a4 100644
--- a/doc/book/cookbook/from-source.md
+++ b/doc/book/cookbook/from-source.md
@@ -91,4 +91,4 @@ The following feature flags are available in v0.8.0:
 | `metrics` | *by default* | Enable collection of metrics in Prometheus format on the admin API |
 | `telemetry-otlp` | optional | Enable collection of execution traces using OpenTelemetry |
 | `lmdb` | *by default* | Enable using LMDB to store Garage's metadata |
-| `sqlite` | optional | Enable using Sqlite3 to store Garage's metadata |
+| `sqlite` | *by default* | Enable using Sqlite3 to store Garage's metadata |
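The change above only affects which feature flags are compiled in by default; a source build can still pick its own set. A rough sketch of such a build, where the feature names come from the table above but the `-p garage` package name and the use of `--no-default-features` are assumptions about the workspace layout rather than something stated on this page:

```bash
# Hypothetical source build with an explicit feature set; feature names are
# taken from the table above, the rest of the invocation is an assumption.
cargo build --release -p garage \
    --no-default-features \
    --features "lmdb,sqlite,metrics"
```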
diff --git a/doc/book/cookbook/real-world.md b/doc/book/cookbook/real-world.md
index 30be4907..cd42bb0c 100644
--- a/doc/book/cookbook/real-world.md
+++ b/doc/book/cookbook/real-world.md
@@ -27,7 +27,7 @@ To run a real-world deployment, make sure the following conditions are met:
   [Yggdrasil](https://yggdrasil-network.github.io/) are approaches to consider
   in addition to building out your own VPN tunneling.
 
-- This guide will assume you are using Docker containers to deploy Garage on each node.
+- This guide will assume you are using Docker containers to deploy Garage on each node.
   Garage can also be run independently, for instance as a
   [Systemd service](@/documentation/cookbook/systemd.md). You can also use an orchestrator
   such as Nomad or Kubernetes to automatically manage Docker containers on a fleet of nodes.
@@ -53,9 +53,9 @@ to store 2 TB of data in total.
 
 ### Best practices
 
-- If you have fast dedicated networking between all your nodes, and are planing to store
-  very large files, bump the `block_size` configuration parameter to 10 MB
-  (`block_size = 10485760`).
+- If you have reasonably fast networking between all your nodes, and are planing to store
+  mostly large files, bump the `block_size` configuration parameter to 10 MB
+  (`block_size = "10M"`).
 
 - Garage stores its files in two locations: it uses a metadata directory to store
   frequently-accessed small metadata items, and a data directory to store data blocks
   of uploaded objects.
@@ -73,14 +73,25 @@ to store 2 TB of data in total.
   help a lot with performance. The default LMDB database engine is the most tested
   and has good performance.
 
-- For the metadata storage, Garage does not do checksumming and integrity
-  verification on its own. If you are afraid of bitrot/data corruption,
-  put your metadata directory on a ZFS or BTRFS partition. Otherwise, just use regular
-  EXT4 or XFS.
-
 - Servers with multiple HDDs are supported natively by Garage without resorting
   to RAID, see [our dedicated documentation page](@/documentation/operations/multi-hdd.md).
 
+- For the metadata storage, Garage does not do checksumming and integrity
+  verification on its own, so it is better to use a robust filesystem such as
+  BTRFS or ZFS. Users have reported that when using the LMDB database engine
+  (the default), database files have a tendency of becoming corrupted after an
+  unclean shutdown (e.g. a power outage), so you should take regular snapshots
+  to be able to recover from such a situation. This can be done using Garage's
+  built-in automatic snapshotting (since v0.9.4), or by using filesystem level
+  snapshots. If you cannot do so, you might want to switch to Sqlite which is
+  more robust.
+
+- LMDB is the fastest and most tested database engine, but it has the following
+  weaknesses: 1/ data files are not architecture-independent, you cannot simply
+  move a Garage metadata directory between nodes running different architectures,
+  and 2/ LMDB is not suited for 32-bit platforms. Sqlite is a viable alternative
+  if any of these are of concern.
+
 ## Get a Docker image
 
 Our docker image is currently named `dxflrs/garage` and is stored on the
 [Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated).
@@ -114,6 +125,7 @@ A valid `/etc/garage.toml` for our cluster would look as follows:
 metadata_dir = "/var/lib/garage/meta"
 data_dir = "/var/lib/garage/data"
 db_engine = "lmdb"
+metadata_auto_snapshot_interval = "6h"
 
 replication_factor = 3
 
@@ -186,7 +198,7 @@ upgrades. With the containerized setup proposed here, the upgrade process will
 require stopping and removing the existing container, and re-creating it with
 the upgraded version.
 
-## Controling the daemon
+## Controlling the daemon
 
 The `garage` binary has two purposes:
   - it acts as a daemon when launched with `garage server`
@@ -244,7 +256,7 @@ You can then instruct nodes to connect to one another as follows:
 Venus$ garage node connect 563e1ac825ee3323aa441e72c26d1030d6d4414aeb3dd25287c531e7fc2bc95d@[fc00:1::1]:3901
 ```
 
-You don't nead to instruct all node to connect to all other nodes:
+You don't need to instruct all node to connect to all other nodes:
 nodes will discover one another transitively.
 
 Now if your run `garage status` on any node, you should have an output that looks as follows:
@@ -327,8 +339,8 @@ Given the information above, we will configure our cluster as follow:
 ```bash
 garage layout assign 563e -z par1 -c 1T -t mercury
 garage layout assign 86f0 -z par1 -c 2T -t venus
-garage layout assign 6814 -z lon1 -c 2T -t earth
-garage layout assign 212f -z bru1 -c 1.5T -t mars
+garage layout assign 6814 -z lon1 -c 2T -t earth
+garage layout assign 212f -z bru1 -c 1.5T -t mars
 ```
 
 At this point, the changes in the cluster layout have not yet been applied.
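Taken together, the best-practice bullets and the configuration hunks above suggest a configuration fragment along the following lines. This is only a sketch of the part of `/etc/garage.toml` touched by this commit, reusing the paths and values shown in the diff; mandatory options not changed here (RPC secret, S3 API section, ...) are left out:

```bash
# Sketch: the /etc/garage.toml fragment discussed in the best-practices section.
# Paths and values are the ones used in the documentation above; the rest of
# the file (rpc_secret, s3_api, ...) still has to be provided.
cat > /etc/garage.toml <<EOF
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"
db_engine = "lmdb"

# larger blocks for mostly-large objects on reasonably fast networks
block_size = "10M"

# built-in protection against metadata corruption after an unclean shutdown
metadata_auto_snapshot_interval = "6h"

replication_factor = 3
EOF
```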
diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md
index 578899a8..c76dc39e 100644
--- a/doc/book/operations/durability-repairs.md
+++ b/doc/book/operations/durability-repairs.md
@@ -19,7 +19,7 @@ connecting to. To run on all nodes, add the `-a` flag as follows:
 
 # Data block operations
 
-## Data store scrub
+## Data store scrub {#scrub}
 
 Scrubbing the data store means examining each individual data block to check that
 their content is correct, by verifying their hash. Any block found to be corrupted
@@ -104,6 +104,24 @@ operation will also move out all data from locations marked as read-only.
 
 # Metadata operations
 
+## Metadata snapshotting
+
+It is good practice to setup automatic snapshotting of your metadata database
+file, to recover from situations where it becomes corrupted on disk. This can
+be done at the filesystem level if you are using ZFS or BTRFS.
+
+Since Garage v0.9.4, Garage is able to take snapshots of the metadata database
+itself. This basically amounts to copying the database file, except that it can
+be run live while Garage is running without the risk of corruption or
+inconsistencies. This can be setup to run automatically on a schedule using
+[`metadata_auto_snapshot_interval`](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval).
+A snapshot can also be triggered manually using the `garage meta snapshot`
+command. Note that taking a snapshot using this method is very intensive as it
+requires making a full copy of the database file, so you might prefer using
+filesystem-level snapshots if possible. To recover a corrupted node from such a
+snapshot, read the instructions
+[here](@/documentation/operations/recovering.md#corrupted_meta).
+
 ## Metadata table resync
 
 Garage automatically resyncs all entries stored in the metadata tables every hour,
diff --git a/doc/book/operations/recovering.md b/doc/book/operations/recovering.md
index 7a830788..6e19db0e 100644
--- a/doc/book/operations/recovering.md
+++ b/doc/book/operations/recovering.md
@@ -108,3 +108,57 @@ garage layout apply # once satisfied, apply the changes
 
 Garage will then start synchronizing all required data on the new node.
 This process can be monitored using the `garage stats -a` command.
+
+## Replacement scenario 3: corrupted metadata {#corrupted_meta}
+
+In some cases, your metadata DB file might become corrupted, for instance if
+your node suffered a power outage and did not shut down properly. In this case,
+you can recover without having to change the node ID and rebuilding a cluster
+layout. This means that data blocks will not need to be shuffled around, you
+must simply find a way to repair the metadata file. The best way is generally
+to discard the corrupted file and recover it from another source.
+
+First of all, start by locating the database file in your metadata directory,
+which [depends on your `db_engine`
+choice](@/documentation/reference-manual/configuration.md#db_engine). Then,
+your recovery options are as follows:
+
+- **Option 1: resyncing from other nodes.** In case your cluster is replicated
+  with two or three copies, you can simply delete the database file, and Garage
+  will resync from other nodes. To do so, stop Garage, delete the database file
+  or directory, and restart Garage. Then, do a full table repair by calling
+  `garage repair -a --yes tables`. This will take a bit of time to complete as
+  the new node will need to receive copies of the metadata tables from the
+  network.
+
+- **Option 2: restoring a snapshot taken by Garage.** Since v0.9.4, Garage can
+  [automatically take regular
+  snapshots](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval)
+  of your metadata DB file. This file or directory should be located under
+  `<metadata_dir>/snapshots`, and is named according to the UTC time at which it
+  was taken. Stop Garage, discard the database file/directory and replace it by the
+  snapshot you want to use. For instance, in the case of LMDB:
+
+  ```bash
+  cd $METADATA_DIR
+  mv db.lmdb db.lmdb.bak
+  cp -r snapshots/2024-03-15T12:13:52Z db.lmdb
+  ```
+
+  And for Sqlite:
+
+  ```bash
+  cd $METADATA_DIR
+  mv db.sqlite db.sqlite.bak
+  cp snapshots/2024-03-15T12:13:52Z db.sqlite
+  ```
+
+  Then, restart Garage and run a full table repair by calling `garage repair -a
+  --yes tables`. This should run relatively fast as only the changes that
+  occurred since the snapshot was taken will need to be resynchronized. Of
+  course, if your cluster is not replicated, you will lose all changes that
+  occurred since the snapshot was taken.
+
+- **Option 3: restoring a filesystem-level snapshot.** If you are using ZFS or
+  BTRFS to snapshot your metadata partition, refer to their specific
+  documentation on rolling back or copying files from an old snapshot.
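Both snapshotting strategies mentioned in these two files can be scripted. The sketch below assumes the metadata directory sits on a dedicated ZFS dataset or BTRFS subvolume; the dataset name and destination paths are placeholders, and only the `garage meta snapshot` command itself is taken from the documentation above:

```bash
# Built-in snapshot (Garage >= v0.9.4): copies the metadata DB into
# <metadata_dir>/snapshots while the daemon keeps running.
garage meta snapshot

# Filesystem-level alternatives (copy-on-write, much cheaper than a full copy).
# Dataset name and paths below are placeholders, adjust to your own layout.
STAMP=$(date -u +%Y%m%dT%H%M%SZ)
zfs snapshot tank/garage-meta@"$STAMP"                 # ZFS
mkdir -p /var/lib/garage/meta-snapshots
btrfs subvolume snapshot -r /var/lib/garage/meta \
    /var/lib/garage/meta-snapshots/"$STAMP"            # BTRFS
```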
diff --git a/doc/book/operations/upgrading.md b/doc/book/operations/upgrading.md
index 6b6ea26d..c239bfe4 100644
--- a/doc/book/operations/upgrading.md
+++ b/doc/book/operations/upgrading.md
@@ -73,6 +73,18 @@ The entire procedure would look something like this:
    You can do all of the nodes in a single zone at once as that won't
    impact global cluster availability. Do not try to make a backup of the
    metadata folder of a running node.
 
+   **Since Garage v0.9.4,** you can use the `garage meta snapshot --all` command
+   to take a simultaneous snapshot of the metadata database files of all your
+   nodes. This avoids the tedious process of having to take them down one by
+   one before upgrading. Be careful that if automatic snapshotting is enabled,
+   Garage only keeps the last two snapshots and deletes older ones, so you might
+   want to disable automatic snapshotting in your upgraded configuration file
+   until you have confirmed that the upgrade ran successfully. In addition to
+   snapshotting the metadata databases of your nodes, you should back-up at
+   least the `cluster_layout` file of one of your Garage instances (this file
+   should be the same on all nodes and you can copy it safely while Garage is
+   running).
+
 3. Prepare your binaries and configuration files for the new Garage version
 
 4. Restart all nodes simultaneously in the new version
diff --git a/doc/book/quick-start/_index.md b/doc/book/quick-start/_index.md
index be9fe329..9619f388 100644
--- a/doc/book/quick-start/_index.md
+++ b/doc/book/quick-start/_index.md
@@ -57,7 +57,7 @@ to generate unique and private secrets for security reasons:
 cat > garage.toml <<EOF
 metadata_dir = "/tmp/meta"
 data_dir = "/tmp/data"
-db_engine = "lmdb"
+db_engine = "sqlite"
 
 replication_factor = 1
 
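The pre-upgrade backup step added above comes down to two commands. In this sketch, `garage meta snapshot --all` is quoted from the documentation, while the `cluster_layout` source path is an assumption (the file sits in the metadata directory on a typical setup); adjust it to your own `metadata_dir`:

```bash
# Snapshot the metadata DB of every node in one go (Garage >= v0.9.4).
garage meta snapshot --all

# Keep a copy of the cluster layout from one node; the source path assumes the
# default metadata directory used elsewhere in this documentation.
cp /var/lib/garage/meta/cluster_layout /root/cluster_layout.pre-upgrade
```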
diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md
index 4df2d0df..a21f945b 100644
--- a/doc/book/reference-manual/configuration.md
+++ b/doc/book/reference-manual/configuration.md
@@ -15,6 +15,8 @@ metadata_dir = "/var/lib/garage/meta"
 data_dir = "/var/lib/garage/data"
 metadata_fsync = true
 data_fsync = false
+disable_scrub = false
+metadata_auto_snapshot_interval = "6h"
 
 db_engine = "lmdb"
 
@@ -86,7 +88,9 @@ Top-level configuration options:
 [`data_dir`](#data_dir),
 [`data_fsync`](#data_fsync),
 [`db_engine`](#db_engine),
+[`disable_scrub`](#disable_scrub),
 [`lmdb_map_size`](#lmdb_map_size),
+[`metadata_auto_snapshot_interval`](#metadata_auto_snapshot_interval),
 [`metadata_dir`](#metadata_dir),
 [`metadata_fsync`](#metadata_fsync),
 [`replication_factor`](#replication_factor),
@@ -277,18 +281,33 @@ old Sled metadata databases to another engine.
 
 Performance characteristics of the different DB engines are as follows:
 
-- LMDB: the recommended database engine on 64-bit systems, much more
-  space-efficient and slightly faster. Note that the data format of LMDB is not
-  portable between architectures, so for instance the Garage database of an
-  x86-64 node cannot be moved to an ARM64 node. Also note that, while LMDB can
-  technically be used on 32-bit systems, this will limit your node to very
-  small database sizes due to how LMDB works; it is therefore not recommended.
+- LMDB: the recommended database engine for high-performance distributed clusters.
+LMDB works very well, but is known to have the following limitations:
+
+  - The data format of LMDB is not portable between architectures, so for
+    instance the Garage database of an x86-64 node cannot be moved to an ARM64
+    node.
+
+  - While LMDB can technically be used on 32-bit systems, this will limit your
+    node to very small database sizes due to how LMDB works; it is therefore
+    not recommended.
+
+  - Several users have reported corrupted LMDB database files after an unclean
+    shutdown (e.g. a power outage). This situation can generally be recovered
+    from if your cluster is geo-replicated (by rebuilding your metadata db from
+    other nodes), or if you have saved regular snapshots at the filesystem
+    level.
+
+  - Keys in LMDB are limited to 511 bytes. This limit translates to limits on
+    object keys in S3 and sort keys in K2V that are limted to 479 bytes.
 
 - Sqlite: Garage supports Sqlite as an alternative storage backend for
-  metadata, and although it has not been tested as much, it is expected to work
-  satisfactorily. Since Garage v0.9.0, performance issues have largely been
-  fixed by allowing for a no-fsync mode (see `metadata_fsync`). Sqlite does not
-  have the database size limitation of LMDB on 32-bit systems.
+  metadata, which does not have the issues listed above for LMDB.
+  On versions 0.8.x and earlier, Sqlite should be avoided due to abysmal
+  performance, which was fixed with the addition of `metadata_fsync`.
+  Sqlite is still probably slower than LMDB due to the way we use it,
+  so it is not the best choice for high-performance storage clusters,
+  but it should work fine in many cases.
 
 It is possible to convert Garage's metadata directory from one format to another
 using the `garage convert-db` command, which should be used as follows:
@@ -315,7 +334,7 @@ Using this option reduces the risk of simultaneous metadata corruption on severa
 cluster nodes, which could lead to data loss.
 
 If multi-site replication is used, this option is most likely not necessary, as
-it is extremely unlikely that two nodes in different locations will have a
+it is extremely unlikely that two nodes in different locations will have a
 power failure at the exact same time.
 
 (Metadata corruption on a single node is not an issue, the corrupted data file
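For a node that runs into one of the LMDB limitations listed above (mixed CPU architectures, 32-bit platforms), the Sqlite alternative is a small configuration change. A sketch in the style of the quick-start example, using only options documented on this page; converting an existing metadata directory additionally requires `garage convert-db`, whose exact invocation is not reproduced here:

```bash
# Sketch: metadata section of a Sqlite-based node; paths are the illustrative
# ones used throughout this documentation.
cat > /etc/garage.toml <<EOF
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"
db_engine = "sqlite"
metadata_fsync = true
EOF
```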
@@ -343,6 +362,41 @@ at the cost of a moderate drop in write performance.
 Similarly to `metatada_fsync`, this is likely not necessary if geographical
 replication is used.
 
+#### `metadata_auto_snapshot_interval` (since Garage v0.9.4) {#metadata_auto_snapshot_interval}
+
+If this value is set, Garage will automatically take a snapshot of the metadata
+DB file at a regular interval and save it in the metadata directory.
+This can allow to recover from situations where the metadata DB file is corrupted,
+for instance after an unclean shutdown.
+See [this page](@/documentation/operations/recovering.md#corrupted_meta) for details.
+
+Garage keeps only the two most recent snapshots of the metadata DB and deletes
+older ones automatically.
+
+Note that taking a metadata snapshot is a relatively intensive operation as the
+entire data file is copied. A snapshot being taken might have performance
+impacts on the Garage node while it is running. If the cluster is under heavy
+write load when a snapshot operation is running, this might also cause the
+database file to grow in size significantly as pages cannot be recycled easily.
+For this reason, it might be better to use filesystem-level snapshots instead
+if possible.
+
+#### `disable_scrub` {#disable_scrub}
+
+By default, Garage runs a scrub of the data directory approximately once per
+month, with a random delay to avoid all nodes running at the same time. When
+it scrubs the data directory, Garage will read all of the data files stored on
+disk to check their integrity, and will rebuild any data files that it finds
+corrupted, using the remaining valid copies stored on other nodes.
+See [this page](@/documentation/operations/durability-repair.md#scrub) for details.
+
+Set the `disable_scrub` configuration value to `true` if you don't need Garage
+to scrub the data directory, for instance if you are already scrubbing at the
+filesystem level. Note that in this case, if you find a corrupted data file,
+you should delete it from the data directory and then call `garage repair
+blocks` on the node to ensure that it re-obtains a copy from another node on
+the network.
+
 #### `block_size` {#block_size}
 
 Garage splits stored objects in consecutive chunks of size `block_size`
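If scrubbing is delegated to the filesystem as described for `disable_scrub`, the manual recovery path for a block that the filesystem reports as corrupted is short. A sketch, where the corrupted file path is whatever your filesystem-level scrub identified:

```bash
# Sketch: manual handling of a corrupted data file with disable_scrub = true.
# CORRUPTED_BLOCK_FILE is the path reported by your filesystem-level scrub.
rm "$CORRUPTED_BLOCK_FILE"
# Have the node re-obtain valid copies of missing blocks from other nodes.
garage repair blocks
```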