From a7dddebedd59cfb00955b8648aa0b022581472ba Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Thu, 14 Mar 2024 16:29:42 +0100 Subject: [doc-updates] doc: be slightly more critical of LMDB --- doc/book/cookbook/real-world.md | 47 ++++++++++++++++++------------ doc/book/quick-start/_index.md | 2 +- doc/book/reference-manual/configuration.md | 35 +++++++++++++++------- 3 files changed, 53 insertions(+), 31 deletions(-) (limited to 'doc') diff --git a/doc/book/cookbook/real-world.md b/doc/book/cookbook/real-world.md index c15ea384..15a58b9b 100644 --- a/doc/book/cookbook/real-world.md +++ b/doc/book/cookbook/real-world.md @@ -27,7 +27,7 @@ To run a real-world deployment, make sure the following conditions are met:
 [Yggdrasil](https://yggdrasil-network.github.io/) are approaches to consider
 in addition to building out your own VPN tunneling.
-- This guide will assume you are using Docker containers to deploy Garage on each node. 
+- This guide will assume you are using Docker containers to deploy Garage on each node.
 Garage can also be run independently, for instance as a
 [Systemd service](@/documentation/cookbook/systemd.md). You can also use an orchestrator such as Nomad or Kubernetes to automatically manage
 Docker containers on a fleet of nodes.
@@ -53,9 +53,9 @@ to store 2 TB of data in total.
 ### Best practices
-- If you have fast dedicated networking between all your nodes, and are planing to store
- very large files, bump the `block_size` configuration parameter to 10 MB
- (`block_size = 10485760`).
+- If you have reasonably fast networking between all your nodes, and are planning to store
+ mostly large files, bump the `block_size` configuration parameter to 10 MB
+ (`block_size = "10M"`).
 - Garage stores its files in two locations: it uses a metadata directory to store
 frequently-accessed small metadata items, and a data directory to store data blocks of uploaded objects.
@@ -68,20 +68,29 @@ to store 2 TB of data in total.
 EXT4 is not recommended as it has more strict limitations on the number of inodes,
 which might cause issues with Garage when large numbers of objects are stored.
-- If you only have an HDD and no SSD, it's fine to put your metadata alongside the data
- on the same drive. Having lots of RAM for your kernel to cache the metadata will
- help a lot with performance. Make sure to use the LMDB database engine,
- instead of Sled, which suffers from quite bad performance degradation on HDDs.
- Sled is still the default for legacy reasons, but is not recommended anymore.
-
-- For the metadata storage, Garage does not do checksumming and integrity
- verification on its own. If you are afraid of bitrot/data corruption,
- put your metadata directory on a ZFS or BTRFS partition. Otherwise, just use regular
- EXT4 or XFS.
-
 - Servers with multiple HDDs are supported natively by Garage without resorting
 to RAID, see [our dedicated documentation page](@/documentation/operations/multi-hdd.md).
+- For the metadata storage, Garage does not do checksumming and integrity
+ verification on its own. Users have reported that when using the LMDB
+ database engine (the default), database files have a tendency of becoming
+ corrupted after an unclean shutdown (e.g. a power outage), so you should use
+ a robust filesystem such as BTRFS or ZFS for the metadata partition, and take
+ regular snapshots so that you can restore to a recent known-good state in
+ case of an incident. If you cannot do so, you might want to switch to Sqlite
+ which is more robust.
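To illustrate the advice in this bullet, here is a sketch of how the relevant keys from the configuration reference might be combined. The key names and values all appear elsewhere in this documentation; the comments about mount points describe only one possible disk layout, not a requirement:

```toml
# garage.toml excerpt: metadata on a small, snapshotted BTRFS/ZFS volume,
# data blocks on the large drive(s). Paths are illustrative.
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"

# LMDB is the default engine; Sqlite is the more robust alternative if you
# cannot take regular snapshots of the metadata directory.
db_engine = "lmdb"

# Built-in metadata snapshots (Garage v0.9.4 and later), as an alternative
# to filesystem-level snapshots.
metadata_auto_snapshot_interval = "6h"
```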
+
+- LMDB is the fastest and most tested database engine, but it has the following
+ weaknesses: 1/ data files are not architecture-independent, you cannot simply
+ move a Garage metadata directory between nodes running different architectures,
+ and 2/ LMDB is not suited for 32-bit platforms. Sqlite is a viable alternative
+ if any of these are of concern.
+
+- If you only have an HDD and no SSD, it's fine to put your metadata alongside
+ the data on the same drive, but then consider your filesystem choice wisely
+ (see above). Having lots of RAM for your kernel to cache the metadata will
+ help a lot with performance.
+
 ## Get a Docker image
 Our docker image is currently named `dxflrs/garage` and is stored on the [Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated).
@@ -187,7 +196,7 @@ upgrades.
 With the containerized setup proposed here, the upgrade process
 will require stopping and removing the existing container, and
 re-creating it with the upgraded version.
-## Controling the daemon
+## Controlling the daemon
 The `garage` binary has two purposes:
 - it acts as a daemon when launched with `garage server`
@@ -245,7 +254,7 @@ You can then instruct nodes to connect to one another as follows:
 Venus$ garage node connect 563e1ac825ee3323aa441e72c26d1030d6d4414aeb3dd25287c531e7fc2bc95d@[fc00:1::1]:3901
 ```
-You don't nead to instruct all node to connect to all other nodes:
+You don't need to instruct all nodes to connect to all other nodes:
 nodes will discover one another transitively.
 Now if your run `garage status` on any node, you should have an output that looks as follows:
@@ -328,8 +337,8 @@ Given the information above, we will configure our cluster as follow:
 ```bash
 garage layout assign 563e -z par1 -c 1T -t mercury
 garage layout assign 86f0 -z par1 -c 2T -t venus
-garage layout assign 6814 -z lon1 -c 2T -t earth
-garage layout assign 212f -z bru1 -c 1.5T -t mars
+garage layout assign 6814 -z lon1 -c 2T -t earth
+garage layout assign 212f -z bru1 -c 1.5T -t mars
 ```
 At this point, the changes in the cluster layout have not yet been applied.
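The staged layout only takes effect after an explicit confirmation. As a sketch, the two follow-up commands would typically look as follows, assuming the `garage layout show` and `garage layout apply --version` subcommands of recent Garage releases:

```bash
# Review the roles that have been staged, on any node
garage layout show

# Once satisfied, apply them; the version number is incremented
# for every layout change (1 for a brand-new cluster)
garage layout apply --version 1
```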
diff --git a/doc/book/quick-start/_index.md b/doc/book/quick-start/_index.md index f359843d..870bf9e9 100644 --- a/doc/book/quick-start/_index.md +++ b/doc/book/quick-start/_index.md @@ -57,7 +57,7 @@ to generate unique and private secrets for security reasons: cat > garage.toml < Date: Thu, 14 Mar 2024 16:35:15 +0100 Subject: [doc-updates] from source: fix default feature list --- doc/book/cookbook/from-source.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'doc') diff --git a/doc/book/cookbook/from-source.md b/doc/book/cookbook/from-source.md index bacf93ab..ade47e96 100644 --- a/doc/book/cookbook/from-source.md +++ b/doc/book/cookbook/from-source.md @@ -91,5 +91,5 @@ The following feature flags are available in v0.8.0: | `metrics` | *by default* | Enable collection of metrics in Prometheus format on the admin API | | `telemetry-otlp` | optional | Enable collection of execution traces using OpenTelemetry | | `sled` | *by default* | Enable using Sled to store Garage's metadata | -| `lmdb` | optional | Enable using LMDB to store Garage's metadata | -| `sqlite` | optional | Enable using Sqlite3 to store Garage's metadata | +| `lmdb` | *by default* | Enable using LMDB to store Garage's metadata | +| `sqlite` | *by default* | Enable using Sqlite3 to store Garage's metadata | -- cgit v1.2.3 From 990205dc3b2f552e9b68168c0aad3e96e2d6f2f0 Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Thu, 14 Mar 2024 16:57:44 +0100 Subject: [disable-scrub] document `disable_scrub` config option --- doc/book/operations/durability-repairs.md | 2 +- doc/book/reference-manual/configuration.md | 18 ++++++++++++++++++ 2 files changed, 19 insertions(+), 1 deletion(-) (limited to 'doc') diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md index 578899a8..f4450dae 100644 --- a/doc/book/operations/durability-repairs.md +++ b/doc/book/operations/durability-repairs.md @@ -19,7 +19,7 @@ connecting to. To run on all nodes, add the `-a` flag as follows: # Data block operations -## Data store scrub +## Data store scrub {#scrub} Scrubbing the data store means examining each individual data block to check that their content is correct, by verifying their hash. Any block found to be corrupted diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md index 81af1de0..8e87b7d8 100644 --- a/doc/book/reference-manual/configuration.md +++ b/doc/book/reference-manual/configuration.md @@ -14,6 +14,7 @@ metadata_dir = "/var/lib/garage/meta" data_dir = "/var/lib/garage/data" metadata_fsync = true data_fsync = false +disable_scrub = false db_engine = "lmdb" @@ -87,6 +88,7 @@ Top-level configuration options: [`data_dir`](#data_dir), [`data_fsync`](#data_fsync), [`db_engine`](#db_engine), +[`disable_scrub`](#disable_scrub), [`lmdb_map_size`](#lmdb_map_size), [`metadata_dir`](#metadata_dir), [`metadata_fsync`](#metadata_fsync), @@ -344,6 +346,22 @@ at the cost of a moderate drop in write performance. Similarly to `metatada_fsync`, this is likely not necessary if geographical replication is used. +#### `disable_scrub` {#disable_scrub} + +By default, Garage runs a scrub of the data directory approximately once per +month, with a random delay to avoid all nodes running at the same time. When +it scrubs the data directory, Garage will read all of the data files stored on +disk to check their integrity, and will rebuild any data files that it finds +corrupted, using the remaining valid copies stored on other nodes. 
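As a sketch, turning this periodic scrub off amounts to a single top-level key in `garage.toml`; whether you actually want it set is discussed just below:

```toml
# Skip Garage's own monthly scrub, for instance because the filesystem
# (ZFS or BTRFS) already scrubs the underlying data.
disable_scrub = true
```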
See [this page](@/documentation/operations/durability-repairs.md#scrub) for details.
+
+Set the `disable_scrub` configuration value to `true` if you don't need Garage
+to scrub the data directory, for instance if you are already scrubbing at the
+filesystem level. Note that in this case, if you find a corrupted data file,
+you should delete it from the data directory and then call `garage repair
+blocks` on the node to ensure that it re-obtains a copy from another node on
+the network.
+
 #### `block_size` {#block_size}
 Garage splits stored objects in consecutive chunks of size `block_size` -- cgit v1.2.3 From 8cf3d24875d41d79ab08d637cd38d2a5b9e527dd Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Fri, 15 Mar 2024 13:16:41 +0100 Subject: [db-snapshot] documentation for metadata db snapshots --- doc/book/cookbook/real-world.md | 16 +++++---- doc/book/operations/durability-repairs.md | 18 ++++++++++ doc/book/operations/recovering.md | 54 ++++++++++++++++++++++++++++++ doc/book/operations/upgrading.md | 12 +++++++ doc/book/reference-manual/configuration.md | 21 ++++++++++++ 5 files changed, 114 insertions(+), 7 deletions(-) (limited to 'doc') diff --git a/doc/book/cookbook/real-world.md b/doc/book/cookbook/real-world.md index 15a58b9b..9e226030 100644 --- a/doc/book/cookbook/real-world.md +++ b/doc/book/cookbook/real-world.md @@ -72,13 +72,14 @@ to store 2 TB of data in total.
 to RAID, see [our dedicated documentation page](@/documentation/operations/multi-hdd.md).
 - For the metadata storage, Garage does not do checksumming and integrity
- verification on its own. Users have reported that when using the LMDB
- database engine (the default), database files have a tendency of becoming
- corrupted after an unclean shutdown (e.g. a power outage), so you should use
- a robust filesystem such as BTRFS or ZFS for the metadata partition, and take
- regular snapshots so that you can restore to a recent known-good state in
- case of an incident. If you cannot do so, you might want to switch to Sqlite
- which is more robust.
+ verification on its own, so it is better to use a robust filesystem such as
+ BTRFS or ZFS. Users have reported that when using the LMDB database engine
+ (the default), database files have a tendency of becoming corrupted after an
+ unclean shutdown (e.g. a power outage), so you should take regular snapshots
+ to be able to recover from such a situation. This can be done using Garage's
+ built-in automatic snapshotting (since v0.9.4), or by using filesystem-level
+ snapshots. If you cannot do so, you might want to switch to Sqlite which is
+ more robust.
 - LMDB is the fastest and most tested database engine, but it has the following
 weaknesses: 1/ data files are not architecture-independent, you cannot simply
 move a Garage metadata directory between nodes running different architectures,
 and 2/ LMDB is not suited for 32-bit platforms. Sqlite is a viable alternative
 if any of these are of concern.
@@ -124,6 +125,7 @@ A valid `/etc/garage.toml` for our cluster would look as follows:
 metadata_dir = "/var/lib/garage/meta"
 data_dir = "/var/lib/garage/data"
 db_engine = "lmdb"
+metadata_auto_snapshot_interval = "6h"
 replication_mode = "3"
 diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md index f4450dae..c76dc39e 100644 --- a/doc/book/operations/durability-repairs.md +++ b/doc/book/operations/durability-repairs.md @@ -104,6 +104,24 @@ operation will also move out all data from locations marked as read-only.
 # Metadata operations
+## Metadata snapshotting
+
+It is good practice to set up automatic snapshotting of your metadata database
+file, to recover from situations where it becomes corrupted on disk.
This can +be done at the filesystem level if you are using ZFS or BTRFS. + +Since Garage v0.9.4, Garage is able to take snapshots of the metadata database +itself. This basically amounts to copying the database file, except that it can +be run live while Garage is running without the risk of corruption or +inconsistencies. This can be setup to run automatically on a schedule using +[`metadata_auto_snapshot_interval`](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval). +A snapshot can also be triggered manually using the `garage meta snapshot` +command. Note that taking a snapshot using this method is very intensive as it +requires making a full copy of the database file, so you might prefer using +filesystem-level snapshots if possible. To recover a corrupted node from such a +snapshot, read the instructions +[here](@/documentation/operations/recovering.md#corrupted_meta). + ## Metadata table resync Garage automatically resyncs all entries stored in the metadata tables every hour, diff --git a/doc/book/operations/recovering.md b/doc/book/operations/recovering.md index 7a830788..6e19db0e 100644 --- a/doc/book/operations/recovering.md +++ b/doc/book/operations/recovering.md @@ -108,3 +108,57 @@ garage layout apply # once satisfied, apply the changes Garage will then start synchronizing all required data on the new node. This process can be monitored using the `garage stats -a` command. + +## Replacement scenario 3: corrupted metadata {#corrupted_meta} + +In some cases, your metadata DB file might become corrupted, for instance if +your node suffered a power outage and did not shut down properly. In this case, +you can recover without having to change the node ID and rebuilding a cluster +layout. This means that data blocks will not need to be shuffled around, you +must simply find a way to repair the metadata file. The best way is generally +to discard the corrupted file and recover it from another source. + +First of all, start by locating the database file in your metadata directory, +which [depends on your `db_engine` +choice](@/documentation/reference-manual/configuration.md#db_engine). Then, +your recovery options are as follows: + +- **Option 1: resyncing from other nodes.** In case your cluster is replicated + with two or three copies, you can simply delete the database file, and Garage + will resync from other nodes. To do so, stop Garage, delete the database file + or directory, and restart Garage. Then, do a full table repair by calling + `garage repair -a --yes tables`. This will take a bit of time to complete as + the new node will need to receive copies of the metadata tables from the + network. + +- **Option 2: restoring a snapshot taken by Garage.** Since v0.9.4, Garage can + [automatically take regular + snapshots](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval) + of your metadata DB file. This file or directory should be located under + `/snapshots`, and is named according to the UTC time at which it + was taken. Stop Garage, discard the database file/directory and replace it by the + snapshot you want to use. For instance, in the case of LMDB: + + ```bash + cd $METADATA_DIR + mv db.lmdb db.lmdb.bak + cp -r snapshots/2024-03-15T12:13:52Z db.lmdb + ``` + + And for Sqlite: + + ```bash + cd $METADATA_DIR + mv db.sqlite db.sqlite.bak + cp snapshots/2024-03-15T12:13:52Z db.sqlite + ``` + + Then, restart Garage and run a full table repair by calling `garage repair -a + --yes tables`. 
This should run relatively fast as only the changes that + occurred since the snapshot was taken will need to be resynchronized. Of + course, if your cluster is not replicated, you will lose all changes that + occurred since the snapshot was taken. + +- **Option 3: restoring a filesystem-level snapshot.** If you are using ZFS or + BTRFS to snapshot your metadata partition, refer to their specific + documentation on rolling back or copying files from an old snapshot. diff --git a/doc/book/operations/upgrading.md b/doc/book/operations/upgrading.md index 6b6ea26d..c239bfe4 100644 --- a/doc/book/operations/upgrading.md +++ b/doc/book/operations/upgrading.md @@ -73,6 +73,18 @@ The entire procedure would look something like this: You can do all of the nodes in a single zone at once as that won't impact global cluster availability. Do not try to make a backup of the metadata folder of a running node. + **Since Garage v0.9.4,** you can use the `garage meta snapshot --all` command + to take a simultaneous snapshot of the metadata database files of all your + nodes. This avoids the tedious process of having to take them down one by + one before upgrading. Be careful that if automatic snapshotting is enabled, + Garage only keeps the last two snapshots and deletes older ones, so you might + want to disable automatic snapshotting in your upgraded configuration file + until you have confirmed that the upgrade ran successfully. In addition to + snapshotting the metadata databases of your nodes, you should back-up at + least the `cluster_layout` file of one of your Garage instances (this file + should be the same on all nodes and you can copy it safely while Garage is + running). + 3. Prepare your binaries and configuration files for the new Garage version 4. Restart all nodes simultaneously in the new version diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md index 8e87b7d8..de800ec0 100644 --- a/doc/book/reference-manual/configuration.md +++ b/doc/book/reference-manual/configuration.md @@ -15,6 +15,7 @@ data_dir = "/var/lib/garage/data" metadata_fsync = true data_fsync = false disable_scrub = false +metadata_auto_snapshot_interval = "6h" db_engine = "lmdb" @@ -90,6 +91,7 @@ Top-level configuration options: [`db_engine`](#db_engine), [`disable_scrub`](#disable_scrub), [`lmdb_map_size`](#lmdb_map_size), +[`metadata_auto_snapshot_interval`](#metadata_auto_snapshot_interval), [`metadata_dir`](#metadata_dir), [`metadata_fsync`](#metadata_fsync), [`replication_mode`](#replication_mode), @@ -346,6 +348,25 @@ at the cost of a moderate drop in write performance. Similarly to `metatada_fsync`, this is likely not necessary if geographical replication is used. +#### `metadata_auto_snapshot_interval` (since Garage v0.9.4) {#metadata_auto_snapshot_interval} + +If this value is set, Garage will automatically take a snapshot of the metadata +DB file at a regular interval and save it in the metadata directory. +This can allow to recover from situations where the metadata DB file is corrupted, +for instance after an unclean shutdown. +See [this page](@/documentation/operations/recovering.md#corrupted_meta) for details. + +Garage keeps only the two most recent snapshots of the metadata DB and deletes +older ones automatically. + +Note that taking a metadata snapshot is a relatively intensive operation as the +entire data file is copied. A snapshot being taken might have performance +impacts on the Garage node while it is running. 
If the cluster is under heavy +write load when a snapshot operation is running, this might also cause the +database file to grow in size significantly as pages cannot be recycled easily. +For this reason, it might be better to use filesystem-level snapshots instead +if possible. + #### `disable_scrub` {#disable_scrub} By default, Garage runs a scrub of the data directory approximately once per -- cgit v1.2.3
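Taken together, the snapshot-related commands documented in these patches could be combined into a short pre-upgrade checklist. The following is only a sketch: the commands (`garage meta snapshot --all`, `garage repair -a --yes tables`) and the `snapshots/` directory layout are the ones described above, while the backup destination, the timestamp and the exact location of the `cluster_layout` file are assumptions made for the sake of the example:

```bash
# Take a simultaneous snapshot of the metadata DB on all nodes (Garage >= v0.9.4).
# Remember that Garage keeps only the two most recent snapshots when automatic
# snapshotting is enabled.
garage meta snapshot --all

# Keep a copy of the cluster layout from one node (assuming it is stored in the
# metadata directory, as on a default install).
cp /var/lib/garage/meta/cluster_layout /root/garage-backup/cluster_layout

# If a node later comes back with a corrupted metadata DB, stop it and restore
# the snapshot (LMDB example; the timestamp is illustrative):
#   cd /var/lib/garage/meta
#   mv db.lmdb db.lmdb.bak
#   cp -r snapshots/2024-03-15T12:13:52Z db.lmdb
# then restart Garage and run: garage repair -a --yes tables
```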