Merge branch 'main' into next-0.10

author: Alex Auvolat <alex@adnab.me> 2024-03-18 20:17:54 +0100
committer: Alex Auvolat <alex@adnab.me> 2024-03-18 20:19:30 +0100
commit: 0038ca8a78f147b9c0ec07ef0121773aaf110dc9 (patch)
tree: 43f39f30c63a6affa62eeea62cfec674f217c2b4 /doc
parent: 81191d2d92e58ff82ace0f4d82b275c157673ade (diff)
parent: 1a0bffae3491fae6af5a8d4defc5c6b84839e197 (diff)
7 files changed, 177 insertions, 27 deletions
diff --git a/doc/book/cookbook/from-source.md b/doc/book/cookbook/from-source.md
index f7fd17ce..f0e185a4 100644
--- a/doc/book/cookbook/from-source.md
+++ b/doc/book/cookbook/from-source.md
@@ -91,4 +91,4 @@ The following feature flags are available in v0.8.0:
 | `metrics` | *by default* | Enable collection of metrics in Prometheus format on the admin API |
 | `telemetry-otlp` | optional | Enable collection of execution traces using OpenTelemetry |
 | `lmdb` | *by default* | Enable using LMDB to store Garage's metadata |
-| `sqlite` | optional | Enable using Sqlite3 to store Garage's metadata |
+| `sqlite` | *by default* | Enable using Sqlite3 to store Garage's metadata |
diff --git a/doc/book/cookbook/real-world.md b/doc/book/cookbook/real-world.md
index 30be4907..cd42bb0c 100644
--- a/doc/book/cookbook/real-world.md
+++ b/doc/book/cookbook/real-world.md
@@ -27,7 +27,7 @@ To run a real-world deployment, make sure the following conditions are met:
   [Yggdrasil](https://yggdrasil-network.github.io/) are approaches to consider
   in addition to building out your own VPN tunneling.
 
-- This guide will assume you are using Docker containers to deploy Garage on each node. 
+- This guide will assume you are using Docker containers to deploy Garage on each node.
   Garage can also be run independently, for instance as a [Systemd service](@/documentation/cookbook/systemd.md).
   You can also use an orchestrator such as Nomad or Kubernetes to automatically manage
   Docker containers on a fleet of nodes.
@@ -53,9 +53,9 @@ to store 2 TB of data in total.
 
 ### Best practices
 
-- If you have fast dedicated networking between all your nodes, and are planing to store
-  very large files, bump the `block_size` configuration parameter to 10 MB
-  (`block_size = 10485760`).
+- If you have reasonably fast networking between all your nodes, and are planning to store
+  mostly large files, bump the `block_size` configuration parameter to 10 MB
+  (`block_size = "10M"`).
 
 - Garage stores its files in two locations: it uses a metadata directory to store frequently-accessed
   small metadata items, and a data directory to store data blocks of uploaded objects.
@@ -73,14 +73,25 @@ to store 2 TB of data in total.
   help a lot with performance.  The default LMDB database engine is the most tested
   and has good performance.
 
-- For the metadata storage, Garage does not do checksumming and integrity
-  verification on its own. If you are afraid of bitrot/data corruption,
-  put your metadata directory on a ZFS or BTRFS partition. Otherwise, just use regular
-  EXT4 or XFS.
-
 - Servers with multiple HDDs are supported natively by Garage without resorting
   to RAID, see [our dedicated documentation page](@/documentation/operations/multi-hdd.md).
 
+- For the metadata storage, Garage does not do checksumming and integrity
+  verification on its own, so it is better to use a robust filesystem such as
+  BTRFS or ZFS. Users have reported that when using the LMDB database engine
+  (the default), database files tend to become corrupted after an
+  unclean shutdown (e.g. a power outage), so you should take regular snapshots
+  to be able to recover from such a situation.  This can be done using Garage's
+  built-in automatic snapshotting (since v0.9.4), or by using filesystem-level
+  snapshots. If you cannot do so, you might want to switch to Sqlite, which is
+  more robust.
+
+- LMDB is the fastest and most tested database engine, but it has the following
+  weaknesses: 1/ data files are not architecture-independent, so you cannot simply
+  move a Garage metadata directory between nodes running different architectures,
+  and 2/ LMDB is not suited for 32-bit platforms. Sqlite is a viable alternative
+  if either of these is a concern.
+
 ## Get a Docker image
 
 Our docker image is currently named `dxflrs/garage` and is stored on the [Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated).
@@ -114,6 +125,7 @@ A valid `/etc/garage.toml` for our cluster would look as follows:
 metadata_dir = "/var/lib/garage/meta"
 data_dir = "/var/lib/garage/data"
 db_engine = "lmdb"
+metadata_auto_snapshot_interval = "6h"
 
 replication_factor = 3
 
@@ -186,7 +198,7 @@ upgrades.  With the containerized setup proposed here, the upgrade process
 will require stopping and removing the existing container, and re-creating it
 with the upgraded version.
 
-## Controling the daemon
+## Controlling the daemon
 
 The `garage` binary has two purposes:
   - it acts as a daemon when launched with `garage server`
@@ -244,7 +256,7 @@ You can then instruct nodes to connect to one another as follows:
 Venus$ garage node connect 563e1ac825ee3323aa441e72c26d1030d6d4414aeb3dd25287c531e7fc2bc95d@[fc00:1::1]:3901
 ```
 
-You don't nead to instruct all node to connect to all other nodes:
+You don't need to instruct all nodes to connect to all other nodes:
 nodes will discover one another transitively.
 
 Now if your run `garage status` on any node, you should have an output that looks as follows:
@@ -327,8 +339,8 @@ Given the information above, we will configure our cluster as follow:
 ```bash
 garage layout assign 563e -z par1 -c 1T -t mercury
 garage layout assign 86f0 -z par1 -c 2T -t venus
-garage layout assign 6814 -z lon1 -c 2T -t earth 
-garage layout assign 212f -z bru1 -c 1.5T -t mars 
+garage layout assign 6814 -z lon1 -c 2T -t earth
+garage layout assign 212f -z bru1 -c 1.5T -t mars
 ```
 
 At this point, the changes in the cluster layout have not yet been applied.
diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md
index 578899a8..c76dc39e 100644
--- a/doc/book/operations/durability-repairs.md
+++ b/doc/book/operations/durability-repairs.md
@@ -19,7 +19,7 @@ connecting to. To run on all nodes, add the `-a` flag as follows:
 
 # Data block operations
 
-## Data store scrub
+## Data store scrub {#scrub}
 
 Scrubbing the data store means examining each individual data block to check that
 their content is correct, by verifying their hash. Any block found to be corrupted
@@ -104,6 +104,24 @@ operation will also move out all data from locations marked as read-only.
 
 # Metadata operations
 
+## Metadata snapshotting
+
+It is good practice to set up automatic snapshotting of your metadata database
+file, to recover from situations where it becomes corrupted on disk. This can
+be done at the filesystem level if you are using ZFS or BTRFS.
+
+Since Garage v0.9.4, Garage is able to take snapshots of the metadata database
+itself. This basically amounts to copying the database file, except that it can
+be run live while Garage is running without the risk of corruption or
+inconsistencies.  This can be set up to run automatically on a schedule using
+[`metadata_auto_snapshot_interval`](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval).
+A snapshot can also be triggered manually using the `garage meta snapshot`
+command. Note that taking a snapshot using this method is very intensive as it
+requires making a full copy of the database file, so you might prefer using
+filesystem-level snapshots if possible. To recover a corrupted node from such a
+snapshot, read the instructions
+[here](@/documentation/operations/recovering.md#corrupted_meta).
+
 ## Metadata table resync
 
 Garage automatically resyncs all entries stored in the metadata tables every hour,
diff --git a/doc/book/operations/recovering.md b/doc/book/operations/recovering.md
index 7a830788..6e19db0e 100644
--- a/doc/book/operations/recovering.md
+++ b/doc/book/operations/recovering.md
@@ -108,3 +108,57 @@ garage layout apply   # once satisfied, apply the changes
 
 Garage will then start synchronizing all required data on the new node.
 This process can be monitored using the `garage stats -a` command.
+
+## Replacement scenario 3: corrupted metadata {#corrupted_meta}
+
+In some cases, your metadata DB file might become corrupted, for instance if
+your node suffered a power outage and did not shut down properly. In this case,
+you can recover without having to change the node ID and rebuild a cluster
+layout. This means that data blocks will not need to be shuffled around; you
+must simply find a way to repair the metadata file. The best way is generally
+to discard the corrupted file and recover it from another source.
+
+First of all, start by locating the database file in your metadata directory,
+which [depends on your `db_engine`
+choice](@/documentation/reference-manual/configuration.md#db_engine).  Then,
+your recovery options are as follows:
+
+- **Option 1: resyncing from other nodes.** In case your cluster is replicated
+  with two or three copies, you can simply delete the database file, and Garage
+  will resync from other nodes. To do so, stop Garage, delete the database file
+  or directory, and restart Garage. Then, do a full table repair by calling
+  `garage repair -a --yes tables`.  This will take a bit of time to complete as
+  the new node will need to receive copies of the metadata tables from the
+  network.
+
+- **Option 2: restoring a snapshot taken by Garage.** Since v0.9.4, Garage can
+  [automatically take regular
+  snapshots](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval)
+  of your metadata DB file. This file or directory should be located under
+  `<metadata_dir>/snapshots`, and is named according to the UTC time at which it
+  was taken. Stop Garage, discard the database file/directory and replace it with the
+  snapshot you want to use. For instance, in the case of LMDB:
+
+  ```bash
+  cd $METADATA_DIR
+  mv db.lmdb db.lmdb.bak
+  cp -r snapshots/2024-03-15T12:13:52Z db.lmdb
+  ```
+
+  And for Sqlite:
+
+  ```bash
+  cd $METADATA_DIR
+  mv db.sqlite db.sqlite.bak
+  cp snapshots/2024-03-15T12:13:52Z db.sqlite
+  ```
+
+  Then, restart Garage and run a full table repair by calling `garage repair -a
+  --yes tables`.  This should run relatively fast as only the changes that
+  occurred since the snapshot was taken will need to be resynchronized. Of
+  course, if your cluster is not replicated, you will lose all changes that
+  occurred since the snapshot was taken.
+
+- **Option 3: restoring a filesystem-level snapshot.** If you are using ZFS or
+  BTRFS to snapshot your metadata partition, refer to their specific
+  documentation on rolling back or copying files from an old snapshot.
diff --git a/doc/book/operations/upgrading.md b/doc/book/operations/upgrading.md
index 6b6ea26d..c239bfe4 100644
--- a/doc/book/operations/upgrading.md
+++ b/doc/book/operations/upgrading.md
@@ -73,6 +73,18 @@ The entire procedure would look something like this:
   You can do all of the nodes in a single zone at once as that won't impact global cluster availability.
   Do not try to make a backup of the metadata folder of a running node.
 
+  **Since Garage v0.9.4,** you can use the `garage meta snapshot --all` command
+  to take a simultaneous snapshot of the metadata database files of all your
+  nodes.  This avoids the tedious process of having to take them down one by
+  one before upgrading. Note that if automatic snapshotting is enabled,
+  Garage only keeps the last two snapshots and deletes older ones, so you might
+  want to disable automatic snapshotting in your upgraded configuration file
+  until you have confirmed that the upgrade ran successfully.  In addition to
+  snapshotting the metadata databases of your nodes, you should back up at
+  least the `cluster_layout` file of one of your Garage instances (this file
+  should be the same on all nodes and you can copy it safely while Garage is
+  running).
+
 3. Prepare your binaries and configuration files for the new Garage version
 
 4. Restart all nodes simultaneously in the new version
diff --git a/doc/book/quick-start/_index.md b/doc/book/quick-start/_index.md
index be9fe329..9619f388 100644
--- a/doc/book/quick-start/_index.md
+++ b/doc/book/quick-start/_index.md
@@ -57,7 +57,7 @@ to generate unique and private secrets for security reasons:
 cat > garage.toml <<EOF
 metadata_dir = "/tmp/meta"
 data_dir = "/tmp/data"
-db_engine = "lmdb"
+db_engine = "sqlite"
 
 replication_factor = 1
 
diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md
index 4df2d0df..a21f945b 100644
--- a/doc/book/reference-manual/configuration.md
+++ b/doc/book/reference-manual/configuration.md
@@ -15,6 +15,8 @@ metadata_dir = "/var/lib/garage/meta"
 data_dir = "/var/lib/garage/data"
 metadata_fsync = true
 data_fsync = false
+disable_scrub = false
+metadata_auto_snapshot_interval = "6h"
 
 db_engine = "lmdb"
 
@@ -86,7 +88,9 @@ Top-level configuration options:
 [`data_dir`](#data_dir),
 [`data_fsync`](#data_fsync),
 [`db_engine`](#db_engine),
+[`disable_scrub`](#disable_scrub),
 [`lmdb_map_size`](#lmdb_map_size),
+[`metadata_auto_snapshot_interval`](#metadata_auto_snapshot_interval),
 [`metadata_dir`](#metadata_dir),
 [`metadata_fsync`](#metadata_fsync),
 [`replication_factor`](#replication_factor),
@@ -277,18 +281,33 @@ old Sled metadata databases to another engine.
 
 Performance characteristics of the different DB engines are as follows:
 
-- LMDB: the recommended database engine on 64-bit systems, much more
-  space-efficient and slightly faster. Note that the data format of LMDB is not
-  portable between architectures, so for instance the Garage database of an
-  x86-64 node cannot be moved to an ARM64 node. Also note that, while LMDB can
-  technically be used on 32-bit systems, this will limit your node to very
-  small database sizes due to how LMDB works; it is therefore not recommended.
+- LMDB: the recommended database engine for high-performance distributed clusters.
+  LMDB works very well, but is known to have the following limitations:
+
+  - The data format of LMDB is not portable between architectures, so for
+    instance the Garage database of an x86-64 node cannot be moved to an ARM64
+    node.
+
+  - While LMDB can technically be used on 32-bit systems, this will limit your
+    node to very small database sizes due to how LMDB works; it is therefore
+    not recommended.
+
+  - Several users have reported corrupted LMDB database files after an unclean
+    shutdown (e.g. a power outage). This situation can generally be recovered
+    from if your cluster is geo-replicated (by rebuilding your metadata db from
+    other nodes), or if you have saved regular snapshots at the filesystem
+    level.
+
+  - Keys in LMDB are limited to 511 bytes. As a result, object keys in S3 and
+    sort keys in K2V are limited to 479 bytes.
 
 - Sqlite: Garage supports Sqlite as an alternative storage backend for
-  metadata, and although it has not been tested as much, it is expected to work
-  satisfactorily.  Since Garage v0.9.0, performance issues have largely been
-  fixed by allowing for a no-fsync mode (see `metadata_fsync`). Sqlite does not
-  have the database size limitation of LMDB on 32-bit systems.
+  metadata, which does not have the issues listed above for LMDB.
+  On versions 0.8.x and earlier, Sqlite should be avoided due to abysmal
+  performance, which was fixed with the addition of `metadata_fsync`.
+  Sqlite is still probably slower than LMDB due to the way we use it,
+  so it is not the best choice for high-performance storage clusters,
+  but it should work fine in many cases.
 
 It is possible to convert Garage's metadata directory from one format to another
 using the `garage convert-db` command, which should be used as follows:
@@ -315,7 +334,7 @@ Using this option reduces the risk of simultaneous metadata corruption on severa
 cluster nodes, which could lead to data loss.
 
 If multi-site replication is used, this option is most likely not necessary, as
-it is extremely unlikely that two nodes in different locations will have a 
+it is extremely unlikely that two nodes in different locations will have a
 power failure at the exact same time.
 
 (Metadata corruption on a single node is not an issue, the corrupted data file
@@ -343,6 +362,41 @@ at the cost of a moderate drop in write performance.
 Similarly to `metatada_fsync`, this is likely not necessary
 if geographical replication is used.
 
+#### `metadata_auto_snapshot_interval` (since Garage v0.9.4) {#metadata_auto_snapshot_interval}
+
+If this value is set, Garage will automatically take a snapshot of the metadata
+DB file at a regular interval and save it in the metadata directory.
+This makes it possible to recover from situations where the metadata DB file is corrupted,
+for instance after an unclean shutdown.
+See [this page](@/documentation/operations/recovering.md#corrupted_meta) for details.
+
+Garage keeps only the two most recent snapshots of the metadata DB and deletes
+older ones automatically.
+
+Note that taking a metadata snapshot is a relatively intensive operation as the
+entire data file is copied. A snapshot being taken might have performance
+impacts on the Garage node while it is running. If the cluster is under heavy
+write load when a snapshot operation is running, this might also cause the
+database file to grow in size significantly as pages cannot be recycled easily.
+For this reason, it might be better to use filesystem-level snapshots instead
+if possible.
+
+#### `disable_scrub` {#disable_scrub}
+
+By default, Garage runs a scrub of the data directory approximately once per
+month, with a random delay to avoid all nodes running at the same time.  When
+it scrubs the data directory, Garage will read all of the data files stored on
+disk to check their integrity, and will rebuild any data files that it finds
+corrupted, using the remaining valid copies stored on other nodes.
+See [this page](@/documentation/operations/durability-repairs.md#scrub) for details.
+
+Set the `disable_scrub` configuration value to `true` if you don't need Garage
+to scrub the data directory, for instance if you are already scrubbing at the
+filesystem level. Note that in this case, if you find a corrupted data file,
+you should delete it from the data directory and then call `garage repair
+blocks` on the node to ensure that it re-obtains a copy from another node on
+the network.
+
 #### `block_size` {#block_size}
 
 Garage splits stored objects in consecutive chunks of size `block_size`
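
Note on the snapshot restore steps added in `recovering.md`: the patch says snapshots live under `<metadata_dir>/snapshots` and are named after the UTC time at which they were taken, and that only the two most recent are kept. Assuming the names follow the `2024-03-15T12:13:52Z` pattern shown in the example, sorting them as plain text also sorts them chronologically, so the newest one can be picked as in the sketch below. The `latest_snapshot` function is a hypothetical helper, not a Garage command, and the demo runs on a temporary directory rather than a real metadata directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Garage): print the name of the most recent
# entry in a snapshots directory. Assumes entries are named with UTC timestamps
# such as 2024-03-15T12:13:52Z, which sort chronologically when sorted as text.
latest_snapshot() {
  local dir="$1"
  # List entry names only, sort them bytewise, keep the last (newest) one.
  ls -1 "$dir" | LC_ALL=C sort | tail -n 1
}

# Demo on a throwaway directory, so nothing here touches a real metadata_dir.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/2024-03-15T06:13:52Z" "$demo_dir/2024-03-15T12:13:52Z"
echo "latest snapshot: $(latest_snapshot "$demo_dir")"
rm -rf "$demo_dir"
```

With the name in hand, the `mv` and `cp` steps from the `recovering.md` hunk apply unchanged. Garage should be stopped before the database file or directory is replaced, as the patch describes.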