From 990205dc3b2f552e9b68168c0aad3e96e2d6f2f0 Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Thu, 14 Mar 2024 16:57:44 +0100 Subject: [disable-scrub] document `disable_scrub` config option --- doc/book/operations/durability-repairs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'doc/book/operations') diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md index 578899a8..f4450dae 100644 --- a/doc/book/operations/durability-repairs.md +++ b/doc/book/operations/durability-repairs.md @@ -19,7 +19,7 @@ connecting to. To run on all nodes, add the `-a` flag as follows: # Data block operations -## Data store scrub +## Data store scrub {#scrub} Scrubbing the data store means examining each individual data block to check that their content is correct, by verifying their hash. Any block found to be corrupted -- cgit v1.2.3 From 8cf3d24875d41d79ab08d637cd38d2a5b9e527dd Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Fri, 15 Mar 2024 13:16:41 +0100 Subject: [db-snapshot] documentation for metadata db snapshots --- doc/book/operations/durability-repairs.md | 18 +++++++++++ doc/book/operations/recovering.md | 54 +++++++++++++++++++++++++++++++ doc/book/operations/upgrading.md | 12 +++++++ 3 files changed, 84 insertions(+) (limited to 'doc/book/operations') diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md index f4450dae..c76dc39e 100644 --- a/doc/book/operations/durability-repairs.md +++ b/doc/book/operations/durability-repairs.md @@ -104,6 +104,24 @@ operation will also move out all data from locations marked as read-only. # Metadata operations +## Metadata snapshotting + +It is good practice to setup automatic snapshotting of your metadata database +file, to recover from situations where it becomes corrupted on disk. This can +be done at the filesystem level if you are using ZFS or BTRFS. + +Since Garage v0.9.4, Garage is able to take snapshots of the metadata database +itself. This basically amounts to copying the database file, except that it can +be run live while Garage is running without the risk of corruption or +inconsistencies. This can be setup to run automatically on a schedule using +[`metadata_auto_snapshot_interval`](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval). +A snapshot can also be triggered manually using the `garage meta snapshot` +command. Note that taking a snapshot using this method is very intensive as it +requires making a full copy of the database file, so you might prefer using +filesystem-level snapshots if possible. To recover a corrupted node from such a +snapshot, read the instructions +[here](@/documentation/operations/recovering.md#corrupted_meta). + ## Metadata table resync Garage automatically resyncs all entries stored in the metadata tables every hour, diff --git a/doc/book/operations/recovering.md b/doc/book/operations/recovering.md index 7a830788..6e19db0e 100644 --- a/doc/book/operations/recovering.md +++ b/doc/book/operations/recovering.md @@ -108,3 +108,57 @@ garage layout apply # once satisfied, apply the changes Garage will then start synchronizing all required data on the new node. This process can be monitored using the `garage stats -a` command. + +## Replacement scenario 3: corrupted metadata {#corrupted_meta} + +In some cases, your metadata DB file might become corrupted, for instance if +your node suffered a power outage and did not shut down properly. In this case, +you can recover without having to change the node ID and rebuilding a cluster +layout. This means that data blocks will not need to be shuffled around, you +must simply find a way to repair the metadata file. The best way is generally +to discard the corrupted file and recover it from another source. + +First of all, start by locating the database file in your metadata directory, +which [depends on your `db_engine` +choice](@/documentation/reference-manual/configuration.md#db_engine). Then, +your recovery options are as follows: + +- **Option 1: resyncing from other nodes.** In case your cluster is replicated + with two or three copies, you can simply delete the database file, and Garage + will resync from other nodes. To do so, stop Garage, delete the database file + or directory, and restart Garage. Then, do a full table repair by calling + `garage repair -a --yes tables`. This will take a bit of time to complete as + the new node will need to receive copies of the metadata tables from the + network. + +- **Option 2: restoring a snapshot taken by Garage.** Since v0.9.4, Garage can + [automatically take regular + snapshots](@/documentation/reference-manual/configuration.md#metadata_auto_snapshot_interval) + of your metadata DB file. This file or directory should be located under + `/snapshots`, and is named according to the UTC time at which it + was taken. Stop Garage, discard the database file/directory and replace it by the + snapshot you want to use. For instance, in the case of LMDB: + + ```bash + cd $METADATA_DIR + mv db.lmdb db.lmdb.bak + cp -r snapshots/2024-03-15T12:13:52Z db.lmdb + ``` + + And for Sqlite: + + ```bash + cd $METADATA_DIR + mv db.sqlite db.sqlite.bak + cp snapshots/2024-03-15T12:13:52Z db.sqlite + ``` + + Then, restart Garage and run a full table repair by calling `garage repair -a + --yes tables`. This should run relatively fast as only the changes that + occurred since the snapshot was taken will need to be resynchronized. Of + course, if your cluster is not replicated, you will lose all changes that + occurred since the snapshot was taken. + +- **Option 3: restoring a filesystem-level snapshot.** If you are using ZFS or + BTRFS to snapshot your metadata partition, refer to their specific + documentation on rolling back or copying files from an old snapshot. diff --git a/doc/book/operations/upgrading.md b/doc/book/operations/upgrading.md index 6b6ea26d..c239bfe4 100644 --- a/doc/book/operations/upgrading.md +++ b/doc/book/operations/upgrading.md @@ -73,6 +73,18 @@ The entire procedure would look something like this: You can do all of the nodes in a single zone at once as that won't impact global cluster availability. Do not try to make a backup of the metadata folder of a running node. + **Since Garage v0.9.4,** you can use the `garage meta snapshot --all` command + to take a simultaneous snapshot of the metadata database files of all your + nodes. This avoids the tedious process of having to take them down one by + one before upgrading. Be careful that if automatic snapshotting is enabled, + Garage only keeps the last two snapshots and deletes older ones, so you might + want to disable automatic snapshotting in your upgraded configuration file + until you have confirmed that the upgrade ran successfully. In addition to + snapshotting the metadata databases of your nodes, you should back-up at + least the `cluster_layout` file of one of your Garage instances (this file + should be the same on all nodes and you can copy it safely while Garage is + running). + 3. Prepare your binaries and configuration files for the new Garage version 4. Restart all nodes simultaneously in the new version -- cgit v1.2.3