Diffstat (limited to 'doc/book/cookbook/durability-repairs.md')
-rw-r--r-- doc/book/cookbook/durability-repairs.md | 114
1 files changed, 0 insertions, 114 deletions
diff --git a/doc/book/cookbook/durability-repairs.md b/doc/book/cookbook/durability-repairs.md
deleted file mode 100644
index 46eb25b8..00000000
--- a/doc/book/cookbook/durability-repairs.md
+++ /dev/null
@@ -1,114 +0,0 @@
-+++
-title = "Durability & Repairs"
-weight = 50
-+++
-
-To ensure the best durability of your data and to fix any inconsistencies that may
-pop up in a distributed system, Garage provides a series of repair operations.
-This guide explains what each of them does and when it should be applied.
-
-
-# General syntax of repair operations
-
-Repair operations described below are of the form `garage repair <repair_name>`.
-These repairs will not launch without the `--yes` flag, which should
-be added as follows: `garage repair --yes <repair_name>`.
-By default these repair procedures will only run on the Garage node your CLI is
-connecting to. To run on all nodes, add the `-a` flag as follows:
-`garage repair -a --yes <repair_name>`.
-
-# Data block operations
-
-## Data store scrub
-
-Scrubbing the data store means examining each individual data block to check that
-its content is correct, by verifying its hash. Any block found to be corrupted
-(e.g. by bitrot or by an accidental manipulation of the datastore) will be
-restored from another node that holds a valid copy.
-
-A scrub is run automatically by Garage every 30 days. It can also be launched
-manually using `garage repair scrub start`.
-
-To view the status of an ongoing scrub, first find the task ID of the scrub worker
-using `garage worker list`. Then, run `garage worker info <scrub_task_id>` to
-view detailed runtime statistics of the scrub. To gather cluster-wide information,
-this command has to be run on each individual node.
-
-A scrub is a very disk-intensive operation that might slow down your cluster.
-You may pause an ongoing scrub using `garage repair scrub pause`, but note that
-the scrub will resume automatically 24 hours later as Garage will not let your
-cluster run without a regular scrub. If the scrub procedure is too intensive
-for your servers and is slowing down your workload, the recommended solution
-is to increase the "scrub tranquility" using `garage repair scrub set-tranquility`.
-A higher tranquility value will make Garage pause longer between block
-verifications. As a consequence, a full scrub of the data store will also take longer.
-
-## Block check and resync
-
-In some cases, nodes hold a reference to a block but do not actually have the block
-stored on disk. Conversely, they may also hold blocks on disk that are no longer
-referenced. To fix both cases, a block repair may be run with `garage repair blocks`.
-This will scan the entire block reference counter table to check that the blocks
-exist on disk, and will scan the entire disk store to check that stored blocks
-are referenced.
-
-It is recommended to run this procedure when changing your cluster layout,
-after the metadata tables have finished synchronizing between nodes
-(usually a few hours after `garage layout apply`).
-
-## Inspecting lost blocks
-
-In extremely rare situations, data blocks may be unavailable from the entire cluster.
-This means that even using `garage repair blocks`, some nodes may be unable
-to fetch data blocks for which they hold a reference.
-
-These errors are stored on each node in a list of "block resync errors", i.e.
-blocks for which the last resync operation failed.
-This list can be inspected using `garage block list-errors`.
-These errors usually fall into one of the following categories:
-
-1. a block is still referenced but the object was deleted; this is a case
- of metadata reference inconsistency (see below for the fix)
-2. a block is referenced by a non-deleted object, but could not be fetched due
- to a transient error such as a network failure
-3. a block is referenced by a non-deleted object, but could not be fetched due
- to a permanent error such as there not being any valid copy of the block on the
- entire cluster
-
-To help distinguish case 1 from cases 2 and 3, you may use the
-`garage block info` command to see which objects hold a reference to each block.
-
-In the second case (transient errors), Garage will try to fetch the block again
-after a certain time, so the error should disappear naturally. Once you have fixed
-the transient issue, you can also ask Garage to retry fetching the block immediately
-using `garage block retry-now`.
-
-If you are confident that you are in the third scenario and that your data block
-is definitely lost, then there is no choice but to declare your S3 objects
-as unrecoverable, and to delete them properly from the data store. This can be done
-using the `garage block purge` command.
-
-
-# Metadata operations
-
-## Metadata table resync
-
-Garage automatically resyncs all entries stored in the metadata tables every hour,
-to ensure that all nodes have the most up-to-date version of all the information
-they should be holding.
-The resync procedure is based on a Merkle tree, which makes it possible to find
-differences between nodes efficiently.
-
-In some special cases, e.g. before an upgrade, you might want to run a table
-resync manually. This can be done using `garage repair tables`.
-
-## Metadata table reference fixes
-
-In some very rare cases where nodes are unavailable, references between objects
-may become broken. For instance, if an object is deleted, the underlying versions or data
-blocks may still be held by Garage. If you suspect that such corruption has occurred
-in your cluster, you can run one of the following repair procedures:
-
-- `garage repair versions`: checks that all versions belong to a non-deleted object, and purges any orphan version
-- `garage repair block_refs`: checks that all block references belong to a non-deleted object version, and purges any orphan block reference (this will then allow the blocks to be garbage-collected)
-
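Reviewer note on the two reference fixes listed at the end of the removed page: the cascade they describe (orphan versions are purged, then orphan block references, after which unreferenced blocks become eligible for garbage collection) can be pictured with a small in-memory model. This is only an illustrative sketch, not Garage's implementation: the `Version` and `BlockRef` classes, the helper functions, and the sample keys are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical stand-in for a version entry: its ID and owning object."""
    version_id: str
    object_key: str


@dataclass(frozen=True)
class BlockRef:
    """Hypothetical stand-in for a block reference held by a version."""
    block_hash: str
    version_id: str


def purge_orphan_versions(live_objects, versions):
    """Keep only versions whose owning object still exists."""
    return [v for v in versions if v.object_key in live_objects]


def purge_orphan_block_refs(versions, block_refs):
    """Keep only block references whose version still exists."""
    live_ids = {v.version_id for v in versions}
    return [r for r in block_refs if r.version_id in live_ids]


def collectable_blocks(all_blocks, block_refs):
    """Blocks with no remaining reference may be garbage-collected."""
    referenced = {r.block_hash for r in block_refs}
    return set(all_blocks) - referenced


# Sample state: one live object, plus leftovers from a deleted object.
live_objects = {"photos/a.jpg"}
versions = [Version("v1", "photos/a.jpg"), Version("v2", "photos/deleted.jpg")]
block_refs = [BlockRef("h1", "v1"), BlockRef("h2", "v2")]

versions = purge_orphan_versions(live_objects, versions)
block_refs = purge_orphan_block_refs(versions, block_refs)
print(collectable_blocks({"h1", "h2"}, block_refs))  # prints {'h2'}
```

The order matters in this model: block references are only recognised as orphans once the version they point to has been purged, which mirrors the order in which the page lists the two repairs.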