Diffstat (limited to 'doc/book')
-rw-r--r-- doc/book/connect/apps/index.md | 8
-rw-r--r-- doc/book/connect/backup.md | 2
-rw-r--r-- doc/book/connect/repositories.md | 4
-rw-r--r-- doc/book/cookbook/real-world.md | 88
-rw-r--r-- doc/book/operations/durability-repairs.md | 11
-rw-r--r-- doc/book/operations/layout.md | 221
-rw-r--r-- doc/book/operations/multi-hdd.md | 101
-rw-r--r-- doc/book/operations/upgrading.md | 2
-rw-r--r-- doc/book/quick-start/_index.md | 21
-rw-r--r-- doc/book/reference-manual/configuration.md | 58
-rw-r--r-- doc/book/reference-manual/s3-compatibility.md | 32
-rw-r--r-- doc/book/working-documents/migration-09.md | 72
12 files changed, 532 insertions, 88 deletions
diff --git a/doc/book/connect/apps/index.md b/doc/book/connect/apps/index.md
index 3f59530a..f67a29c9 100644
--- a/doc/book/connect/apps/index.md
+++ b/doc/book/connect/apps/index.md
@@ -37,7 +37,7 @@ Second, we suppose you have created a key and a bucket.
As a reminder, you can create a key for your nextcloud instance as follow:
```bash
-garage key new --name nextcloud-key
+garage key create nextcloud-key
```
Keep the Key ID and the Secret key in a pad, they will be needed later.
@@ -139,7 +139,7 @@ a reasonable trade-off for some instances.
Create a key for Peertube:
```bash
-garage key new --name peertube-key
+garage key create peertube-key
```
Keep the Key ID and the Secret key in a pad, they will be needed later.
@@ -253,7 +253,7 @@ As such, your Garage cluster should be configured appropriately for good perform
This is the usual Garage setup:
```bash
-garage key new --name mastodon-key
+garage key create mastodon-key
garage bucket create mastodon-data
garage bucket allow mastodon-data --read --write --key mastodon-key
```
@@ -379,7 +379,7 @@ Supposing you have a working synapse installation, you can add the module with p
Now create a bucket and a key for your matrix instance (note your Key ID and Secret Key somewhere, they will be needed later):
```bash
-garage key new --name matrix-key
+garage key create matrix-key
garage bucket create matrix
garage bucket allow matrix --read --write --key matrix-key
```
diff --git a/doc/book/connect/backup.md b/doc/book/connect/backup.md
index d20c3c96..585ec469 100644
--- a/doc/book/connect/backup.md
+++ b/doc/book/connect/backup.md
@@ -54,7 +54,7 @@ how to configure this.
Create your key and bucket:
```bash
-garage key new my-key
+garage key create my-key
garage bucket create backup
garage bucket allow backup --read --write --key my-key
```
diff --git a/doc/book/connect/repositories.md b/doc/book/connect/repositories.md
index 4b14bb46..66365d64 100644
--- a/doc/book/connect/repositories.md
+++ b/doc/book/connect/repositories.md
@@ -23,7 +23,7 @@ You can configure a different target for each data type (check `[lfs]` and `[att
Let's start by creating a key and a bucket (your key id and secret will be needed later, keep them somewhere):
```bash
-garage key new --name gitea-key
+garage key create gitea-key
garage bucket create gitea
garage bucket allow gitea --read --write --key gitea-key
```
@@ -118,7 +118,7 @@ through another support, like a git repository.
As a first step, we will need to create a bucket on Garage and enabling website access on it:
```bash
-garage key new --name nix-key
+garage key create nix-key
garage bucket create nix.example.com
garage bucket allow nix.example.com --read --write --key nix-key
garage bucket website nix.example.com --allow
diff --git a/doc/book/cookbook/real-world.md b/doc/book/cookbook/real-world.md
index 7061069f..ea4ce1f9 100644
--- a/doc/book/cookbook/real-world.md
+++ b/doc/book/cookbook/real-world.md
@@ -19,9 +19,10 @@ To run a real-world deployment, make sure the following conditions are met:
- You have at least three machines with sufficient storage space available.
-- Each machine has a public IP address which is reachable by other machines. It
- is highly recommended that you use IPv6 for this end-to-end connectivity. If
- IPv6 is not available, then using a mesh VPN such as
+- Each machine has an IP address which makes it directly reachable by all other machines.
+ In many cases, nodes will be behind a NAT and will not each have a public
+ IPv4 address. In this case, it is recommended that you use IPv6 for this
+ end-to-end connectivity if it is available. Otherwise, using a mesh VPN such as
[Nebula](https://github.com/slackhq/nebula) or
[Yggdrasil](https://yggdrasil-network.github.io/) are approaches to consider
in addition to building out your own VPN tunneling.
@@ -42,7 +43,7 @@ For our example, we will suppose the following infrastructure with IPv6 connecti
| Brussels | Mars | fc00:F::1 | 1.5 TB |
Note that Garage will **always** store the three copies of your data on nodes at different
-locations. This means that in the case of this small example, the available capacity
+locations. This means that in the case of this small example, the usable capacity
of the cluster is in fact only 1.5 TB, because nodes in Brussels can't store more than that.
This also means that nodes in Paris and London will be under-utilized.
To make better use of the available hardware, you should ensure that the capacity
@@ -75,28 +76,23 @@ to store 2 TB of data in total.
- For the metadata storage, Garage does not do checksumming and integrity
verification on its own. If you are afraid of bitrot/data corruption,
- put your metadata directory on a BTRFS partition. Otherwise, just use regular
+ put your metadata directory on a ZFS or BTRFS partition. Otherwise, just use regular
EXT4 or XFS.
-- Having a single server with several storage drives is currently not very well
- supported in Garage ([#218](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/218)).
- For an easy setup, just put all your drives in a RAID0 or a ZFS RAIDZ array.
- If you're adventurous, you can try to format each of your disk as
- a separate XFS partition, and then run one `garage` daemon per disk drive,
- or use something like [`mergerfs`](https://github.com/trapexit/mergerfs) to merge
- all your disks in a single union filesystem that spreads load over them.
+- Servers with multiple HDDs are supported natively by Garage without resorting
+ to RAID, see [our dedicated documentation page](@/documentation/operations/multi-hdd.md).
## Get a Docker image
Our docker image is currently named `dxflrs/garage` and is stored on the [Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated).
-We encourage you to use a fixed tag (eg. `v0.8.0`) and not the `latest` tag.
-For this example, we will use the latest published version at the time of the writing which is `v0.8.0` but it's up to you
+We encourage you to use a fixed tag (e.g. `v0.9.0`) and not the `latest` tag.
+For this example, we will use the latest published version at the time of writing, which is `v0.9.0`, but it's up to you
to check [the most recent versions on the Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated).
For example:
```
-sudo docker pull dxflrs/garage:v0.8.0
+sudo docker pull dxflrs/garage:v0.9.0
```
## Deploying and configuring Garage
@@ -161,12 +157,13 @@ docker run \
-v /etc/garage.toml:/etc/garage.toml \
-v /var/lib/garage/meta:/var/lib/garage/meta \
-v /var/lib/garage/data:/var/lib/garage/data \
- dxflrs/garage:v0.8.0
+ dxflrs/garage:v0.9.0
```
-It should be restarted automatically at each reboot.
-Please note that we use host networking as otherwise Docker containers
-can not communicate with IPv6.
+With this command line, Garage should be started automatically at each boot.
+Please note that we use host networking as otherwise the network indirection
+added by Docker would prevent Garage nodes from communicating with one another
+(especially if using IPv6).
If you want to use `docker-compose`, you may use the following `docker-compose.yml` file as a reference:
@@ -174,7 +171,7 @@ If you want to use `docker-compose`, you may use the following `docker-compose.y
version: "3"
services:
garage:
- image: dxflrs/garage:v0.8.0
+ image: dxflrs/garage:v0.9.0
network_mode: "host"
restart: unless-stopped
volumes:
@@ -183,10 +180,12 @@ services:
- /var/lib/garage/data:/var/lib/garage/data
```
-Upgrading between Garage versions should be supported transparently,
-but please check the relase notes before doing so!
-To upgrade, simply stop and remove this container and
-start again the command with a new version of Garage.
+If you wish to upgrade your cluster, make sure to read the corresponding
+[documentation page](@/documentation/operations/upgrading.md) first, as well as
+the documentation relevant to your version of Garage in the case of major
+upgrades. With the containerized setup proposed here, the upgrade process
+will require stopping and removing the existing container, and re-creating it
+with the upgraded version.
## Controling the daemon
@@ -270,12 +269,12 @@ of a role that is assigned to each active cluster node.
For our example, we will suppose we have the following infrastructure
(Capacity, Identifier and Zone are specific values to Garage described in the following):
-| Location | Name | Disk Space | `Capacity` | `Identifier` | `Zone` |
-|----------|---------|------------|------------|--------------|--------------|
-| Paris | Mercury | 1 TB | `10` | `563e` | `par1` |
-| Paris | Venus | 2 TB | `20` | `86f0` | `par1` |
-| London | Earth | 2 TB | `20` | `6814` | `lon1` |
-| Brussels | Mars | 1.5 TB | `15` | `212f` | `bru1` |
+| Location | Name | Disk Space | Identifier | Zone (`-z`) | Capacity (`-c`) |
+|----------|---------|------------|------------|-------------|-----------------|
+| Paris | Mercury | 1 TB | `563e` | `par1` | `1T` |
+| Paris | Venus | 2 TB | `86f0` | `par1` | `2T` |
+| London | Earth | 2 TB | `6814` | `lon1` | `2T` |
+| Brussels | Mars | 1.5 TB | `212f` | `bru1` | `1.5T` |
#### Node identifiers
@@ -297,6 +296,8 @@ garage status
It will display the IP address associated with each node;
from the IP address you will be able to recognize the node.
+We will now use the `garage layout assign` command to configure the correct parameters for each node.
+
#### Zones
Zones are simply a user-chosen identifier that identify a group of server that are grouped together logically.
@@ -306,29 +307,29 @@ In most cases, a zone will correspond to a geographical location (i.e. a datacen
Behind the scene, Garage will use zone definition to try to store the same data on different zones,
in order to provide high availability despite failure of a zone.
+Zones are passed to Garage using the `-z` flag of `garage layout assign` (see below).
+
#### Capacity
-Garage reasons on an abstract metric about disk storage that is named the *capacity* of a node.
-The capacity configured in Garage must be proportional to the disk space dedicated to the node.
+Garage needs to know the storage capacity (disk space) it can/should use on
+each node, to be able to correctly balance data.
+
+Capacity values are expressed in bytes and are passed to Garage using the `-c` flag of `garage layout assign` (see below).
-Capacity values must be **integers** but can be given any signification.
-Here we chose that 1 unit of capacity = 100 GB.
+#### Tags
-Note that the amount of data stored by Garage on each server may not be strictly proportional to
-its capacity value, as Garage will priorize having 3 copies of data in different zones,
-even if this means that capacities will not be strictly respected. For example in our above examples,
-nodes Earth and Mars will always store a copy of everything each, and the third copy will
-have 66% chance of being stored by Venus and 33% chance of being stored by Mercury.
+You can add additional tags to nodes using the `-t` flag of `garage layout assign` (see below).
+Tags have no specific meaning for Garage and can be used at your convenience.
#### Injecting the topology
Given the information above, we will configure our cluster as follow:
```bash
-garage layout assign 563e -z par1 -c 10 -t mercury
-garage layout assign 86f0 -z par1 -c 20 -t venus
-garage layout assign 6814 -z lon1 -c 20 -t earth
-garage layout assign 212f -z bru1 -c 15 -t mars
+garage layout assign 563e -z par1 -c 1T -t mercury
+garage layout assign 86f0 -z par1 -c 2T -t venus
+garage layout assign 6814 -z lon1 -c 2T -t earth
+garage layout assign 212f -z bru1 -c 1.5T -t mars
```
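As a quick sanity check on the capacities assigned above, the following sketch computes the usable capacity of the example cluster. It is not part of Garage: the function name `usable_capacity_tb` and the data layout are assumptions made for illustration, and it only covers a cluster with exactly three zones and three copies of the data. Under those assumptions each zone holds one full copy, so the zone with the smallest total capacity sets the limit.

```python
from collections import defaultdict

# (zone, capacity in TB) for each node of the example table above.
NODES = {
    "mercury": ("par1", 1.0),
    "venus": ("par1", 2.0),
    "earth": ("lon1", 2.0),
    "mars": ("bru1", 1.5),
}


def usable_capacity_tb(nodes):
    """Usable capacity, assuming exactly 3 zones and 3 copies (one per zone)."""
    per_zone = defaultdict(float)
    for zone, capacity in nodes.values():
        per_zone[zone] += capacity
    if len(per_zone) != 3:
        raise ValueError("this sketch only covers clusters with exactly 3 zones")
    # Every zone stores a full copy, so the smallest zone is the bottleneck.
    return min(per_zone.values())


print(usable_capacity_tb(NODES))  # 1.5
```

The result matches the 1.5 TB figure given earlier for this example: Brussels is the smallest zone.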
At this point, the changes in the cluster layout have not yet been applied.
@@ -338,6 +339,7 @@ To show the new layout that will be applied, call:
garage layout show
```
+Make sure to carefully read the output of `garage layout show`.
Once you are satisfied with your new layout, apply it with:
```bash
diff --git a/doc/book/operations/durability-repairs.md b/doc/book/operations/durability-repairs.md
index 498c8fda..b0d2c78a 100644
--- a/doc/book/operations/durability-repairs.md
+++ b/doc/book/operations/durability-repairs.md
@@ -91,6 +91,16 @@ is definitely lost, then there is no other choice than to declare your S3 object
as unrecoverable, and to delete them properly from the data store. This can be done
using the `garage block purge` command.
+## Rebalancing data directories
+
+In [multi-HDD setups](@/documentation/operations/multi-hdd.md), to ensure that
+data blocks are well balanced between storage locations, you may run a
+rebalance operation using `garage repair rebalance`. This is useful when
+adding storage locations or when capacities of the storage locations have been
+changed. Once this is finished, Garage will know, for each block, the single
+possible location where it can be found, which can increase access speed. This
+operation will also move out all data from locations marked as read-only.
+
# Metadata operations
@@ -114,4 +124,3 @@ in your cluster, you can run one of the following repair procedures:
- `garage repair versions`: checks that all versions belong to a non-deleted object, and purges any orphan version
- `garage repair block_refs`: checks that all block references belong to a non-deleted object version, and purges any orphan block reference (this will then allow the blocks to be garbage-collected)
-
diff --git a/doc/book/operations/layout.md b/doc/book/operations/layout.md
index 5e314246..ece17ddb 100644
--- a/doc/book/operations/layout.md
+++ b/doc/book/operations/layout.md
@@ -9,18 +9,30 @@ a certain capacity, or a gateway node that does not store data and is only
used as an API entry point for faster cluster access.
An introduction to building cluster layouts can be found in the [production deployment](@/documentation/cookbook/real-world.md) page.
+In Garage, all of the data that can be stored in a given cluster is divided
+into slices which we call *partitions*. Each partition is stored by
+one or several nodes in the cluster
+(see [`replication_mode`](@/documentation/reference-manual/configuration.md#replication-mode)).
+The layout determines the correspondence between these partitions,
+which exist on a logical level, and actual storage nodes.
+
## How cluster layouts work in Garage
-In Garage, a cluster layout is composed of the following components:
+A cluster layout is composed of the following components:
-- a table of roles assigned to nodes
+- a table of roles assigned to nodes, defined by the user
+- an optimal assignation of partitions to nodes, computed by an algorithm that is run once when calling `garage layout apply` or the ApplyClusterLayout API endpoint
- a version number
Garage nodes will always use the cluster layout with the highest version number.
Garage nodes also maintain and synchronize between them a set of proposed role
changes that haven't yet been applied. These changes will be applied (or
-canceled) in the next version of the layout
+canceled) in the next version of the layout.
+
+All operations on the layout can be performed using the `garage` CLI or using the
+[administration API endpoint](@/documentation/reference-manual/admin-api.md).
+We describe the CLI commands here; the admin API semantics are very similar.
The following commands insert modifications to the set of proposed role changes
for the next layout version (but they do not create the new layout immediately):
@@ -51,7 +63,7 @@ commands will fail otherwise.
## Warnings about Garage cluster layout management
-**Warning: never make several calls to `garage layout apply` or `garage layout
+**⚠️ Never make several calls to `garage layout apply` or `garage layout
revert` with the same value of the `--version` flag. Doing so can lead to the
creation of several different layouts with the same version number, in which
case your Garage cluster will become inconsistent until fixed.** If a call to
@@ -65,13 +77,198 @@ shell, you shouldn't have much issues as long as you run commands one after
the other and take care of checking the output of `garage layout show`
before applying any changes.
-If you are using the `garage` CLI to script layout changes, follow the following recommendations:
+If you are using the `garage` CLI or the admin API to script layout changes,
+follow these recommendations:
+
+- If using the CLI, make all of your `garage` CLI calls to the same RPC host.
+ If using the admin API, make all of your API calls to the same Garage node. Do
+ not connect to individual nodes to send them each a piece of the layout changes
+ you are making, as the changes propagate asynchronously between nodes and might
+ not all be taken into account at the time when the new layout is applied.
+
+- **Only call `garage layout apply`/ApplyClusterLayout once**, and call it
+ **strictly after** all of the `layout assign` and `layout remove`
+ commands/UpdateClusterLayout API calls have returned.
+
+
+## Understanding unexpected layout calculations
+
+When adding, removing or modifying nodes in a cluster layout, sometimes
+unexpected assignations of partitions to nodes can occur. These assignations
+are in fact normal and logical, given the objectives of the algorithm. Indeed,
+**the layout algorithm prioritizes moving less data between nodes over
+achieving an equal distribution of load. It also tries to use all links between
+pairs of nodes in equal proportions when moving data.** This section presents
+two examples and illustrates how one can control Garage's behavior to obtain
+the desired results.
+
+### Example 1
+
+In this example, a cluster is originally composed of 3 nodes in 3 different
+zones (data centers). The three nodes are of equal capacity, therefore they
+are all fully exploited and all store a copy of all of the data in the cluster.
+
+Then, a fourth node of the same size is added in the datacenter `dc1`.
+As illustrated by the following, **Garage will by default not store any data on the new node**:
+
+```
+$ garage layout show
+==== CURRENT CLUSTER LAYOUT ====
+ID Tags Zone Capacity Usable capacity
+b10c110e4e854e5a node1 dc1 1000.0 MB 1000.0 MB (100.0%)
+a235ac7695e0c54d node2 dc2 1000.0 MB 1000.0 MB (100.0%)
+62b218d848e86a64 node3 dc3 1000.0 MB 1000.0 MB (100.0%)
+
+Zone redundancy: maximum
+
+Current cluster layout version: 6
+
+==== STAGED ROLE CHANGES ====
+ID Tags Zone Capacity
+a11c7cf18af29737 node4 dc1 1000.0 MB
+
+
+==== NEW CLUSTER LAYOUT AFTER APPLYING CHANGES ====
+ID Tags Zone Capacity Usable capacity
+b10c110e4e854e5a node1 dc1 1000.0 MB 1000.0 MB (100.0%)
+a11c7cf18af29737 node4 dc1 1000.0 MB 0 B (0.0%)
+a235ac7695e0c54d node2 dc2 1000.0 MB 1000.0 MB (100.0%)
+62b218d848e86a64 node3 dc3 1000.0 MB 1000.0 MB (100.0%)
+
+Zone redundancy: maximum
+
+==== COMPUTATION OF A NEW PARTITION ASSIGNATION ====
+
+Partitions are replicated 3 times on at least 3 distinct zones.
+
+Optimal partition size: 3.9 MB (3.9 MB in previous layout)
+Usable capacity / total cluster capacity: 3.0 GB / 4.0 GB (75.0 %)
+Effective capacity (replication factor 3): 1000.0 MB
+
+A total of 0 new copies of partitions need to be transferred.
+
+dc1 Tags Partitions Capacity Usable capacity
+ b10c110e4e854e5a node1 256 (0 new) 1000.0 MB 1000.0 MB (100.0%)
+ a11c7cf18af29737 node4 0 (0 new) 1000.0 MB 0 B (0.0%)
+ TOTAL 256 (256 unique) 2.0 GB 1000.0 MB (50.0%)
+
+dc2 Tags Partitions Capacity Usable capacity
+ a235ac7695e0c54d node2 256 (0 new) 1000.0 MB 1000.0 MB (100.0%)
+ TOTAL 256 (256 unique) 1000.0 MB 1000.0 MB (100.0%)
+
+dc3 Tags Partitions Capacity Usable capacity
+ 62b218d848e86a64 node3 256 (0 new) 1000.0 MB 1000.0 MB (100.0%)
+ TOTAL 256 (256 unique) 1000.0 MB 1000.0 MB (100.0%)
+```
+
+While unexpected, this is logical because of the following facts:
+
+- storing some data on the new node does not help increase the total quantity
+ of data that can be stored on the cluster, as the two other zones (`dc2` and
+ `dc3`) still need to store a full copy of everything, and their capacity is
+ still the same;
+
+- there is therefore no need to move any data to the new node as this would be pointless;
+
+- moving data to the new node has a cost which the algorithm decides to not pay if not necessary.
+
+This distribution of data may however not be what the administrator wanted: if
+they added a new node to `dc1`, it might be because the existing node is too
+slow, and they wish to halve its load. In that case, what they need to
+do to force Garage to distribute the data between the two nodes is to assign
+only half of the capacity to each node in `dc1` (in our example, 500M instead of 1G).
+In that case, Garage would determine that to be able to store 1G in total, it
+would need to store 500M on the old node and 500M on the added one.
+
+
+### Example 2
+
+The following example is a slightly different scenario, where `dc1` had two
+nodes that were used at 50%, and `dc2` and `dc3` each have one node that is
+100% used. All node capacities are the same.
+
+Then, a node from `dc1` is moved into `dc3`. One could expect that the roles of
+`dc1` and `dc3` would simply be swapped: the remaining node in `dc1` would be
+used at 100%, and the two nodes now in `dc3` would be used at 50%. Instead,
+this happens:
+
+```
+==== CURRENT CLUSTER LAYOUT ====
+ID Tags Zone Capacity Usable capacity
+b10c110e4e854e5a node1 dc1 1000.0 MB 500.0 MB (50.0%)
+a11c7cf18af29737 node4 dc1 1000.0 MB 500.0 MB (50.0%)
+a235ac7695e0c54d node2 dc2 1000.0 MB 1000.0 MB (100.0%)
+62b218d848e86a64 node3 dc3 1000.0 MB 1000.0 MB (100.0%)
+
+Zone redundancy: maximum
+
+Current cluster layout version: 8
+
+==== STAGED ROLE CHANGES ====
+ID Tags Zone Capacity
+a11c7cf18af29737 node4 dc3 1000.0 MB
+
+
+==== NEW CLUSTER LAYOUT AFTER APPLYING CHANGES ====
+ID Tags Zone Capacity Usable capacity
+b10c110e4e854e5a node1 dc1 1000.0 MB 1000.0 MB (100.0%)
+a235ac7695e0c54d node2 dc2 1000.0 MB 1000.0 MB (100.0%)
+62b218d848e86a64 node3 dc3 1000.0 MB 753.9 MB (75.4%)
+a11c7cf18af29737 node4 dc3 1000.0 MB 246.1 MB (24.6%)
+
+Zone redundancy: maximum
+
+==== COMPUTATION OF A NEW PARTITION ASSIGNATION ====
+
+Partitions are replicated 3 times on at least 3 distinct zones.
+
+Optimal partition size: 3.9 MB (3.9 MB in previous layout)
+Usable capacity / total cluster capacity: 3.0 GB / 4.0 GB (75.0 %)
+Effective capacity (replication factor 3): 1000.0 MB
+
+A total of 128 new copies of partitions need to be transferred.
+
+dc1 Tags Partitions Capacity Usable capacity
+ b10c110e4e854e5a node1 256 (128 new) 1000.0 MB 1000.0 MB (100.0%)
+ TOTAL 256 (256 unique) 1000.0 MB 1000.0 MB (100.0%)
+
+dc2 Tags Partitions Capacity Usable capacity
+ a235ac7695e0c54d node2 256 (0 new) 1000.0 MB 1000.0 MB (100.0%)
+ TOTAL 256 (256 unique) 1000.0 MB 1000.0 MB (100.0%)
+
+dc3 Tags Partitions Capacity Usable capacity
+ 62b218d848e86a64 node3 193 (0 new) 1000.0 MB 753.9 MB (75.4%)
+ a11c7cf18af29737 node4 63 (0 new) 1000.0 MB 246.1 MB (24.6%)
+ TOTAL 256 (256 unique) 2.0 GB 1000.0 MB (50.0%)
+```
+
+As we can see, the node that was moved to `dc3` (node4) is only used at 25% (approximately),
+whereas the node that was already in `dc3` (node3) is used at 75%.
+
+This can be explained by the following:
+
+- node1 will now be the only node remaining in `dc1`, thus it has to store all
+ of the data in the cluster. Since it was storing only half of it before, it has
+ to retrieve the other half from other nodes in the cluster.
+
+- The data which it does not have is entirely stored by the other node that was
+ in `dc1` and that is now in `dc3` (node4). There is also a copy of it on node2
+ and node3 since both these nodes have a copy of everything.
+
+- node3 and node4 are the two nodes that will now be in a datacenter that is
+ under-utilized (`dc3`), this means that those are the two candidates from which
+ data can be removed to be moved to node1.
+
+- Garage will move data in equal proportions from all possible sources. In this
+ case, it means that it will transfer 25% of the entire data set from node3 to
+ node1 and another 25% from node4 to node1.
-- Make all of your `garage` CLI calls to the same RPC host. Do not use the
- `garage` CLI to connect to individual nodes to send them each a piece of the
- layout changes you are making, as the changes propagate asynchronously
- between nodes and might not all be taken into account at the time when the
- new layout is applied.
+This explains why node3 ends with 75% utilization (100% from before minus 25%
+that is moved to node1), and node4 ends with 25% (50% from before minus 25%
+that is moved to node1).
-- **Only call `garage layout apply` once**, and call it **strictly after** all
- of the `layout assign` and `layout remove` commands have returned.
+This illustrates the second principle of the layout computation: **if there is
+a choice in moving data out of some nodes, then all links between pairs of
+nodes are used in equal proportions** (this is only approximately true: the
+algorithm relies on randomness to achieve this, so there might be some small
+fluctuations, as we see above).
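The partition counts in Example 2 can be checked with some simple arithmetic. The sketch below is an illustration only, not Garage's layout algorithm: it takes the 256 partitions shown in the output above and assumes that transfers are split exactly evenly between the two source nodes, whereas the real algorithm is randomized.

```python
PARTITIONS = 256  # partition count shown in the layout output above

# Partitions held by each node before node4 moves from dc1 to dc3.
before = {"node1": 128, "node3": 256, "node4": 128}

# node1 becomes the only node in dc1, so it must hold every partition.
to_fetch = PARTITIONS - before["node1"]

# Assumption: transfers are split evenly between the two possible sources.
sources = ["node3", "node4"]
per_source = to_fetch // len(sources)

after = dict(before, node1=PARTITIONS)
for name in sources:
    after[name] = before[name] - per_source

for name, count in after.items():
    print(f"{name}: {count} partitions ({100 * count / PARTITIONS:.1f}%)")
```

This idealized split gives 192 partitions (75%) for node3 and 64 (25%) for node4, close to the 193 and 63 reported by `garage layout show` in the example.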
diff --git a/doc/book/operations/multi-hdd.md b/doc/book/operations/multi-hdd.md
new file mode 100644
index 00000000..36445b0a
--- /dev/null
+++ b/doc/book/operations/multi-hdd.md
@@ -0,0 +1,101 @@
++++
+title = "Multi-HDD support"
+weight = 15
++++
+
+
+Since v0.9, Garage natively supports nodes that have several storage drives
+for storing data blocks (not for metadata storage).
+
+## Initial setup
+
+To set up a new Garage storage node with multiple HDDs,
+format and mount all your drives in different directories,
+and use a Garage configuration as follows:
+
+```toml
+data_dir = [
+ { path = "/path/to/hdd1", capacity = "2T" },
+ { path = "/path/to/hdd2", capacity = "4T" },
+]
+```
+
+Garage will automatically balance all blocks stored by the node
+among the different specified directories, proportionally to the
+specified capacities.
+
+## Updating the list of storage locations
+
+If you add new storage locations to your `data_dir`,
+Garage will not rebalance existing data between storage locations.
+Newly written blocks will be balanced proportionally to the specified capacities,
+and existing data may be moved between drives to improve balancing,
+but only opportunistically when a data block is re-written (e.g. an object
+is re-uploaded, or an object with a duplicate block is uploaded).
+
+To understand precisely what is happening, we need to dive into how Garage
+splits data among the different storage locations.
+
+First of all, Garage divides the set of all possible block hashes
+into a fixed number of slices (currently 1024), and assigns
+to each slice a primary storage location among the specified data directories.
+The number of slices having their primary location in each data directory
+is proportional to the capacity specified in the config file.
+
+When Garage receives a block to write, it will always write it in the primary
+directory of the slice that contains its hash.
+
+To avoid losing existing data blocks when storage locations
+are added, Garage also keeps a list of secondary data directories
+for all of the hash slices. Secondary data directories for a slice indicate
+storage locations that once were primary directories for that slice, i.e. where
+Garage knows that data blocks of that slice might be stored.
+When Garage is requested to read a certain data block,
+it will first look in the primary storage directory of its slice,
+and if it doesn't find it there it goes through all of the secondary storage
+locations until it finds it. This allows Garage to continue operating
+normally when storage locations are added, without having to shuffle
+files between drives to place them in the correct location.
+
+This relatively simple strategy works well but does not ensure that data
+is correctly balanced among drives according to their capacity.
+To rebalance data, two strategies can be used:
+
+- Lazy rebalancing: when a block is re-written (e.g. the object is re-uploaded),
+ Garage checks whether the existing copy is in the primary directory of the slice
+ or in a secondary directory. If the current copy is in a secondary directory,
+ Garage re-writes a copy in the primary directory and deletes the one from the
+ secondary directory. This might never end up rebalancing everything if there
+ are data blocks that are only read and never written.
+
+- Active rebalancing: an operator of a Garage node can explicitly launch a repair
+ procedure that rebalances the data directories, moving all blocks to their
+ primary location. Once done, all secondary locations for all hash slices are
+ removed so that they won't be checked anymore when looking for a data block.
+
+## Read-only storage locations
+
+If you would like to move all data blocks from an existing data directory to one
+or several new data directories, mark the old directory as read-only:
+
+```toml
+data_dir = [
+ { path = "/path/to/old_data", read_only = true },
+ { path = "/path/to/new_hdd1", capacity = "2T" },
+ { path = "/path/to/new_hdd2", capacity = "4T" },
+]
+```
+
+Garage will be able to read requested blocks from the read-only directory.
+Garage will also move data out of the read-only directory either progressively
+(lazy rebalancing) or if requested explicitly (active rebalancing).
+
+Once an active rebalancing has finished, your read-only directory should be empty:
+it might still contain subdirectories, but no data files. You can check that
+it contains no files using:
+
+```bash
+find /path/to/old_data -type f # should not print anything
+```
+
+at which point it can be removed from the `data_dir` list in your config file.
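The slice mechanism described in the multi-HDD page above can be sketched as follows. This is a simplified illustration, not Garage's implementation: the names `assign_primary`, `slice_of` and `read_candidates`, the use of the first two hash bytes to select a slice, and the rounding strategy are all assumptions.

```python
N_SLICES = 1024  # number of hash slices stated in the text above


def assign_primary(dirs):
    """Give each slice a primary directory, proportionally to capacity.

    `dirs` maps a directory path to its capacity (any consistent unit).
    Returns a list of N_SLICES paths. Slices left over after rounding down
    go to the directories with the largest fractional share (an assumption).
    """
    total = sum(dirs.values())
    shares = {p: N_SLICES * c / total for p, c in dirs.items()}
    counts = {p: int(s) for p, s in shares.items()}
    leftover = N_SLICES - sum(counts.values())
    by_fraction = sorted(dirs, key=lambda p: shares[p] - counts[p], reverse=True)
    for p in by_fraction[:leftover]:
        counts[p] += 1
    return [p for p, n in counts.items() for _ in range(n)]


def slice_of(block_hash: bytes) -> int:
    """Map a block hash to a slice index (hypothetical scheme)."""
    return int.from_bytes(block_hash[:2], "big") % N_SLICES


def read_candidates(block_hash, primary, secondary):
    """Directories to try when reading: primary first, then secondaries."""
    i = slice_of(block_hash)
    return [primary[i]] + [d for d in secondary.get(i, []) if d != primary[i]]


primary = assign_primary({"/path/to/hdd1": 2, "/path/to/hdd2": 4})
print(primary.count("/path/to/hdd1"), primary.count("/path/to/hdd2"))
```

With the 2T/4T example from the configuration above, this assigns 341 slices to the first drive and 683 to the second. A read for a block would then try `read_candidates(...)` in order, which mirrors the "primary first, then secondary" lookup described in the text.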
diff --git a/doc/book/operations/upgrading.md b/doc/book/operations/upgrading.md
index e8919a19..9a738282 100644
--- a/doc/book/operations/upgrading.md
+++ b/doc/book/operations/upgrading.md
@@ -80,6 +80,6 @@ The entire procedure would look something like this:
5. If any specific migration procedure is required, it is usually in one of the two cases:
- It can be run on online nodes after the new version has started, during regular cluster operation.
- - it has to be run offline
+ - It has to be run offline, in which case you will again have to take all nodes offline one after the other to run the repair.
For this last step, please refer to the specific documentation pertaining to the version upgrade you are doing.
diff --git a/doc/book/quick-start/_index.md b/doc/book/quick-start/_index.md
index 08932775..1b129f36 100644
--- a/doc/book/quick-start/_index.md
+++ b/doc/book/quick-start/_index.md
@@ -84,9 +84,8 @@ admin_token = "$(openssl rand -base64 32)"
EOF
```
-Now that your configuration file has been created, you can put
-it in the right place. By default, garage looks at **`/etc/garage.toml`.**
-
+Now that your configuration file has been created, you may save it to the directory of your choice.
+By default, Garage looks for **`/etc/garage.toml`.**
You can also store it somewhere else, but you will have to specify `-c path/to/garage.toml`
at each invocation of the `garage` binary (for example: `garage -c ./garage.toml server`, `garage -c ./garage.toml status`).
@@ -103,12 +102,14 @@ your data to be persisted properly.
### Launching the Garage server
-Use the following command to launch the Garage server with our configuration file:
+Use the following command to launch the Garage server:
```
-garage server
+garage -c path/to/garage.toml server
```
+If you have placed the `garage.toml` file in `/etc` (its default location), you can simply run `garage server`.
+
You can tune Garage's verbosity as follows (from less verbose to more verbose):
```
@@ -126,7 +127,7 @@ Log level `debug` can help you check why your S3 API calls are not working.
The `garage` utility is also used as a CLI tool to configure your Garage deployment.
It uses values from the TOML configuration file to find the Garage daemon running on the
local node, therefore if your configuration file is not at `/etc/garage.toml` you will
-again have to specify `-c path/to/garage.toml`.
+again have to specify `-c path/to/garage.toml` at each invocation.
If the `garage` CLI is able to correctly detect the parameters of your local Garage node,
the following command should be enough to show the status of your cluster:
@@ -140,7 +141,7 @@ This should show something like this:
```
==== HEALTHY NODES ====
ID Hostname Address Tag Zone Capacity
-563e1ac825ee3323… linuxbox 127.0.0.1:3901 NO ROLE ASSIGNED
+563e1ac825ee3323 linuxbox 127.0.0.1:3901 NO ROLE ASSIGNED
```
## Creating a cluster layout
@@ -153,12 +154,12 @@ For our test deployment, we are using only one node. The way in which we configu
it does not matter, you can simply write:
```bash
-garage layout assign -z dc1 -c 1 <node_id>
+garage layout assign -z dc1 -c 1G <node_id>
```
where `<node_id>` corresponds to the identifier of the node shown by `garage status` (first column).
You can simply enter a prefix of that identifier.
-For instance here you could write just `garage layout assign -z dc1 -c 1 563e`.
+For instance here you could write just `garage layout assign -z dc1 -c 1G 563e`.
The layout then has to be applied to the cluster, using:
@@ -209,7 +210,7 @@ one key can access multiple buckets, multiple keys can access one bucket.
Create an API key using the following command:
```
-garage key new --name nextcloud-app-key
+garage key create nextcloud-app-key
```
The output should look as follows:
diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md
index 2a8c5df5..0d59b570 100644
--- a/doc/book/reference-manual/configuration.md
+++ b/doc/book/reference-manual/configuration.md
@@ -10,6 +10,8 @@ Here is an example `garage.toml` configuration file that illustrates all of the
```toml
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"
+metadata_fsync = true
+data_fsync = false
db_engine = "lmdb"
@@ -90,6 +92,19 @@ This folder can be placed on an HDD. The space available for `data_dir`
should be counted to determine a node's capacity
when [adding it to the cluster layout](@/documentation/cookbook/real-world.md).
+Since `v0.9.0`, Garage supports multiple data directories with the following syntax:
+
+```toml
+data_dir = [
+ { path = "/path/to/old_data", read_only = true },
+ { path = "/path/to/new_hdd1", capacity = "2T" },
+ { path = "/path/to/new_hdd2", capacity = "4T" },
+]
+```
+
+See [the dedicated documentation page](@/documentation/operations/multi-hdd.md)
+on how to operate Garage in such a setup.
+
### `db_engine` (since `v0.8.0`)
By default, Garage uses the Sled embedded database library
@@ -131,6 +146,49 @@ convert-db -a <input db engine> -i <input db path> \
Make sure to specify the full database path as presented in the table above,
and not just the path to the metadata directory.
+### `metadata_fsync`
+
+Whether to enable synchronous mode for the database engine or not.
+This is disabled (`false`) by default.
+
+This reduces the risk of metadata corruption in case of power failures,
+at the cost of a significant drop in write performance,
+as Garage will have to pause to sync data to disk much more often
+(several times for API calls such as PutObject).
+
+Using this option reduces the risk of simultaneous metadata corruption on several
+cluster nodes, which could lead to data loss.
+
+If multi-site replication is used, this option is most likely not necessary, as
+it is extremely unlikely that two nodes in different locations will have a
+power failure at the exact same time.
+
+(Metadata corruption on a single node is not an issue: the corrupted data file
+can always be deleted and reconstructed from the other nodes in the cluster.)
+
+Here is how this option impacts the different database engines:
+
+| Database | `metadata_fsync = false` (default) | `metadata_fsync = true` |
+|----------|------------------------------------|-------------------------------|
+| Sled | default options | *unsupported* |
+| Sqlite | `PRAGMA synchronous = OFF` | `PRAGMA synchronous = NORMAL` |
+| LMDB | `MDB_NOMETASYNC` + `MDB_NOSYNC` | `MDB_NOMETASYNC` |
+
+Note that the Sqlite database is always run in `WAL` mode (`PRAGMA journal_mode = WAL`).
+
+### `data_fsync`
+
+Whether to `fsync` data blocks and their containing directory after they are
+saved to disk.
+This is disabled (`false`) by default.
+
+This might reduce the risk that a data block is lost in rare
+situations such as several nodes losing power simultaneously,
+at the cost of a moderate drop in write performance.
+
+Similarly to `metadata_fsync`, this is likely not necessary
+if geographical replication is used.
+
### `block_size`
Garage splits stored objects in consecutive chunks of size `block_size`
diff --git a/doc/book/reference-manual/s3-compatibility.md b/doc/book/reference-manual/s3-compatibility.md
index 15b29bd1..1bcfd123 100644
--- a/doc/book/reference-manual/s3-compatibility.md
+++ b/doc/book/reference-manual/s3-compatibility.md
@@ -75,16 +75,13 @@ but these endpoints are documented in [Red Hat Ceph Storage - Chapter 2. Ceph Ob
| Endpoint | Garage | [Openstack Swift](https://docs.openstack.org/swift/latest/s3_compat.html) | [Ceph Object Gateway](https://docs.ceph.com/en/latest/radosgw/s3/) | [Riak CS](https://docs.riak.com/riak/cs/2.1.1/references/apis/storage/s3/index.html) | [OpenIO](https://docs.openio.io/latest/source/arch-design/s3_compliancy.html) |
|------------------------------|----------------------------------|-----------------|---------------|---------|-----|
-| [AbortMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
-| [CompleteMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html) | ✅ Implemented (see details below) | ✅ | ✅ | ✅ | ✅ |
-| [CreateMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html) | ✅ Implemented | ✅| ✅ | ✅ | ✅ |
-| [ListMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListMultipartUpload.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
-| [ListParts](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListParts.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
-| [UploadPart](https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html) | ✅ Implemented (see details below) | ✅ | ✅| ✅ | ✅ |
-| [UploadPartCopy](https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPartCopy.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
-
-Our implementation of Multipart Upload is currently a bit more restrictive than Amazon's one in some edge cases.
-For more information, please refer to our [issue tracker](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/204).
+| [AbortMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
+| [CompleteMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
+| [CreateMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html) | ✅ Implemented | ✅| ✅ | ✅ | ✅ |
+| [ListMultipartUpload](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListMultipartUpload.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
+| [ListParts](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListParts.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
+| [UploadPart](https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html) | ✅ Implemented | ✅ | ✅| ✅ | ✅ |
+| [UploadPartCopy](https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPartCopy.html) | ✅ Implemented | ✅ | ✅ | ✅ | ✅ |
### Website endpoints
@@ -127,15 +124,22 @@ If you need this feature, please [share your use case in our dedicated issue](ht
| Endpoint | Garage | [Openstack Swift](https://docs.openstack.org/swift/latest/s3_compat.html) | [Ceph Object Gateway](https://docs.ceph.com/en/latest/radosgw/s3/) | [Riak CS](https://docs.riak.com/riak/cs/2.1.1/references/apis/storage/s3/index.html) | [OpenIO](https://docs.openio.io/latest/source/arch-design/s3_compliancy.html) |
|------------------------------|----------------------------------|-----------------|---------------|---------|-----|
-| [DeleteBucketLifecycle](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteBucketLifecycle.html) | ❌ Missing | ❌| ✅| ❌| ✅|
-| [GetBucketLifecycleConfiguration](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetBucketLifecycleConfiguration.html) | ❌ Missing | ❌| ✅ | ❌| ✅|
-| [PutBucketLifecycleConfiguration](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketLifecycleConfiguration.html) | ❌ Missing | ❌| ✅ | ❌| ✅|
+| [DeleteBucketLifecycle](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteBucketLifecycle.html) | ✅ Implemented | ❌| ✅| ❌| ✅|
+| [GetBucketLifecycleConfiguration](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetBucketLifecycleConfiguration.html) | ✅ Implemented | ❌| ✅ | ❌| ✅|
+| [PutBucketLifecycleConfiguration](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketLifecycleConfiguration.html) | ⚠ Partially implemented (see below) | ❌| ✅ | ❌| ✅|
| [GetBucketVersioning](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetBucketVersioning.html) | ❌ Stub (see below) | ✅| ✅ | ❌| ✅|
| [ListObjectVersions](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectVersions.html) | ❌ Missing | ❌| ✅ | ❌| ✅|
| [PutBucketVersioning](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketVersioning.html) | ❌ Missing | ❌| ✅| ❌| ✅|
+**PutBucketLifecycleConfiguration:** The only actions supported are
+`AbortIncompleteMultipartUpload` and `Expiration` (without the
+`ExpiredObjectDeleteMarker` field). All other operations are dependent on
+either bucket versioning or storage classes, which Garage currently does not
+implement. The deprecated `Prefix` member directly in the `Rule`
+structure/XML tag is not supported; specified prefixes must be inside the
+`Filter` structure/XML tag.
-**GetBucketVersioning:** Stub implementation (Garage does not yet support versionning so this always returns "versionning not enabled").
+**GetBucketVersioning:** Stub implementation which always returns "versioning not enabled", since Garage does not yet support bucket versioning.
### Replication endpoints
diff --git a/doc/book/working-documents/migration-09.md b/doc/book/working-documents/migration-09.md
new file mode 100644
index 00000000..ba758093
--- /dev/null
+++ b/doc/book/working-documents/migration-09.md
@@ -0,0 +1,72 @@
++++
+title = "Migrating from 0.8 to 0.9"
+weight = 12
++++
+
+**This guide explains how to migrate to 0.9 if you have an existing 0.8 cluster.
+We don't recommend trying to migrate to 0.9 directly from 0.7 or older.**
+
+This migration procedure has been tested on several clusters without issues.
+However, it is still a *critical procedure* that might cause issues.
+**Make sure to back up all your data before attempting it!**
+
+You might also want to read our [general documentation on upgrading Garage](@/documentation/operations/upgrading.md).
+
+The following are **breaking changes** in Garage v0.9 that require your attention when migrating:
+
+- LMDB is now the default metadata db engine and Sled is deprecated. If you were using Sled, make sure to specify `db_engine = "sled"` in your configuration file, or take the time to [convert your database](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#db-engine-since-v0-8-0).
+
+- Capacity values are now in actual byte units. The translation from the old layout will assign 1 capacity = 1 GB by default, which might be wrong for your cluster. This does not cause any data to be moved around, but you might want to re-assign correct capacity values post-migration.
+
+- Multipart uploads that were started in Garage v0.8 will not be visible in Garage v0.9 and will have to be restarted from scratch.
+
+- Changes to the admin API: some `v0/` endpoints have been replaced by `v1/` counterparts with updated/uniformized syntax. All other endpoints have also moved to `v1/` by default, without syntax changes, but are still available under `v0/` for compatibility.
+
+
+## Simple migration procedure (takes cluster offline for a while)
+
+The migration steps are as follows:
+
+1. Disable API and web access. You may do this by stopping your reverse proxy or by commenting out
+ the `api_bind_addr` values in your `config.toml` file and restarting Garage.
+2. Do `garage repair --all-nodes --yes tables` and `garage repair --all-nodes --yes blocks`,
+ check the logs and check that all data seems to be synced correctly between
+ nodes. If you have time, do additional checks (`versions`, `block_refs`, etc.)
+3. Check that the block resync queue and Merkle queue are empty:
+ run `garage stats -a` to query them or inspect metrics in the Grafana dashboard.
+4. Turn off Garage v0.8
+5. **Back up the metadata folder of all your nodes!** For instance, use the following command
+ if your metadata directory is `/var/lib/garage/meta`: `cd /var/lib/garage ; tar -acf meta-v0.8.tar.zst meta/`
+6. Install Garage v0.9
+7. Update your configuration file if necessary.
+8. Turn on Garage v0.9
+9. Do `garage repair --all-nodes --yes tables` and `garage repair --all-nodes --yes blocks`.
+ Wait for a full table sync to run.
+10. Your upgraded cluster should be in a working state. Re-enable API and Web
+ access and check that everything went well.
+11. Monitor your cluster over the next few hours to check that it works well under your production load, and report any issues.
+12. You might want to assign correct capacity values to all your nodes. Doing so might cause data to be moved
+ in your cluster, which should also be monitored carefully.
+
+## Minimal downtime migration procedure
+
+The migration to Garage v0.9 can be done with almost no downtime,
+by restarting all nodes at once in the new version.
+
+The migration steps are as follows:
+
+1. Do `garage repair --all-nodes --yes tables` and `garage repair --all-nodes --yes blocks`,
+ check the logs and check that all data seems to be synced correctly between
+ nodes. If you have time, do additional checks (`versions`, `block_refs`, etc.)
+
+2. Turn off each node individually; back up its metadata folder (see above); turn it back on again.
+ This will allow you to take a backup of all nodes without impacting global cluster availability.
+ You can do all nodes of a single zone at once as this does not impact the availability of Garage.
+
+3. Prepare your binaries and configuration files for Garage v0.9
+
+4. Shut down all v0.8 nodes simultaneously, and restart them all simultaneously in v0.9.
+ Use your favorite deployment tool (Ansible, Kubernetes, Nomad) to achieve this as fast as possible.
+ Garage v0.9 should be in a working state as soon as it starts.
+
+5. Proceed with repair and monitoring as described in steps 9-12 above.