author    Alex Auvolat <alex@adnab.me>  2023-04-25 12:34:26 +0200
committer Alex Auvolat <alex@adnab.me>  2023-04-25 12:34:26 +0200
commit    fa78d806e3ae40031e80eebb86e4eb1756d7baea (patch)
tree      144662fb430c484093f6f9a585a2441c2ff26494 /doc/book
parent    654999e254e6c1f46bb5d668bc1230f226575716 (diff)
parent    a16eb7e4b8344d2f58c09a249b7b1bd17d339a35 (diff)
Merge branch 'main' into next
Diffstat (limited to 'doc/book')
-rw-r--r--  doc/book/connect/_index.md                        7
-rw-r--r--  doc/book/connect/apps/index.md                   65
-rw-r--r--  doc/book/connect/backup.md                       34
-rw-r--r--  doc/book/connect/cli.md                          72
-rw-r--r--  doc/book/connect/observability.md                57
-rw-r--r--  doc/book/cookbook/_index.md                      14
-rw-r--r--  doc/book/cookbook/ansible.md                     51
-rw-r--r--  doc/book/cookbook/binary-packages.md             28
-rw-r--r--  doc/book/cookbook/gateways.md                     2
-rw-r--r--  doc/book/cookbook/kubernetes.md                   3
-rw-r--r--  doc/book/cookbook/monitoring.md                 251
-rw-r--r--  doc/book/cookbook/real-world.md                  10
-rw-r--r--  doc/book/cookbook/reverse-proxy.md              182
-rw-r--r--  doc/book/cookbook/upgrading.md                   81
-rw-r--r--  doc/book/design/_index.md                        12
-rw-r--r--  doc/book/design/related-work.md                   3
-rw-r--r--  doc/book/development/devenv.md                    2
-rw-r--r--  doc/book/development/release-process.md           2
-rw-r--r--  doc/book/quick-start/_index.md                    6
-rw-r--r--  doc/book/reference-manual/admin-api.md            2
-rw-r--r--  doc/book/reference-manual/configuration.md       44
-rw-r--r--  doc/book/reference-manual/k2v.md                  4
-rw-r--r--  doc/book/reference-manual/monitoring.md         285
-rw-r--r--  doc/book/reference-manual/s3-compatibility.md     2
-rw-r--r--  doc/book/working-documents/migration-08.md       25
25 files changed, 850 insertions(+), 394 deletions(-)
diff --git a/doc/book/connect/_index.md b/doc/book/connect/_index.md
index ca44ac17..93a2b87e 100644
--- a/doc/book/connect/_index.md
+++ b/doc/book/connect/_index.md
@@ -10,11 +10,12 @@ Garage implements the Amazon S3 protocol, which makes it compatible with many ex
In particular, you will find here instructions to connect it with:
- - [Browsing tools](@/documentation/connect/cli.md)
- [Applications](@/documentation/connect/apps/index.md)
- - [Website hosting](@/documentation/connect/websites.md)
- - [Software repositories](@/documentation/connect/repositories.md)
+ - [Browsing tools](@/documentation/connect/cli.md)
- [FUSE](@/documentation/connect/fs.md)
+ - [Observability](@/documentation/connect/observability.md)
+ - [Software repositories](@/documentation/connect/repositories.md)
+ - [Website hosting](@/documentation/connect/websites.md)
### Generic instructions
diff --git a/doc/book/connect/apps/index.md b/doc/book/connect/apps/index.md
index 78d9310d..e2d007c3 100644
--- a/doc/book/connect/apps/index.md
+++ b/doc/book/connect/apps/index.md
@@ -13,7 +13,7 @@ In this section, we cover the following web applications:
| [Matrix](#matrix) | ✅ | Tested with `synapse-s3-storage-provider` |
| [Pixelfed](#pixelfed) | ❓ | Not yet tested |
| [Pleroma](#pleroma) | ❓ | Not yet tested |
-| [Lemmy](#lemmy) | ❓ | Not yet tested |
+| [Lemmy](#lemmy) | ✅ | Supported with pict-rs |
| [Funkwhale](#funkwhale) | ❓ | Not yet tested |
| [Misskey](#misskey) | ❓ | Not yet tested |
| [Prismo](#prismo) | ❓ | Not yet tested |
@@ -484,7 +484,68 @@ And add a new line. For example, to run it every 10 minutes:
## Lemmy
-Lemmy uses pict-rs that [supports S3 backends](https://git.asonix.dog/asonix/pict-rs/commit/f9f4fc63d670f357c93f24147c2ee3e1278e2d97)
+Lemmy uses pict-rs, which [supports S3 backends](https://git.asonix.dog/asonix/pict-rs/commit/f9f4fc63d670f357c93f24147c2ee3e1278e2d97).
+This feature requires `pict-rs >= 4.0.0`.
+
+### Creating your bucket
+
+This is the usual Garage setup:
+
+```bash
+garage key new --name pictrs-key
+garage bucket create pictrs-data
+garage bucket allow pictrs-data --read --write --key pictrs-key
+```
+
+Note the Key ID and Secret Key.
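+The key's details can be displayed again later with the `garage key info`
+command:
+
+```bash
+garage key info pictrs-key
+```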
+
+### Migrating your data
+
+If your pict-rs instance holds existing data, you first need to migrate it to the S3 bucket.
+
+Stop pict-rs, then run the migration utility from the local filesystem to the bucket:
+
+```bash
+pict-rs \
+ filesystem -p /path/to/existing/files \
+ object-store \
+ -e my-garage-instance.mydomain.tld:3900 \
+ -b pictrs-data \
+ -r garage \
+ -a GK... \
+ -s abcdef0123456789...
+```
+
+This migration can be slow, so be patient while it completes.
+
+### Running pict-rs with an S3 backend
+
+Pict-rs supports both a configuration file and environment variables.
+
+Either set the following section in your `pict-rs.toml`:
+
+```toml
+[store]
+type = 'object_storage'
+endpoint = 'http://my-garage-instance.mydomain.tld:3900'
+bucket_name = 'pictrs-data'
+region = 'garage'
+access_key = 'GK...'
+secret_key = 'abcdef0123456789...'
+```
+
+... or set these environment variables:
+
+```
+PICTRS__STORE__TYPE=object_storage
+PICTRS__STORE__ENDPOINT=http://my-garage-instance.mydomain.tld:3900
+PICTRS__STORE__BUCKET_NAME=pictrs-data
+PICTRS__STORE__REGION=garage
+PICTRS__STORE__ACCESS_KEY=GK...
+PICTRS__STORE__SECRET_KEY=abcdef0123456789...
+```
+
## Funkwhale
diff --git a/doc/book/connect/backup.md b/doc/book/connect/backup.md
index 919e78c3..97a89e36 100644
--- a/doc/book/connect/backup.md
+++ b/doc/book/connect/backup.md
@@ -13,7 +13,41 @@ Borg Backup is very popular among the backup tools but it is not yet compatible
We recommend using any other tool listed in this guide because they are all compatible with the S3 API.
If you still want to use Borg, you can use it with `rclone mount`.
+## git-annex
+[git-annex](https://git-annex.branchable.com/) supports synchronizing files
+with its [S3 special remote](https://git-annex.branchable.com/special_remotes/S3/).
+
+Note that `git-annex` needs to be compiled with version `aws-0.24` of the
+Haskell `aws` package to work with Garage.
+
+```bash
+garage key new --name my-key
+garage bucket create my-git-annex
+garage bucket allow my-git-annex --read --write --key my-key
+```
+
+Register your Key ID and Secret key in your environment:
+
+```bash
+export AWS_ACCESS_KEY_ID=GKxxx
+export AWS_SECRET_ACCESS_KEY=xxxx
+```
+
+Within a git-annex enabled repository, configure your Garage S3 endpoint with
+the following command:
+
+```bash
+git annex initremote garage type=S3 encryption=none host=my-garage-instance.mydomain.tld protocol=https bucket=my-git-annex requeststyle=path region=garage signature=v4
+```
+
+Files can now be synchronized using the usual `git-annex` `copy` or `get`
+commands.
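+
+For example, assuming a file `mydata.bin` tracked in the repository:
+
+```bash
+git annex copy mydata.bin --to garage   # send the file content to Garage
+git annex get mydata.bin --from garage  # retrieve it from Garage
+```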
+
+Note that for simplicity, this example does not enable encryption for the
+files sent to Garage. Please refer to the
+[git-annex encryption page](https://git-annex.branchable.com/encryption/) for
+how to configure this.
## Restic
diff --git a/doc/book/connect/cli.md b/doc/book/connect/cli.md
index 74e2d7ed..591ac151 100644
--- a/doc/book/connect/cli.md
+++ b/doc/book/connect/cli.md
@@ -12,6 +12,7 @@ These tools are particularly suitable for debug, backups, website deployments or
| [AWS CLI](#aws-cli) | ✅ | Recommended |
| [rclone](#rclone) | ✅ | |
| [s3cmd](#s3cmd) | ✅ | |
+| [s5cmd](#s5cmd) | ✅ | |
| [(Cyber)duck](#cyberduck) | ✅ | |
| [WinSCP (libs3)](#winscp) | ✅ | CLI instructions only |
| [sftpgo](#sftpgo) | ✅ | |
@@ -178,59 +179,34 @@ s3cmd put /tmp/hello.txt s3://my-bucket/
s3cmd get s3://my-bucket/hello.txt hello.txt
```
-## Cyberduck & duck {#cyberduck}
+## `s5cmd`
-Both Cyberduck (the GUI) and duck (the CLI) have a concept of "Connection Profiles" that contain some presets for a specific provider.
-We wrote the following connection profile for Garage:
-
-```xml
-<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
-<plist version="1.0">
- <dict>
- <key>Protocol</key>
- <string>s3</string>
- <key>Vendor</key>
- <string>garage</string>
- <key>Scheme</key>
- <string>https</string>
- <key>Description</key>
- <string>GarageS3</string>
- <key>Default Hostname</key>
- <string>127.0.0.1</string>
- <key>Default Port</key>
- <string>4443</string>
- <key>Hostname Configurable</key>
- <false/>
- <key>Port Configurable</key>
- <false/>
- <key>Username Configurable</key>
- <true/>
- <key>Username Placeholder</key>
- <string>Access Key ID (GK...)</string>
- <key>Password Placeholder</key>
- <string>Secret Key</string>
- <key>Properties</key>
- <array>
- <string>s3service.disable-dns-buckets=true</string>
- </array>
- <key>Region</key>
- <string>garage</string>
- <key>Regions</key>
- <array>
- <string>garage</string>
- </array>
- </dict>
-</plist>
+Configure your environment with the following credentials:
+
+```bash
+export AWS_ACCESS_KEY_ID=GK...
+export AWS_SECRET_ACCESS_KEY=
+export AWS_DEFAULT_REGION='garage'
+export AWS_ENDPOINT='http://localhost:3900'
```
-*Note: If your garage instance is configured with vhost access style, you can remove `s3service.disable-dns-buckets=true`.*
+After adding these environment variables in your shell, `s5cmd` can be used
+with:
-### Instructions for the GUI
+```bash
+s5cmd --endpoint-url=$AWS_ENDPOINT ls
+```
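+
+For example, to copy a file to and from a bucket (`my-bucket` is a placeholder
+name):
+
+```bash
+s5cmd --endpoint-url=$AWS_ENDPOINT cp /tmp/hello.txt s3://my-bucket/
+s5cmd --endpoint-url=$AWS_ENDPOINT cp s3://my-bucket/hello.txt hello.txt
+```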
+
+See its usage output for other commands available.
+
+## Cyberduck & duck {#cyberduck}
+
+Both Cyberduck (the GUI) and duck (the CLI) have a concept of "Connection Profiles" that contain some presets for a specific provider.
-Copy the connection profile, and save it anywhere as `garage.cyberduckprofile`.
-Then find this file with your file explorer and double click on it: Cyberduck will open a connection wizard for this profile.
-Simply follow the wizard and you should be done!
+Within Cyberduck, a
+[Garage connection profile](https://docs.cyberduck.io/protocols/s3/garage/) is
+available in the `Preferences -> Profiles` section. Once enabled,
+connections to Garage can be configured.
### Instructions for the CLI
diff --git a/doc/book/connect/observability.md b/doc/book/connect/observability.md
new file mode 100644
index 00000000..c5037fa4
--- /dev/null
+++ b/doc/book/connect/observability.md
@@ -0,0 +1,57 @@
++++
+title = "Observability"
+weight = 25
++++
+
+An object store can be used as a storage location for metrics and logs, which
+can then be leveraged for systems observability.
+
+## Metrics
+
+### Prometheus
+
+Prometheus itself has no object store capabilities; however, two projects
+exist which support storing metrics in an object store:
+
+ - [Cortex](https://cortexmetrics.io/)
+ - [Thanos](https://thanos.io/)
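+
+As an illustration, a minimal Thanos object storage configuration pointing at
+Garage could look like the following sketch (the bucket name and credentials
+are placeholders; see the Thanos documentation for all options):
+
+```yaml
+type: S3
+config:
+  bucket: thanos-metrics
+  endpoint: my-garage-instance.mydomain.tld:3900
+  region: garage
+  access_key: GK...
+  secret_key: abcdef0123456789...
+```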
+
+## System logs
+
+### Vector
+
+[Vector](https://vector.dev/) natively supports S3 as a
+[data sink](https://vector.dev/docs/reference/configuration/sinks/aws_s3/)
+(and [source](https://vector.dev/docs/reference/configuration/sources/aws_s3/)).
+
+This can be configured to use Garage as follows:
+
+```bash
+garage key new --name vector-system-logs
+garage bucket create system-logs
+garage bucket allow system-logs --read --write --key vector-system-logs
+```
+
+The `vector.toml` can then be configured as follows:
+
+```toml
+[sources.journald]
+type = "journald"
+current_boot_only = true
+
+[sinks.out]
+encoding.codec = "json"
+type = "aws_s3"
+inputs = [ "journald" ]
+bucket = "system-logs"
+key_prefix = "%F/"
+compression = "none"
+region = "garage"
+endpoint = "https://my-garage-instance.mydomain.tld"
+auth.access_key_id = ""
+auth.secret_access_key = ""
+```
+
+This is an example configuration; please refer to the Vector documentation for
+all configuration and transformation possibilities. Also note that Garage
+performs its own compression, so compression should be disabled in Vector.
diff --git a/doc/book/cookbook/_index.md b/doc/book/cookbook/_index.md
index 6e279363..07bf6ebf 100644
--- a/doc/book/cookbook/_index.md
+++ b/doc/book/cookbook/_index.md
@@ -6,7 +6,7 @@ sort_by = "weight"
+++
A cookbook, when you cook, is a collection of recipes.
-Similarly, Garage's cookbook contains a collection of recipes that are known to works well!
+Similarly, Garage's cookbook contains a collection of recipes that are known to work well!
This chapter could also be referred to as "Tutorials" or "Best practices".
- **[Multi-node deployment](@/documentation/cookbook/real-world.md):** This page will walk you through all of the necessary
@@ -16,6 +16,10 @@ This chapter could also be referred as "Tutorials" or "Best practices".
source in case a binary is not provided for your architecture, or if you want to
hack with us!
+- **[Binary packages](@/documentation/cookbook/binary-packages.md):** This page
+ lists the different platforms that provide ready-built software packages for
+ Garage.
+
- **[Integration with Systemd](@/documentation/cookbook/systemd.md):** This page explains how to run Garage
as a Systemd service (instead of as a Docker container).
@@ -26,6 +30,14 @@ This chapter could also be referred as "Tutorials" or "Best practices".
- **[Configuring a reverse-proxy](@/documentation/cookbook/reverse-proxy.md):** This page explains how to configure a reverse-proxy to add TLS support to your S3 api endpoint.
+- **[Deploying on Kubernetes](@/documentation/cookbook/kubernetes.md):** This page explains how to deploy Garage on Kubernetes using our Helm chart.
+
+- **[Deploying with Ansible](@/documentation/cookbook/ansible.md):** This page lists available Ansible roles developed by the community to deploy Garage.
+
+- **[Monitoring Garage](@/documentation/cookbook/monitoring.md):** This page
+ explains the Prometheus metrics available for monitoring the Garage
+ cluster/nodes.
+
- **[Recovering from failures](@/documentation/cookbook/recovering.md):** Garage's first selling point is resilience
to hardware failures. This section explains how to recover from such a failure in the
best possible way.
diff --git a/doc/book/cookbook/ansible.md b/doc/book/cookbook/ansible.md
new file mode 100644
index 00000000..6d624c9c
--- /dev/null
+++ b/doc/book/cookbook/ansible.md
@@ -0,0 +1,51 @@
++++
+title = "Deploying with Ansible"
+weight = 35
++++
+
+While Ansible is not officially supported to deploy Garage, several community members
+have published Ansible roles. We list them and compare them below.
+
+## Comparison of Ansible roles
+
+| Feature | [ansible-role-garage](#zorun-ansible-role-garage) | [garage-docker-ansible-deploy](#moan0s-garage-docker-ansible-deploy) |
+|------------------------------------|---------------------------------------------|---------------------------------------------------------------|
+| **Runtime** | Systemd | Docker |
+| **Target OS** | Any Linux | Any Linux |
+| **Architecture** | amd64, arm64, i686 | amd64, arm64 |
+| **Additional software** | None | Traefik |
+| **Automatic node connection** | ❌ | ✅ |
+| **Layout management** | ❌ | ✅ |
+| **Manage buckets & keys** | ❌ | ✅ (basic) |
+| **Allow custom Garage config** | ✅ | ❌ |
+| **Facilitate Garage upgrades** | ✅ | ❌ |
+| **Multiple instances on one host** | ✅ | ✅ |
+
+
+## zorun/ansible-role-garage
+
+[Source code](https://github.com/zorun/ansible-role-garage), [Ansible galaxy](https://galaxy.ansible.com/zorun/garage)
+
+This role is deliberately simple: it relies on the official Garage static
+binaries and only requires Systemd. As such, it should work on any
+Linux-based OS.
+
+To make things more flexible, the user has to provide a Garage
+configuration template. This allows customizing the Garage configuration in
+any way.
+
+Some more features might be added, such as a way to automatically connect
+nodes to each other or to define a layout.
+
+## moan0s/garage-docker-ansible-deploy
+
+[Source code](https://github.com/moan0s/garage-docker-ansible-deploy), [Blog post](https://hyteck.de/post/garage/)
+
+This role is based on the Docker image for Garage, and comes with
+"batteries included": it will additionally install Docker and Traefik. In
+addition, it is "opinionated" in the sense that it expects a particular
+deployment structure (one instance per disk, one gateway per host,
+structured DNS names, etc).
+
+As a result, this role makes it easier to start with Garage on Ansible,
+but is less flexible.
diff --git a/doc/book/cookbook/binary-packages.md b/doc/book/cookbook/binary-packages.md
new file mode 100644
index 00000000..606de2b6
--- /dev/null
+++ b/doc/book/cookbook/binary-packages.md
@@ -0,0 +1,28 @@
++++
+title = "Binary packages"
+weight = 11
++++
+
+Garage is also available as binary packages on the following platforms:
+
+## Alpine Linux
+
+```bash
+apk add garage
+```
+
+## Arch Linux
+
+Garage is available in the [AUR](https://aur.archlinux.org/packages/garage).
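+
+For example, it can be installed with an AUR helper such as `yay`:
+
+```bash
+yay -S garage
+```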
+
+## FreeBSD
+
+```bash
+pkg install garage
+```
+
+## NixOS
+
+```bash
+nix-shell -p garage
+```
diff --git a/doc/book/cookbook/gateways.md b/doc/book/cookbook/gateways.md
index 62ed0fe2..ce4c7fa8 100644
--- a/doc/book/cookbook/gateways.md
+++ b/doc/book/cookbook/gateways.md
@@ -21,7 +21,7 @@ You can configure Garage as a gateway on all nodes that will consume your S3 API
The instructions are similar to those for a regular node; the only difference is that while configuring the node, you must set the `--gateway` parameter:
```bash
-garage layout assign --gateway --tag gw1 <node_id>
+garage layout assign --gateway --tag gw1 -z dc1 <node_id>
garage layout show # review the changes you are making
garage layout apply # once satisfied, apply the changes
```
diff --git a/doc/book/cookbook/kubernetes.md b/doc/book/cookbook/kubernetes.md
index 9eafe3e1..dfeb3281 100644
--- a/doc/book/cookbook/kubernetes.md
+++ b/doc/book/cookbook/kubernetes.md
@@ -48,7 +48,8 @@ garage:
replicationMode: "2"
# Start 4 instances (StatefulSets) of garage
-replicaCount: 4
+deployment:
+ replicaCount: 4
# Override default storage class and size
persistence:
diff --git a/doc/book/cookbook/monitoring.md b/doc/book/cookbook/monitoring.md
index 8206f645..8313daa9 100644
--- a/doc/book/cookbook/monitoring.md
+++ b/doc/book/cookbook/monitoring.md
@@ -52,255 +52,6 @@ or make your own.
We detail below the list of exposed metrics and their meaning.
-
## List of exported metrics
-
-### Metrics of the API endpoints
-
-#### `api_admin_request_counter` (counter)
-
-Counts the number of requests to a given endpoint of the administration API. Example:
-
-```
-api_admin_request_counter{api_endpoint="Metrics"} 127041
-```
-
-#### `api_admin_request_duration` (histogram)
-
-Evaluates the duration of API calls to the various administration API endpoint. Example:
-
-```
-api_admin_request_duration_bucket{api_endpoint="Metrics",le="0.5"} 127041
-api_admin_request_duration_sum{api_endpoint="Metrics"} 605.250344830999
-api_admin_request_duration_count{api_endpoint="Metrics"} 127041
-```
-
-#### `api_s3_request_counter` (counter)
-
-Counts the number of requests to a given endpoint of the S3 API. Example:
-
-```
-api_s3_request_counter{api_endpoint="CreateMultipartUpload"} 1
-```
-
-#### `api_s3_error_counter` (counter)
-
-Counts the number of requests to a given endpoint of the S3 API that returned an error. Example:
-
-```
-api_s3_error_counter{api_endpoint="GetObject",status_code="404"} 39
-```
-
-#### `api_s3_request_duration` (histogram)
-
-Evaluates the duration of API calls to the various S3 API endpoints. Example:
-
-```
-api_s3_request_duration_bucket{api_endpoint="CreateMultipartUpload",le="0.5"} 1
-api_s3_request_duration_sum{api_endpoint="CreateMultipartUpload"} 0.046340762
-api_s3_request_duration_count{api_endpoint="CreateMultipartUpload"} 1
-```
-
-#### `api_k2v_request_counter` (counter), `api_k2v_error_counter` (counter), `api_k2v_error_duration` (histogram)
-
-Same as for S3, for the K2V API.
-
-
-### Metrics of the Web endpoint
-
-
-#### `web_request_counter` (counter)
-
-Number of requests to the web endpoint
-
-```
-web_request_counter{method="GET"} 80
-```
-
-#### `web_request_duration` (histogram)
-
-Duration of requests to the web endpoint
-
-```
-web_request_duration_bucket{method="GET",le="0.5"} 80
-web_request_duration_sum{method="GET"} 1.0528433229999998
-web_request_duration_count{method="GET"} 80
-```
-
-#### `web_error_counter` (counter)
-
-Number of requests to the web endpoint resulting in errors
-
-```
-web_error_counter{method="GET",status_code="404 Not Found"} 64
-```
-
-
-### Metrics of the data block manager
-
-#### `block_bytes_read`, `block_bytes_written` (counter)
-
-Number of bytes read/written to/from disk in the data storage directory.
-
-```
-block_bytes_read 120586322022
-block_bytes_written 3386618077
-```
-
-#### `block_read_duration`, `block_write_duration` (histograms)
-
-Evaluates the duration of the reading/writing of individual data blocks in the data storage directory.
-
-```
-block_read_duration_bucket{le="0.5"} 169229
-block_read_duration_sum 2761.6902550310056
-block_read_duration_count 169240
-block_write_duration_bucket{le="0.5"} 3559
-block_write_duration_sum 195.59170078500006
-block_write_duration_count 3571
-```
-
-#### `block_delete_counter` (counter)
-
-Counts the number of data blocks that have been deleted from storage.
-
-```
-block_delete_counter 122
-```
-
-#### `block_resync_counter` (counter), `block_resync_duration` (histogram)
-
-Counts the number of resync operations the node has executed, and evaluates their duration.
-
-```
-block_resync_counter 308897
-block_resync_duration_bucket{le="0.5"} 308892
-block_resync_duration_sum 139.64204196100016
-block_resync_duration_count 308897
-```
-
-#### `block_resync_queue_length` (gauge)
-
-The number of block hashes currently queued for a resync.
-This is normal to be nonzero for long periods of time.
-
-```
-block_resync_queue_length 0
-```
-
-#### `block_resync_errored_blocks` (gauge)
-
-The number of block hashes that we were unable to resync last time we tried.
-**THIS SHOULD BE ZERO, OR FALL BACK TO ZERO RAPIDLY, IN A HEALTHY CLUSTER.**
-Persistent nonzero values indicate that some data is likely to be lost.
-
-```
-block_resync_errored_blocks 0
-```
-
-
-### Metrics related to RPCs (remote procedure calls) between nodes
-
-#### `rpc_netapp_request_counter` (counter)
-
-Number of RPC requests emitted
-
-```
-rpc_request_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 176
-```
-
-#### `rpc_netapp_error_counter` (counter)
-
-Number of communication errors (errors in the Netapp library, generally due to disconnected nodes)
-
-```
-rpc_netapp_error_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 354
-```
-
-#### `rpc_timeout_counter` (counter)
-
-Number of RPC timeouts, should be close to zero in a healthy cluster.
-
-```
-rpc_timeout_counter{from="<this node>",rpc_endpoint="garage_rpc/membership.rs/SystemRpc",to="<remote node>"} 1
-```
-
-#### `rpc_duration` (histogram)
-
-The duration of internal RPC calls between Garage nodes.
-
-```
-rpc_duration_bucket{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>",le="0.5"} 166
-rpc_duration_sum{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 35.172253716
-rpc_duration_count{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 174
-```
-
-
-### Metrics of the metadata table manager
-
-#### `table_gc_todo_queue_length` (gauge)
-
-Table garbage collector TODO queue length
-
-```
-table_gc_todo_queue_length{table_name="block_ref"} 0
-```
-
-#### `table_get_request_counter` (counter), `table_get_request_duration` (histogram)
-
-Number of get/get_range requests internally made on each table, and their duration.
-
-```
-table_get_request_counter{table_name="bucket_alias"} 315
-table_get_request_duration_bucket{table_name="bucket_alias",le="0.5"} 315
-table_get_request_duration_sum{table_name="bucket_alias"} 0.048509778000000024
-table_get_request_duration_count{table_name="bucket_alias"} 315
-```
-
-
-#### `table_put_request_counter` (counter), `table_put_request_duration` (histogram)
-
-Number of insert/insert_many requests internally made on this table, and their duration
-
-```
-table_put_request_counter{table_name="block_ref"} 677
-table_put_request_duration_bucket{table_name="block_ref",le="0.5"} 677
-table_put_request_duration_sum{table_name="block_ref"} 61.617528636
-table_put_request_duration_count{table_name="block_ref"} 677
-```
-
-#### `table_internal_delete_counter` (counter)
-
-Number of value deletions in the tree (due to GC or repartitioning)
-
-```
-table_internal_delete_counter{table_name="block_ref"} 2296
-```
-
-#### `table_internal_update_counter` (counter)
-
-Number of value updates where the value actually changes (includes creation of new key and update of existing key)
-
-```
-table_internal_update_counter{table_name="block_ref"} 5996
-```
-
-#### `table_merkle_updater_todo_queue_length` (gauge)
-
-Merkle tree updater TODO queue length (should fall to zero rapidly)
-
-```
-table_merkle_updater_todo_queue_length{table_name="block_ref"} 0
-```
-
-#### `table_sync_items_received`, `table_sync_items_sent` (counters)
-
-Number of data items sent to/recieved from other nodes during resync procedures
-
-```
-table_sync_items_received{from="<remote node>",table_name="bucket_v2"} 3
-table_sync_items_sent{table_name="block_ref",to="<remote node>"} 2
-```
-
-
+See our [dedicated page](@/documentation/reference-manual/monitoring.md) in the Reference manual section.

diff --git a/doc/book/cookbook/real-world.md b/doc/book/cookbook/real-world.md
index 5423bbab..08266b23 100644
--- a/doc/book/cookbook/real-world.md
+++ b/doc/book/cookbook/real-world.md
@@ -19,8 +19,12 @@ To run a real-world deployment, make sure the following conditions are met:
- You have at least three machines with sufficient storage space available.
-- Each machine has a public IP address which is reachable by other machines.
- Running behind a NAT is likely to be possible but hasn't been tested for the latest version (TODO).
+- Each machine has a public IP address which is reachable by other machines. It
+  is highly recommended that you use IPv6 for this end-to-end connectivity. If
+  IPv6 is not available, a mesh VPN such as
+  [Nebula](https://github.com/slackhq/nebula) or
+  [Yggdrasil](https://yggdrasil-network.github.io/) is an approach to consider,
+  in addition to building out your own VPN tunneling.
- This guide will assume you are using Docker containers to deploy Garage on each node.
Garage can also be run independently, for instance as a [Systemd service](@/documentation/cookbook/systemd.md).
@@ -345,5 +349,5 @@ and is covered in the [quick start guide](@/documentation/quick-start/_index.md)
Remember also that the CLI is self-documented thanks to the `--help` flag and
the `help` subcommand (e.g. `garage help`, `garage key --help`).
-Configuring S3-compatible applicatiosn to interact with Garage
+Configuring S3-compatible applications to interact with Garage
is covered in the [Integrations](@/documentation/connect/_index.md) section.
diff --git a/doc/book/cookbook/reverse-proxy.md b/doc/book/cookbook/reverse-proxy.md
index c8fde28d..9c833ad0 100644
--- a/doc/book/cookbook/reverse-proxy.md
+++ b/doc/book/cookbook/reverse-proxy.md
@@ -168,40 +168,65 @@ Here is [a basic configuration file](https://doc.traefik.io/traefik/https/acme/#
### Add Garage service
+To add Garage to Traefik, you should declare two new services using its IP
+address (or hostname) and port; these are used for the S3 and web components
+of Garage:
```toml
[http.services]
- [http.services.my_garage_service.loadBalancer]
- [[http.services.my_garage_service.loadBalancer.servers]]
+ [http.services.garage-s3-service.loadBalancer]
+ [[http.services.garage-s3-service.loadBalancer.servers]]
url = "http://xxx.xxx.xxx.xxx"
port = 3900
+
+ [http.services.garage-web-service.loadBalancer]
+ [[http.services.garage-web-service.loadBalancer.servers]]
+ url = "http://xxx.xxx.xxx.xxx"
+ port = 3902
```
It's possible to declare multiple Garage servers as back-ends:
```toml
[http.services]
- [[http.services.my_garage_service.loadBalancer.servers]]
+ [[http.services.garage-s3-service.loadBalancer.servers]]
url = "http://xxx.xxx.xxx.xxx"
port = 3900
- [[http.services.my_garage_service.loadBalancer.servers]]
+ [[http.services.garage-s3-service.loadBalancer.servers]]
url = "http://yyy.yyy.yyy.yyy"
port = 3900
- [[http.services.my_garage_service.loadBalancer.servers]]
+ [[http.services.garage-s3-service.loadBalancer.servers]]
url = "http://zzz.zzz.zzz.zzz"
port = 3900
+
+ [[http.services.garage-web-service.loadBalancer.servers]]
+ url = "http://xxx.xxx.xxx.xxx"
+ port = 3902
+ [[http.services.garage-web-service.loadBalancer.servers]]
+ url = "http://yyy.yyy.yyy.yyy"
+ port = 3902
+ [[http.services.garage-web-service.loadBalancer.servers]]
+ url = "http://zzz.zzz.zzz.zzz"
+ port = 3902
```
Traefik can remove unhealthy servers automatically with [a health check configuration](https://doc.traefik.io/traefik/routing/services/#health-check):
```
[http.services]
- [http.services.my_garage_service.loadBalancer]
- [http.services.my_garage_service.loadBalancer.healthCheck]
- path = "/"
- interval = "60s"
- timeout = "5s"
+ [http.services.garage-s3-service.loadBalancer]
+ [http.services.garage-s3-service.loadBalancer.healthCheck]
+ path = "/health"
+ port = "3903"
+ #interval = "15s"
+ #timeout = "2s"
+
+ [http.services.garage-web-service.loadBalancer]
+ [http.services.garage-web-service.loadBalancer.healthCheck]
+ path = "/health"
+ port = "3903"
+ #interval = "15s"
+ #timeout = "2s"
```
### Adding a website
@@ -210,10 +235,15 @@ To add a new website, add the following declaration to your Traefik configuratio
```toml
[http.routers]
+ [http.routers.garage-s3]
+ rule = "Host(`s3.example.org`)"
+ service = "garage-s3-service"
+ entryPoints = ["websecure"]
+
[http.routers.my_website]
rule = "Host(`yoururl.example.org`)"
- service = "my_garage_service"
- entryPoints = ["web"]
+ service = "garage-web-service"
+ entryPoints = ["websecure"]
```
Enable HTTPS access to your website with the following configuration section ([documentation](https://doc.traefik.io/traefik/https/overview/)):
@@ -226,7 +256,7 @@ Enable HTTPS access to your website with the following configuration section ([d
...
```
-### Adding gzip compression
+### Adding compression
Add the following configuration section [to compress responses](https://doc.traefik.io/traefik/middlewares/http/compress/) using [gzip](https://developer.mozilla.org/en-US/docs/Glossary/GZip_compression) before sending them to the client:
@@ -234,10 +264,10 @@ Add the following configuration section [to compress response](https://doc.traef
[http.routers]
[http.routers.my_website]
...
- middlewares = ["gzip_compress"]
+ middlewares = ["compression"]
...
[http.middlewares]
- [http.middlewares.gzip_compress.compress]
+ [http.middlewares.compression.compress]
```
### Add caching response
@@ -262,27 +292,54 @@ Traefik's caching middleware is only available on [entreprise version](https://d
entryPoint = "web"
[http.routers]
+ [http.routers.garage-s3]
+ rule = "Host(`s3.example.org`)"
+ service = "garage-s3-service"
+ entryPoints = ["websecure"]
+
[http.routers.my_website]
rule = "Host(`yoururl.example.org`)"
- service = "my_garage_service"
- middlewares = ["gzip_compress"]
+ service = "garage-web-service"
+ middlewares = ["compression"]
entryPoints = ["websecure"]
[http.services]
- [http.services.my_garage_service.loadBalancer]
- [http.services.my_garage_service.loadBalancer.healthCheck]
- path = "/"
- interval = "60s"
- timeout = "5s"
- [[http.services.my_garage_service.loadBalancer.servers]]
+ [http.services.garage-s3-service.loadBalancer]
+ [http.services.garage-s3-service.loadBalancer.healthCheck]
+ path = "/health"
+ port = "3903"
+ #interval = "15s"
+ #timeout = "2s"
+
+ [http.services.garage-web-service.loadBalancer]
+ [http.services.garage-web-service.loadBalancer.healthCheck]
+ path = "/health"
+ port = "3903"
+ #interval = "15s"
+ #timeout = "2s"
+
+ [[http.services.garage-s3-service.loadBalancer.servers]]
+ url = "http://xxx.xxx.xxx.xxx"
+ port = 3900
+ [[http.services.garage-s3-service.loadBalancer.servers]]
+ url = "http://yyy.yyy.yyy.yyy"
+ port = 3900
+ [[http.services.garage-s3-service.loadBalancer.servers]]
+ url = "http://zzz.zzz.zzz.zzz"
+ port = 3900
+
+ [[http.services.garage-web-service.loadBalancer.servers]]
url = "http://xxx.xxx.xxx.xxx"
- [[http.services.my_garage_service.loadBalancer.servers]]
+ port = 3902
+ [[http.services.garage-web-service.loadBalancer.servers]]
url = "http://yyy.yyy.yyy.yyy"
- [[http.services.my_garage_service.loadBalancer.servers]]
+ port = 3902
+ [[http.services.garage-web-service.loadBalancer.servers]]
url = "http://zzz.zzz.zzz.zzz"
+ port = 3902
[http.middlewares]
- [http.middlewares.gzip_compress.compress]
+ [http.middlewares.compression.compress]
```
## Caddy
@@ -291,18 +348,83 @@ Your Caddy configuration can be as simple as:
```caddy
s3.garage.tld, *.s3.garage.tld {
- reverse_proxy localhost:3900 192.168.1.2:3900 example.tld:3900
+ reverse_proxy localhost:3900 192.168.1.2:3900 example.tld:3900 {
+ health_uri /health
+ health_port 3903
+ #health_interval 15s
+ #health_timeout 5s
+ }
}
*.web.garage.tld {
- reverse_proxy localhost:3902 192.168.1.2:3900 example.tld:3900
+ reverse_proxy localhost:3902 192.168.1.2:3902 example.tld:3902 {
+ health_uri /health
+ health_port 3903
+ #health_interval 15s
+ #health_timeout 5s
+ }
}
admin.garage.tld {
- reverse_proxy localhost:3903
+ reverse_proxy localhost:3903 {
+ health_uri /health
+ health_port 3903
+ #health_interval 15s
+ #health_timeout 5s
+ }
}
```
But at the same time, the `reverse_proxy` is very flexible.
For a production deployment, you should [read its documentation](https://caddyserver.com/docs/caddyfile/directives/reverse_proxy) as it supports features like DNS discovery of upstreams, load balancing with checks, streaming parameters, etc.
+### On-demand TLS
+
+Caddy supports a technique called
+[on-demand TLS](https://caddyserver.com/docs/automatic-https#on-demand-tls), by
+which one can configure the webserver to provision TLS certificates when a
+client first connects to it.
+
+In order to prevent an attack vector whereby domains are simply pointed at your
+webserver and certificates are requested for them, Caddy can be configured to
+ask Garage whether a domain is authorized for web hosting before it requests
+a TLS certificate.
+
+This 'check' endpoint, which is on the admin port (3903 by default), can be
+configured in Caddy's global section as follows:
+
+```caddy
+{
+ ...
+ on_demand_tls {
+ ask http://localhost:3903/check
+ interval 2m
+ burst 5
+ }
+ ...
+}
+```
+
+The host section can then be configured as follows (note that this uses the
+web endpoint instead):
+
+```caddy
+# For a specific set of subdomains
+*.web.garage.tld {
+ tls {
+ on_demand
+ }
+
+ reverse_proxy localhost:3902 192.168.1.2:3902 example.tld:3902
+}
+
+# Accept all domains on HTTPS
+# Never configure this without the global section above
+https:// {
+ tls {
+ on_demand
+ }
+
+ reverse_proxy localhost:3902 192.168.1.2:3902 example.tld:3902
+}
+```
diff --git a/doc/book/cookbook/upgrading.md b/doc/book/cookbook/upgrading.md
index 9f2ba73b..9d60a988 100644
--- a/doc/book/cookbook/upgrading.md
+++ b/doc/book/cookbook/upgrading.md
@@ -6,45 +6,80 @@ weight = 60
Garage is a stateful clustered application, where all nodes communicate together and share data structures.
It makes upgrades more difficult than for stateless applications, so you must be careful when upgrading.
On a new version release, there are two possibilities:
- - protocols and data structures remained the same ➡️ this is a **straightforward upgrade**
- - protocols or data structures changed ➡️ this is an **advanced upgrade**
+ - protocols and data structures remained the same ➡️ this is a **minor upgrade**
+ - protocols or data structures changed ➡️ this is a **major upgrade**
-You can quickly now what type of update you will have to operate by looking at the version identifier.
-Following the [SemVer ](https://semver.org/) terminology, if only the *patch* number changed, it will only need a straightforward upgrade.
-Example: an upgrade from v0.6.0 from v0.6.1 is a straightforward upgrade.
-If the *minor* or *major* number changed however, you will have to do an advanced upgrade. Example: from v0.6.1 to v0.7.0.
+You can quickly know what type of upgrade you will have to perform by looking at the version identifier:
+when we require our users to do a major upgrade, we will always bump the first nonzero component of the version identifier
+(e.g. from v0.7.2 to v0.8.0).
+Conversely, for versions that only require a minor upgrade, the first nonzero component will always stay the same (e.g. from v0.8.0 to v0.8.1).
-Migrations are designed to be run only between contiguous versions (from a *major*.*minor* perspective, *patches* can be skipped).
-Example: migrations from v0.6.1 to v0.7.0 and from v0.6.0 to v0.7.0 are supported but migrations from v0.5.0 to v0.7.0 are not supported.
+Major upgrades are designed to be run only between contiguous versions.
+Example: migrations from v0.7.1 to v0.8.0 and from v0.7.0 to v0.8.2 are supported but migrations from v0.6.0 to v0.8.0 are not supported.
-## Straightforward upgrades
+The `garage_build_info`
+[Prometheus metric](@/documentation/reference-manual/monitoring.md) provides
+an overview of which Garage versions are currently in use within a cluster.
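+
+For example, an illustrative PromQL query to count nodes per running version:
+
+```
+count by (version) (garage_build_info)
+```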
-Straightforward upgrades do not imply cluster downtime.
+## Minor upgrades
+
+Minor upgrades do not imply cluster downtime.
Before upgrading, you should still read [the changelog](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases) and ideally test your deployment on a staging cluster first.
When you are ready, start by checking the health of your cluster.
-You can force some checks with `garage repair`, we recommend at least running `garage repair --all-nodes --yes` that is very quick to run (less than a minute).
-You will see that the command correctly terminated in the logs of your daemon.
+You can force some checks with `garage repair`, we recommend at least running `garage repair --all-nodes --yes tables` which is very quick to run (less than a minute).
+You will see that the command correctly terminated in the logs of your daemon, or using `garage worker list` (the repair workers should be in the `Done` state).
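+
+For example:
+
+```bash
+garage repair --all-nodes --yes tables  # quick consistency check on all nodes
+garage worker list                      # repair workers should reach the Done state
+```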
-Finally, you can simply upgrades nodes one by one.
-For each node: stop it, install the new binary, edit the configuration if needed, restart it.
+Finally, you can simply upgrade nodes one by one.
+For each node: stop it, install the new binary, edit the configuration if needed, restart it.
-## Advanced upgrades
+## Major upgrades
-Advanced upgrades will imply cluster downtime.
+Major upgrades can be done with minimal downtime with a bit of preparation, but the simplest way is usually to take the cluster offline for the duration of the migration.
Before upgrading, you must read [the changelog](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases) and you must test your deployment on a staging cluster first.
-From a high level perspective, an advanced upgrade looks like this:
- 1. Make sure the health of your cluster is good (see `garage repair`)
- 2. Disable API access (comment the configuration in your reverse proxy)
- 3. Check that your cluster is idle
+We write guides for each major upgrade, they are stored under the "Working Documents" section of this documentation.
+
+### Major upgrades with full downtime
+
+From a high level perspective, a major upgrade looks like this:
+
+ 1. Disable API access (for instance in your reverse proxy, or by commenting the corresponding section in your Garage configuration file and restarting Garage)
+ 2. Check that your cluster is idle
+ 3. Make sure the health of your cluster is good (see `garage repair`)
4. Stop the whole cluster
- 5. Backup the metadata folder of all your nodes, so that you will be able to restore it quickly if the upgrade fails (blocks being immutable, they should not be impacted)
+ 5. Back up the metadata folder of all your nodes, so that you will be able to restore it if the upgrade fails (data blocks being immutable, they should not be impacted)
6. Install the new binary, update the configuration
7. Start the whole cluster
8. If needed, run the corresponding migration from `garage migrate`
9. Make sure the health of your cluster is good
- 10. Enable API access (uncomment the configuration in your reverse proxy)
+ 10. Enable API access (reverse step 1)
11. Monitor your cluster while load comes back, check that all your applications are happy with this new version
-We write guides for each advanced upgrade, they are stored under the "Working Documents" section of this documentation.
+### Major upgrades with minimal downtime
+
+There is only one operation that has to be coordinated cluster-wide: the transition from one version of the internal RPC protocol to the next.
+This means that an upgrade with very limited downtime can simply be performed from one major version to the next by restarting all nodes
+simultaneously in the new version.
+The downtime will simply be the time required for all nodes to stop and start again, which should be less than a minute.
+If the nodes fail to stop and restart simultaneously, some nodes might be temporarily shut out from the cluster, as nodes using different RPC protocol
+versions are prevented from talking to one another.
+
+The entire procedure would look something like this:
+
+1. Make sure the health of your cluster is good (see `garage repair`)
+
+2. Take each node offline individually to back up its metadata folder, then bring it back online once the backup is done.
+   You can do all of the nodes in a single zone at once as that won't impact global cluster availability.
+   Do not try to make a backup of the metadata folder of a running node.
+
+3. Prepare your binaries and configuration files for the new Garage version
+
+4. Restart all nodes simultaneously in the new version (see the sketch below)
+
+5. If any specific migration procedure is required, it is usually in one of these two cases:
+
+   - It can be run on online nodes after the new version has started, during regular cluster operation.
+   - It has to be run offline.
+
+ For this last step, please refer to the specific documentation pertaining to the version upgrade you are doing.
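+
+Step 4 above can be scripted; a minimal sketch, assuming Systemd-managed nodes
+reachable over SSH (hostnames are placeholders):
+
+```bash
+for host in garage1 garage2 garage3; do
+  ssh "$host" 'sudo systemctl restart garage' &
+done
+wait  # all nodes restart nearly simultaneously
+```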
diff --git a/doc/book/design/_index.md b/doc/book/design/_index.md
index a3a6ac11..50933139 100644
--- a/doc/book/design/_index.md
+++ b/doc/book/design/_index.md
@@ -20,12 +20,16 @@ and could not do, etc.
We love to talk and hear about Garage, that's why we keep a log here:
- - [(fr, 2021-11-13, video) Garage : Mille et une façons de stocker vos données](https://video.tedomum.net/w/moYKcv198dyMrT8hCS5jz9) and [slides (html)](https://rfid.deuxfleurs.fr/presentations/2021-11-13/garage/) - during [RFID#1](https://rfid.deuxfleurs.fr/programme/2021-11-13/) event
+ - [(en, 2023-01-18) Presentation of Garage with some details on CRDTs and data partitioning among nodes](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4cff37397f626ef063dad29e5b5e97ab1206015d/doc/talks/2023-01-18-tocatta/talk.pdf)
+
+ - [(fr, 2022-11-19) De l'auto-hébergement à l'entre-hébergement : Garage, pour conserver ses données ensemble](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4cff37397f626ef063dad29e5b5e97ab1206015d/doc/talks/2022-11-19-Capitole-du-Libre/pr%C3%A9sentation.pdf)
- - [(en, 2021-04-28) Distributed object storage is centralised](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2021-04-28_spirals-team/talk.pdf)
+ - [(en, 2022-06-23) General presentation of Garage](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4cff37397f626ef063dad29e5b5e97ab1206015d/doc/talks/2022-06-23-stack/talk.pdf)
+
+ - [(fr, 2021-11-13, video) Garage : Mille et une façons de stocker vos données](https://video.tedomum.net/w/moYKcv198dyMrT8hCS5jz9) and [slides (html)](https://rfid.deuxfleurs.fr/presentations/2021-11-13/garage/) - during [RFID#1](https://rfid.deuxfleurs.fr/programme/2021-11-13/) event
- - [(fr, 2020-12-02) Garage : jouer dans la cour des grands quand on est un hébergeur associatif](https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2020-12-02_wide-team/talk.pdf)
+ - [(en, 2021-04-28) Distributed object storage is centralised](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2021-04-28_spirals-team/talk.pdf)
-*Did you write or talk about Garage? [Open a pull request](https://git.deuxfleurs.fr/Deuxfleurs/garage/) to add a link here!*
+ - [(fr, 2020-12-02) Garage : jouer dans la cour des grands quand on est un hébergeur associatif](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/b1f60579a13d3c5eba7f74b1775c84639ea9b51a/doc/talks/2020-12-02_wide-team/talk.pdf)
diff --git a/doc/book/design/related-work.md b/doc/book/design/related-work.md
index f96c6618..6c1a6b12 100644
--- a/doc/book/design/related-work.md
+++ b/doc/book/design/related-work.md
@@ -72,8 +72,7 @@ We considered there v2's design but concluded that it does not fit both our *Sel
**[Riak CS](https://docs.riak.com/riak/cs/2.1.1/index.html):**
*Not written yet*
-**[IPFS](https://ipfs.io/):**
-*Not written yet*
+**[IPFS](https://ipfs.io/):** IPFS has design goals radically different from Garage, we have [a blog post](@/blog/2022-ipfs/index.md) talking about it.
## Specific research papers
diff --git a/doc/book/development/devenv.md b/doc/book/development/devenv.md
index c2ef4e7d..8d7d2e95 100644
--- a/doc/book/development/devenv.md
+++ b/doc/book/development/devenv.md
@@ -39,7 +39,7 @@ Now you can enter our nix-shell, all the required packages will be downloaded bu
nix-shell
```
-You can use the traditionnal Rust development workflow:
+You can use the traditional Rust development workflow:
```bash
cargo build # compile the project
diff --git a/doc/book/development/release-process.md b/doc/book/development/release-process.md
index f6db971a..3fed4add 100644
--- a/doc/book/development/release-process.md
+++ b/doc/book/development/release-process.md
@@ -11,7 +11,7 @@ We define them as our release process.
While we run some tests on every commit, we do not make a release for all of them.
A release can be triggered manually by "promoting" a successful build.
-Otherwise, every weeks, a release build is triggered on the `main` branch.
+Otherwise, every night, a release build is triggered on the `main` branch.
If the build is from a tag following the regex: `v[0-9]+\.[0-9]+\.[0-9]+`, it will be listed as stable.
If it is a tag but with a different format, it will be listed as Extra.
diff --git a/doc/book/quick-start/_index.md b/doc/book/quick-start/_index.md
index ab83b75a..feab45f0 100644
--- a/doc/book/quick-start/_index.md
+++ b/doc/book/quick-start/_index.md
@@ -290,13 +290,13 @@ sourcing the right file.*
aws s3 ls
# list objects of a bucket
-aws s3 ls s3://my_files
+aws s3 ls s3://nextcloud-bucket
# copy from your filesystem to garage
-aws s3 cp /proc/cpuinfo s3://my_files/cpuinfo.txt
+aws s3 cp /proc/cpuinfo s3://nextcloud-bucket/cpuinfo.txt
# copy from garage to your filesystem
-aws s3 cp s3/my_files/cpuinfo.txt /tmp/cpuinfo.txt
+aws s3 cp s3://nextcloud-bucket/cpuinfo.txt /tmp/cpuinfo.txt
```
Note that you can use `awscli` for more advanced operations like
diff --git a/doc/book/reference-manual/admin-api.md b/doc/book/reference-manual/admin-api.md
index 0b7e2e16..363bc886 100644
--- a/doc/book/reference-manual/admin-api.md
+++ b/doc/book/reference-manual/admin-api.md
@@ -1,6 +1,6 @@
+++
title = "Administration API"
-weight = 60
+weight = 40
+++
The Garage administration API is accessible through a dedicated server whose
diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md
index 2d9c3f0c..38062bab 100644
--- a/doc/book/reference-manual/configuration.md
+++ b/doc/book/reference-manual/configuration.md
@@ -3,6 +3,8 @@ title = "Configuration file format"
weight = 20
+++
+## Full example
+
Here is an example `garage.toml` configuration file that illustrates all of the possible options:
```toml
@@ -96,7 +98,7 @@ Performance characteristics of the different DB engines are as follows:
- Sled: the default database engine, which tends to produce
large data files and also has performance issues, especially when the metadata folder
- is on a traditionnal HDD and not on SSD.
+ is on a traditional HDD and not on SSD.
- LMDB: the recommended alternative on 64-bit systems,
much more space-efficient and slightly faster. Note that the data format of LMDB is not portable
between architectures, so for instance the Garage database of an x86-64
@@ -259,13 +261,17 @@ Compression is done synchronously, setting a value too high will add latency to
This value can be different between nodes; compression is done by the node which receives the
API call.
-### `rpc_secret`
+### `rpc_secret`, `rpc_secret_file` or `GARAGE_RPC_SECRET` (env)
+
+Garage uses a secret key, called an RPC secret, that is shared between all
+nodes of the cluster in order to identify these nodes and allow them to
+communicate together. The RPC secret is a 32-byte hex-encoded random string,
+which can be generated with a command such as `openssl rand -hex 32`.
-Garage uses a secret key that is shared between all nodes of the cluster
-in order to identify these nodes and allow them to communicate together.
-This key should be specified here in the form of a 32-byte hex-encoded
-random string. Such a string can be generated with a command
-such as `openssl rand -hex 32`.
+The RPC secret should be specified in the `rpc_secret` configuration variable.
+Since Garage `v0.8.2`, the RPC secret can also be stored in a file whose path is
+given in the configuration variable `rpc_secret_file`, or specified as an
+environment variable `GARAGE_RPC_SECRET`.
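+
+For example (a placeholder value; generate your own with `openssl rand -hex 32`):
+
+```toml
+rpc_secret = "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
+```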
### `rpc_bind_addr`
@@ -407,24 +413,30 @@ If specified, Garage will bind an HTTP server to this port and address, on
which it will listen to requests for administration features.
See [administration API reference](@/documentation/reference-manual/admin-api.md) to learn more about these features.
-### `metrics_token` (since version 0.7.2)
+### `metrics_token`, `metrics_token_file` or `GARAGE_METRICS_TOKEN` (env)
-The token for accessing the Metrics endpoint. If this token is not set in
-the config file, the Metrics endpoint can be accessed without access
-control.
+The token for accessing the Metrics endpoint. If this token is not set, the
+Metrics endpoint can be accessed without access control.
You can use any random string for this value. We recommend generating a random token with `openssl rand -hex 32`.
-### `admin_token` (since version 0.7.2)
+`metrics_token` was introduced in Garage `v0.7.2`.
+`metrics_token_file` and the `GARAGE_METRICS_TOKEN` environment variable are supported since Garage `v0.8.2`.
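+
+For example, a sketch of querying the metrics endpoint with the token
+(assuming the admin API listens on `localhost:3903`):
+
+```bash
+curl -H "Authorization: Bearer $GARAGE_METRICS_TOKEN" http://localhost:3903/metrics
+```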
+
+
+### `admin_token`, `admin_token_file` or `GARAGE_ADMIN_TOKEN` (env)
The token for accessing all of the other administration endpoints. If this
-token is not set in the config file, access to these endpoints is disabled
-entirely.
+token is not set, access to these endpoints is disabled entirely.
You can use any random string for this value. We recommend generating a random token with `openssl rand -hex 32`.
+`admin_token` was introduced in Garage `v0.7.2`.
+`admin_token_file` and the `GARAGE_ADMIN_TOKEN` environment variable are supported since Garage `v0.8.2`.
+
+
### `trace_sink`
-Optionnally, the address of an Opentelemetry collector. If specified,
-Garage will send traces in the Opentelemetry format to this endpoint. These
+Optionally, the address of an OpenTelemetry collector. If specified,
+Garage will send traces in the OpenTelemetry format to this endpoint. These
traces allow you to inspect Garage's operation when it handles S3 API requests.
diff --git a/doc/book/reference-manual/k2v.md b/doc/book/reference-manual/k2v.md
index 207d056a..ed069b27 100644
--- a/doc/book/reference-manual/k2v.md
+++ b/doc/book/reference-manual/k2v.md
@@ -1,6 +1,6 @@
+++
title = "K2V"
-weight = 70
+weight = 100
+++
Starting with version 0.7.2, Garage introduces an optional feature, K2V,
@@ -16,7 +16,7 @@ the `k2v` feature flag enabled can be obtained from our download page under
with `-k2v` (example: `v0.7.2-k2v`).
The specification of the K2V API can be found
-[here](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/k2v/doc/drafts/k2v-spec.md).
+[here](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/drafts/k2v-spec.md).
This document also includes a high-level overview of K2V's design.
The K2V API uses AWSv4 signatures for authentication, same as the S3 API.
diff --git a/doc/book/reference-manual/monitoring.md b/doc/book/reference-manual/monitoring.md
new file mode 100644
index 00000000..97c533d3
--- /dev/null
+++ b/doc/book/reference-manual/monitoring.md
@@ -0,0 +1,285 @@
++++
+title = "Monitoring"
+weight = 60
++++
+
+
+For information on setting up monitoring, see our [dedicated page](@/documentation/cookbook/monitoring.md) in the Cookbook section.
+
+## List of exported metrics
+
+### Garage system metrics
+
+#### `garage_build_info` (counter)
+
+Exposes the Garage version number running on a node.
+
+```
+garage_build_info{version="1.0"} 1
+```
+
+#### `garage_replication_factor` (counter)
+
+Exposes the Garage replication factor configured on the node.
+
+```
+garage_replication_factor 3
+```
+
+### Metrics of the API endpoints
+
+#### `api_admin_request_counter` (counter)
+
+Counts the number of requests to a given endpoint of the administration API. Example:
+
+```
+api_admin_request_counter{api_endpoint="Metrics"} 127041
+```
+
+#### `api_admin_request_duration` (histogram)
+
+Evaluates the duration of API calls to the various administration API endpoint. Example:
+
+```
+api_admin_request_duration_bucket{api_endpoint="Metrics",le="0.5"} 127041
+api_admin_request_duration_sum{api_endpoint="Metrics"} 605.250344830999
+api_admin_request_duration_count{api_endpoint="Metrics"} 127041
+```
+
+#### `api_s3_request_counter` (counter)
+
+Counts the number of requests to a given endpoint of the S3 API. Example:
+
+```
+api_s3_request_counter{api_endpoint="CreateMultipartUpload"} 1
+```
+
+#### `api_s3_error_counter` (counter)
+
+Counts the number of requests to a given endpoint of the S3 API that returned an error. Example:
+
+```
+api_s3_error_counter{api_endpoint="GetObject",status_code="404"} 39
+```
+
+#### `api_s3_request_duration` (histogram)
+
+Evaluates the duration of API calls to the various S3 API endpoints. Example:
+
+```
+api_s3_request_duration_bucket{api_endpoint="CreateMultipartUpload",le="0.5"} 1
+api_s3_request_duration_sum{api_endpoint="CreateMultipartUpload"} 0.046340762
+api_s3_request_duration_count{api_endpoint="CreateMultipartUpload"} 1
+```
+
+#### `api_k2v_request_counter` (counter), `api_k2v_error_counter` (counter), `api_k2v_error_duration` (histogram)
+
+Same as for S3, for the K2V API.
+
+
+### Metrics of the Web endpoint
+
+
+#### `web_request_counter` (counter)
+
+Number of requests to the web endpoint
+
+```
+web_request_counter{method="GET"} 80
+```
+
+#### `web_request_duration` (histogram)
+
+Duration of requests to the web endpoint
+
+```
+web_request_duration_bucket{method="GET",le="0.5"} 80
+web_request_duration_sum{method="GET"} 1.0528433229999998
+web_request_duration_count{method="GET"} 80
+```
+
+#### `web_error_counter` (counter)
+
+Number of requests to the web endpoint resulting in errors
+
+```
+web_error_counter{method="GET",status_code="404 Not Found"} 64
+```
+
+
+### Metrics of the data block manager
+
+#### `block_bytes_read`, `block_bytes_written` (counter)
+
+Number of bytes read/written to/from disk in the data storage directory.
+
+```
+block_bytes_read 120586322022
+block_bytes_written 3386618077
+```
+
+#### `block_compression_level` (counter)
+
+Exposes the block compression level configured for the Garage node.
+
+```
+block_compression_level 3
+```
+
+#### `block_read_duration`, `block_write_duration` (histograms)
+
+Evaluates the duration of the reading/writing of individual data blocks in the data storage directory.
+
+```
+block_read_duration_bucket{le="0.5"} 169229
+block_read_duration_sum 2761.6902550310056
+block_read_duration_count 169240
+block_write_duration_bucket{le="0.5"} 3559
+block_write_duration_sum 195.59170078500006
+block_write_duration_count 3571
+```
+
+#### `block_delete_counter` (counter)
+
+Counts the number of data blocks that have been deleted from storage.
+
+```
+block_delete_counter 122
+```
+
+#### `block_resync_counter` (counter), `block_resync_duration` (histogram)
+
+Counts the number of resync operations the node has executed, and evaluates their duration.
+
+```
+block_resync_counter 308897
+block_resync_duration_bucket{le="0.5"} 308892
+block_resync_duration_sum 139.64204196100016
+block_resync_duration_count 308897
+```
+
+#### `block_resync_queue_length` (gauge)
+
+The number of block hashes currently queued for a resync.
+It is normal for this value to be nonzero for long periods of time.
+
+```
+block_resync_queue_length 0
+```
+
+#### `block_resync_errored_blocks` (gauge)
+
+The number of block hashes that we were unable to resync last time we tried.
+**THIS SHOULD BE ZERO, OR FALL BACK TO ZERO RAPIDLY, IN A HEALTHY CLUSTER.**
+Persistent nonzero values indicate that some data is likely to be lost.
+
+```
+block_resync_errored_blocks 0
+```
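+
+This makes the metric a natural candidate for alerting. A hypothetical PromQL
+alert expression, as a sketch (the ten-minute window is arbitrary):
+
+```
+# Fires if some blocks have been failing their resync
+# continuously for the last 10 minutes
+min_over_time(block_resync_errored_blocks[10m]) > 0
+```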
+
+
+### Metrics related to RPCs (remote procedure calls) between nodes
+
+#### `rpc_request_counter` (counter)
+
+Number of RPC requests emitted.
+
+```
+rpc_request_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 176
+```
+
+#### `rpc_netapp_error_counter` (counter)
+
+Number of communication errors (errors in the Netapp library, generally due to disconnected nodes).
+
+```
+rpc_netapp_error_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 354
+```
+
+#### `rpc_timeout_counter` (counter)
+
+Number of RPC timeouts; this should be close to zero in a healthy cluster.
+
+```
+rpc_timeout_counter{from="<this node>",rpc_endpoint="garage_rpc/membership.rs/SystemRpc",to="<remote node>"} 1
+```
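+
+An alert on this counter can help detect network instability between nodes. A
+hypothetical PromQL sketch (the one-hour window and threshold are arbitrary):
+
+```
+# Fires if more than 10 RPC timeouts occurred during the last hour
+sum(increase(rpc_timeout_counter[1h])) > 10
+```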
+
+#### `rpc_duration` (histogram)
+
+The duration of internal RPC calls between Garage nodes.
+
+```
+rpc_duration_bucket{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>",le="0.5"} 166
+rpc_duration_sum{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 35.172253716
+rpc_duration_count{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 174
+```
+
+
+### Metrics of the metadata table manager
+
+#### `table_gc_todo_queue_length` (gauge)
+
+Table garbage collector TODO queue length.
+
+```
+table_gc_todo_queue_length{table_name="block_ref"} 0
+```
+
+#### `table_get_request_counter` (counter), `table_get_request_duration` (histogram)
+
+Number of get/get_range requests internally made on each table, and their duration.
+
+```
+table_get_request_counter{table_name="bucket_alias"} 315
+table_get_request_duration_bucket{table_name="bucket_alias",le="0.5"} 315
+table_get_request_duration_sum{table_name="bucket_alias"} 0.048509778000000024
+table_get_request_duration_count{table_name="bucket_alias"} 315
+```
+
+
+#### `table_put_request_counter` (counter), `table_put_request_duration` (histogram)
+
+Number of insert/insert_many requests internally made on each table, and their duration.
+
+```
+table_put_request_counter{table_name="block_ref"} 677
+table_put_request_duration_bucket{table_name="block_ref",le="0.5"} 677
+table_put_request_duration_sum{table_name="block_ref"} 61.617528636
+table_put_request_duration_count{table_name="block_ref"} 677
+```
+
+#### `table_internal_delete_counter` (counter)
+
+Number of value deletions in the tree (due to GC or repartitioning).
+
+```
+table_internal_delete_counter{table_name="block_ref"} 2296
+```
+
+#### `table_internal_update_counter` (counter)
+
+Number of value updates where the value actually changes (includes creation of a new key and update of an existing key).
+
+```
+table_internal_update_counter{table_name="block_ref"} 5996
+```
+
+#### `table_merkle_updater_todo_queue_length` (gauge)
+
+Merkle tree updater TODO queue length (should fall to zero rapidly).
+
+```
+table_merkle_updater_todo_queue_length{table_name="block_ref"} 0
+```
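+
+Since this queue should drain quickly, a value that stays nonzero is worth
+alerting on. A hypothetical PromQL sketch (the ten-minute window is arbitrary):
+
+```
+# Fires if the Merkle updater queue of some table has not
+# drained for the last 10 minutes
+min_over_time(table_merkle_updater_todo_queue_length[10m]) > 0
+```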
+
+#### `table_sync_items_received`, `table_sync_items_sent` (counters)
+
+Number of data items sent to/received from other nodes during resync procedures.
+
+```
+table_sync_items_received{from="<remote node>",table_name="bucket_v2"} 3
+table_sync_items_sent{table_name="block_ref",to="<remote node>"} 2
+```
+
+
diff --git a/doc/book/reference-manual/s3-compatibility.md b/doc/book/reference-manual/s3-compatibility.md
index dd3492a0..15b29bd1 100644
--- a/doc/book/reference-manual/s3-compatibility.md
+++ b/doc/book/reference-manual/s3-compatibility.md
@@ -1,6 +1,6 @@
+++
title = "S3 Compatibility status"
-weight = 40
+weight = 70
+++
## DISCLAIMER
diff --git a/doc/book/working-documents/migration-08.md b/doc/book/working-documents/migration-08.md
index 5f97c45b..b7c4c783 100644
--- a/doc/book/working-documents/migration-08.md
+++ b/doc/book/working-documents/migration-08.md
@@ -12,13 +12,15 @@ back up all your data before attempting it!**
Garage v0.8 introduces new data tables that allow the counting of objects in buckets in order to implement bucket quotas.
A manual migration step is required to first count objects in Garage buckets and populate these tables with accurate data.
+## Simple migration procedure (takes the cluster offline for a while)
+
The migration steps are as follows:
1. Disable API and web access. Garage v0.7 does not support disabling
these endpoints but you can change the port number or stop your reverse proxy for instance.
2. Do `garage repair --all-nodes --yes tables` and `garage repair --all-nodes --yes blocks`,
check the logs and check that all data seems to be synced correctly between
- nodes. If you have time, do additional checks (`scrub`, `block_refs`, etc.)
+ nodes. If you have time, do additional checks (`versions`, `block_refs`, etc.)
3. Check that queues are empty: run `garage stats` to query them or inspect metrics in the Grafana dashboard.
4. Turn off Garage v0.7
5. **Backup the metadata folder of all your nodes!** For instance, use the following command
@@ -32,3 +34,24 @@ The migration steps are as follows:
10. Your upgraded cluster should be in a working state. Re-enable API and Web
access and check that everything went well.
11. Monitor your cluster in the next hours to see if it works well under your production load, report any issue.
+
+## Minimal downtime migration procedure
+
+The migration to Garage v0.8 can be done with almost no downtime,
+by restarting all nodes at once in the new version. The only limitation with this
+method is that bucket sizes and item counts will not be estimated correctly
+until all nodes have had a chance to run their offline migration procedure.
+
+The migration steps are as follows:
+
+1. Do `garage repair --all-nodes --yes tables` and `garage repair --all-nodes --yes blocks`,
+ check the logs and check that all data seems to be synced correctly between
+ nodes. If you have time, do additional checks (`versions`, `block_refs`, etc.)
+
+2. Turn off each node individually; back up its metadata folder (see above); turn it back on again. This allows you to take a backup of all nodes without impacting global cluster availability. You can even do all nodes of a single zone at once, as taking down a single zone does not impact the availability of Garage.
+
+3. Prepare your binaries and configuration files for Garage v0.8
+
+4. Shut down all v0.7 nodes simultaneously, and restart them all simultaneously in v0.8. Use your favorite deployment tool (Ansible, Kubernetes, Nomad) to achieve this as fast as possible; a minimal shell sketch is given after this list.
+
+5. At this point, Garage will indicate invalid values for the size and number of objects in each bucket (most likely, it will indicate zero). To fix this, take each node offline individually to do the offline migration step: `garage offline-repair --yes object_counters`. Again, you can do all nodes of a single zone at once.
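+
+As an illustration of step 4, here is a minimal shell sketch of a simultaneous
+restart. It assumes each node runs Garage as a systemd service named `garage`
+and is reachable over SSH; the hostnames are placeholders for your own:
+
+```
+#!/bin/sh
+# Restart Garage on all nodes in parallel (hypothetical hostnames)
+for host in garage-node-1 garage-node-2 garage-node-3; do
+    ssh "$host" 'sudo systemctl restart garage' &
+done
+wait  # block until every restart command has returned
+```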