Diffstat (limited to 'doc/book/reference-manual')
-rw-r--r-- | doc/book/reference-manual/admin-api.md | 2
-rw-r--r-- | doc/book/reference-manual/configuration.md | 44
-rw-r--r-- | doc/book/reference-manual/k2v.md | 4
-rw-r--r-- | doc/book/reference-manual/monitoring.md | 285
-rw-r--r-- | doc/book/reference-manual/s3-compatibility.md | 2
5 files changed, 317 insertions, 20 deletions
diff --git a/doc/book/reference-manual/admin-api.md b/doc/book/reference-manual/admin-api.md
index 0b7e2e16..363bc886 100644
--- a/doc/book/reference-manual/admin-api.md
+++ b/doc/book/reference-manual/admin-api.md
@@ -1,6 +1,6 @@
 +++
 title = "Administration API"
-weight = 60
+weight = 40
 +++
 
 The Garage administration API is accessible through a dedicated server whose
diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md
index 2d9c3f0c..38062bab 100644
--- a/doc/book/reference-manual/configuration.md
+++ b/doc/book/reference-manual/configuration.md
@@ -3,6 +3,8 @@ title = "Configuration file format"
 weight = 20
 +++
 
+## Full example
+
 Here is an example `garage.toml` configuration file that illustrates all of the possible options:
 
 ```toml
@@ -96,7 +98,7 @@ Performance characteristics of the different DB engines are as follows:
 
 - Sled: the default database engine, which tends to produce
   large data files and also has performance issues, especially when the metadata folder
-  is on a traditionnal HDD and not on SSD.
+  is on a traditional HDD and not on SSD.
 - LMDB: the recommended alternative on 64-bit systems,
   much more space-efficiant and slightly faster. Note that the data format of LMDB is not portable
   between architectures, so for instance the Garage database of an x86-64
@@ -259,13 +261,17 @@ Compression is done synchronously, setting a value too high will add latency to
 This value can be different between nodes, compression is done by the node which receive the
 API call.
 
-### `rpc_secret`
+### `rpc_secret`, `rpc_secret_file` or `GARAGE_RPC_SECRET` (env)
+
+Garage uses a secret key, called an RPC secret, that is shared between all
+nodes of the cluster in order to identify these nodes and allow them to
+communicate together. The RPC secret is a 32-byte hex-encoded random string,
+which can be generated with a command such as `openssl rand -hex 32`.
 
-Garage uses a secret key that is shared between all nodes of the cluster
-in order to identify these nodes and allow them to communicate together.
-This key should be specified here in the form of a 32-byte hex-encoded
-random string. Such a string can be generated with a command
-such as `openssl rand -hex 32`.
+The RPC secret should be specified in the `rpc_secret` configuration variable.
+Since Garage `v0.8.2`, the RPC secret can also be stored in a file whose path is
+given in the configuration variable `rpc_secret_file`, or specified as an
+environment variable `GARAGE_RPC_SECRET`.
 
 ### `rpc_bind_addr`
 
@@ -407,24 +413,30 @@ If specified, Garage will bind an HTTP server to this port and address, on
 which it will listen to requests for administration features.
 See [administration API reference](@/documentation/reference-manual/admin-api.md) to learn more about these features.
 
-### `metrics_token` (since version 0.7.2)
+### `metrics_token`, `metrics_token_file` or `GARAGE_METRICS_TOKEN` (env)
 
-The token for accessing the Metrics endpoint. If this token is not set in
-the config file, the Metrics endpoint can be accessed without access
-control.
+The token for accessing the Metrics endpoint. If this token is not set, the
+Metrics endpoint can be accessed without access control.
 
 You can use any random string for this value. We recommend generating a random token with `openssl rand -hex 32`.
 
-### `admin_token` (since version 0.7.2)
+`metrics_token` was introduced in Garage `v0.7.2`.
+`metrics_token_file` and the `GARAGE_METRICS_TOKEN` environment variable are supported since Garage `v0.8.2`.
+
+
+### `admin_token`, `admin_token_file` or `GARAGE_ADMIN_TOKEN` (env)
 
 The token for accessing all of the other administration endpoints. If this
-token is not set in the config file, access to these endpoints is disabled
-entirely.
+token is not set, access to these endpoints is disabled entirely.
 
 You can use any random string for this value. We recommend generating a random token with `openssl rand -hex 32`.
 
+`admin_token` was introduced in Garage `v0.7.2`.
+`admin_token_file` and the `GARAGE_ADMIN_TOKEN` environment variable are supported since Garage `v0.8.2`.
+
+
 ### `trace_sink`
 
-Optionnally, the address of an Opentelemetry collector. If specified,
-Garage will send traces in the Opentelemetry format to this endpoint. These
+Optionally, the address of an OpenTelemetry collector. If specified,
+Garage will send traces in the OpenTelemetry format to this endpoint. These
 trace allow to inspect Garage's operation when it handles S3 API requests.
diff --git a/doc/book/reference-manual/k2v.md b/doc/book/reference-manual/k2v.md
index 207d056a..ed069b27 100644
--- a/doc/book/reference-manual/k2v.md
+++ b/doc/book/reference-manual/k2v.md
@@ -1,6 +1,6 @@
 +++
 title = "K2V"
-weight = 70
+weight = 100
 +++
 
 Starting with version 0.7.2, Garage introduces an optionnal feature, K2V,
@@ -16,7 +16,7 @@ the `k2v` feature flag enabled can be obtained from our download page under
 with `-k2v` (example: `v0.7.2-k2v`).
 
 The specification of the K2V API can be found
-[here](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/k2v/doc/drafts/k2v-spec.md).
+[here](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/drafts/k2v-spec.md).
 This document also includes a high-level overview of K2V's design.
 
 The K2V API uses AWSv4 signatures for authentification, same as the S3 API.
diff --git a/doc/book/reference-manual/monitoring.md b/doc/book/reference-manual/monitoring.md
new file mode 100644
index 00000000..97c533d3
--- /dev/null
+++ b/doc/book/reference-manual/monitoring.md
@@ -0,0 +1,285 @@
+
++++
+title = "Monitoring"
+weight = 60
++++
+
+
+For information on setting up monitoring, see our [dedicated page](@/documentation/cookbook/monitoring.md) in the Cookbook section.
+
+## List of exported metrics
+
+### Garage system metrics
+
+#### `garage_build_info` (counter)
+
+Exposes the Garage version number running on a node.
+
+```
+garage_build_info{version="1.0"} 1
+```
+
+#### `garage_replication_factor` (counter)
+
+Exposes the Garage replication factor configured on the node
+
+```
+garage_replication_factor 3
+```
+
+### Metrics of the API endpoints
+
+#### `api_admin_request_counter` (counter)
+
+Counts the number of requests to a given endpoint of the administration API. Example:
+
+```
+api_admin_request_counter{api_endpoint="Metrics"} 127041
+```
+
+#### `api_admin_request_duration` (histogram)
+
+Evaluates the duration of API calls to the various administration API endpoint. Example:
+
+```
+api_admin_request_duration_bucket{api_endpoint="Metrics",le="0.5"} 127041
+api_admin_request_duration_sum{api_endpoint="Metrics"} 605.250344830999
+api_admin_request_duration_count{api_endpoint="Metrics"} 127041
+```
+
+#### `api_s3_request_counter` (counter)
+
+Counts the number of requests to a given endpoint of the S3 API. Example:
+
+```
+api_s3_request_counter{api_endpoint="CreateMultipartUpload"} 1
+```
+
+#### `api_s3_error_counter` (counter)
+
+Counts the number of requests to a given endpoint of the S3 API that returned an error. Example:
+
+```
+api_s3_error_counter{api_endpoint="GetObject",status_code="404"} 39
+```
+
+#### `api_s3_request_duration` (histogram)
+
+Evaluates the duration of API calls to the various S3 API endpoints. Example:
+
+```
+api_s3_request_duration_bucket{api_endpoint="CreateMultipartUpload",le="0.5"} 1
+api_s3_request_duration_sum{api_endpoint="CreateMultipartUpload"} 0.046340762
+api_s3_request_duration_count{api_endpoint="CreateMultipartUpload"} 1
+```
+
+#### `api_k2v_request_counter` (counter), `api_k2v_error_counter` (counter), `api_k2v_error_duration` (histogram)
+
+Same as for S3, for the K2V API.
+
+
+### Metrics of the Web endpoint
+
+
+#### `web_request_counter` (counter)
+
+Number of requests to the web endpoint
+
+```
+web_request_counter{method="GET"} 80
+```
+
+#### `web_request_duration` (histogram)
+
+Duration of requests to the web endpoint
+
+```
+web_request_duration_bucket{method="GET",le="0.5"} 80
+web_request_duration_sum{method="GET"} 1.0528433229999998
+web_request_duration_count{method="GET"} 80
+```
+
+#### `web_error_counter` (counter)
+
+Number of requests to the web endpoint resulting in errors
+
+```
+web_error_counter{method="GET",status_code="404 Not Found"} 64
+```
+
+
+### Metrics of the data block manager
+
+#### `block_bytes_read`, `block_bytes_written` (counter)
+
+Number of bytes read/written to/from disk in the data storage directory.
+
+```
+block_bytes_read 120586322022
+block_bytes_written 3386618077
+```
+
+#### `block_compression_level` (counter)
+
+Exposes the block compression level configured for the Garage node.
+
+```
+block_compression_level 3
+```
+
+#### `block_read_duration`, `block_write_duration` (histograms)
+
+Evaluates the duration of the reading/writing of individual data blocks in the data storage directory.
+
+```
+block_read_duration_bucket{le="0.5"} 169229
+block_read_duration_sum 2761.6902550310056
+block_read_duration_count 169240
+block_write_duration_bucket{le="0.5"} 3559
+block_write_duration_sum 195.59170078500006
+block_write_duration_count 3571
+```
+
+#### `block_delete_counter` (counter)
+
+Counts the number of data blocks that have been deleted from storage.
+
+```
+block_delete_counter 122
+```
+
+#### `block_resync_counter` (counter), `block_resync_duration` (histogram)
+
+Counts the number of resync operations the node has executed, and evaluates their duration.
+
+```
+block_resync_counter 308897
+block_resync_duration_bucket{le="0.5"} 308892
+block_resync_duration_sum 139.64204196100016
+block_resync_duration_count 308897
+```
+
+#### `block_resync_queue_length` (gauge)
+
+The number of block hashes currently queued for a resync.
+This is normal to be nonzero for long periods of time.
+
+```
+block_resync_queue_length 0
+```
+
+#### `block_resync_errored_blocks` (gauge)
+
+The number of block hashes that we were unable to resync last time we tried.
+**THIS SHOULD BE ZERO, OR FALL BACK TO ZERO RAPIDLY, IN A HEALTHY CLUSTER.**
+Persistent nonzero values indicate that some data is likely to be lost.
+
+```
+block_resync_errored_blocks 0
+```
+
+
+### Metrics related to RPCs (remote procedure calls) between nodes
+
+#### `rpc_netapp_request_counter` (counter)
+
+Number of RPC requests emitted
+
+```
+rpc_request_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 176
+```
+
+#### `rpc_netapp_error_counter` (counter)
+
+Number of communication errors (errors in the Netapp library, generally due to disconnected nodes)
+
+```
+rpc_netapp_error_counter{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 354
+```
+
+#### `rpc_timeout_counter` (counter)
+
+Number of RPC timeouts, should be close to zero in a healthy cluster.
+
+```
+rpc_timeout_counter{from="<this node>",rpc_endpoint="garage_rpc/membership.rs/SystemRpc",to="<remote node>"} 1
+```
+
+#### `rpc_duration` (histogram)
+
+The duration of internal RPC calls between Garage nodes.
+
+```
+rpc_duration_bucket{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>",le="0.5"} 166
+rpc_duration_sum{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 35.172253716
+rpc_duration_count{from="<this node>",rpc_endpoint="garage_block/manager.rs/Rpc",to="<remote node>"} 174
+```
+
+
+### Metrics of the metadata table manager
+
+#### `table_gc_todo_queue_length` (gauge)
+
+Table garbage collector TODO queue length
+
+```
+table_gc_todo_queue_length{table_name="block_ref"} 0
+```
+
+#### `table_get_request_counter` (counter), `table_get_request_duration` (histogram)
+
+Number of get/get_range requests internally made on each table, and their duration.
+
+```
+table_get_request_counter{table_name="bucket_alias"} 315
+table_get_request_duration_bucket{table_name="bucket_alias",le="0.5"} 315
+table_get_request_duration_sum{table_name="bucket_alias"} 0.048509778000000024
+table_get_request_duration_count{table_name="bucket_alias"} 315
+```
+
+
+#### `table_put_request_counter` (counter), `table_put_request_duration` (histogram)
+
+Number of insert/insert_many requests internally made on this table, and their duration
+
+```
+table_put_request_counter{table_name="block_ref"} 677
+table_put_request_duration_bucket{table_name="block_ref",le="0.5"} 677
+table_put_request_duration_sum{table_name="block_ref"} 61.617528636
+table_put_request_duration_count{table_name="block_ref"} 677
+```
+
+#### `table_internal_delete_counter` (counter)
+
+Number of value deletions in the tree (due to GC or repartitioning)
+
+```
+table_internal_delete_counter{table_name="block_ref"} 2296
+```
+
+#### `table_internal_update_counter` (counter)
+
+Number of value updates where the value actually changes (includes creation of new key and update of existing key)
+
+```
+table_internal_update_counter{table_name="block_ref"} 5996
+```
+
+#### `table_merkle_updater_todo_queue_length` (gauge)
+
+Merkle tree updater TODO queue length (should fall to zero rapidly)
+
+```
+table_merkle_updater_todo_queue_length{table_name="block_ref"} 0
+```
+
+#### `table_sync_items_received`, `table_sync_items_sent` (counters)
+
+Number of data items sent to/recieved from other nodes during resync procedures
+
+```
+table_sync_items_received{from="<remote node>",table_name="bucket_v2"} 3
+table_sync_items_sent{table_name="block_ref",to="<remote node>"} 2
+```
+
+
diff --git a/doc/book/reference-manual/s3-compatibility.md b/doc/book/reference-manual/s3-compatibility.md
index dd3492a0..15b29bd1 100644
--- a/doc/book/reference-manual/s3-compatibility.md
+++ b/doc/book/reference-manual/s3-compatibility.md
@@ -1,6 +1,6 @@
 +++
 title = "S3 Compatibility status"
-weight = 40
+weight = 70
 +++
 
 ## DISCLAIMER
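
Note on the histogram metrics documented in the new `monitoring.md`: each histogram is exported as `_bucket`, `_sum` and `_count` series, so a mean duration can be derived as `_sum / _count`. The following is a minimal Python sketch of that calculation, not part of this commit. The helper names `parse_samples` and `mean_duration` are hypothetical, the sample text is copied from the examples in the diff, and the parser is a simplification that ignores comment lines and does not handle optional timestamps.

```python
import re

# Sample lines copied from the examples added in monitoring.md.
SAMPLE = """\
api_admin_request_duration_bucket{api_endpoint="Metrics",le="0.5"} 127041
api_admin_request_duration_sum{api_endpoint="Metrics"} 605.250344830999
api_admin_request_duration_count{api_endpoint="Metrics"} 127041
block_resync_errored_blocks 0
"""

# Metric name, optional {labels} block, then the sample value.
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)$')


def parse_samples(text):
    """Map (metric name, raw label string) to a float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, labels, value = match.groups()
        samples[(name, labels or "")] = float(value)
    return samples


def mean_duration(samples, metric, labels=""):
    """Mean of a histogram metric, computed as <metric>_sum / <metric>_count."""
    total = samples[(metric + "_sum", labels)]
    count = samples[(metric + "_count", labels)]
    return total / count if count else 0.0


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    mean = mean_duration(
        parsed, "api_admin_request_duration", '{api_endpoint="Metrics"}'
    )
    print(f"mean admin Metrics request duration: {mean:.6f}")
    print("errored blocks:", parsed[("block_resync_errored_blocks", "")])
```

The unit of the result is whatever unit the histogram is recorded in; the diff does not state it. In a real deployment the same ratio would more likely be computed by the monitoring system itself rather than by a script.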