aboutsummaryrefslogtreecommitdiff
path: root/content/blog
diff options
context:
space:
mode:
Diffstat (limited to 'content/blog')
-rw-r--r--content/blog/2022-perf/index.md14
1 files changed, 7 insertions, 7 deletions
diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md
index e8f10a3..599afda 100644
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@@ -11,23 +11,23 @@ date=2022-09-26
---
## ⚠️ Disclaimer
-The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them in this section, some limitations might be missing.
+The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as exhaustively as possible in this section, but others limitation might exist.
-Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster.
+Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. Furthermore, we only benchmarked some very specific aspects of Garage, on which we are currently working on: our results are thus not an overview of the whole software performances.
For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage, which can lead to underrated performance measurements for Minio.
It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such feature have necessarily a cost that you must keep in mind when reading plots. You should consider Minio results as a way to contextualize our results, to check that our improvements are not artificials compared to existing object storage implementations.
Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative.
-When reading this post, please keep in mind that **we are not making any business or technical recommendation here, this is not a scientific paper either**; we only share bits of our development process.
+When reading this post, please keep in mind that **we are not making any business or technical recommendation here, this is not a scientific paper either**; we only share bits of our development process as honestly as possible.
Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), make your own tests if you need to take a decision, and remain supportive and caring with your peers...
## About our testing environment
We started a batch of tests on [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible testbed for experiment-driven research in all areas of computer science, under the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. During our tests, we used part of the following clusters: [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a geo-distributed topology. We used the Grid5000 testbed only during our preliminary tests to identify issues when running Garage on many powerful servers, issues that we then reproduced in a controlled environment; don't be surprised then if Grid5000 is not mentioned often on our plots.
-To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduce to `2` and `1` respectively as, otherwise, the system tends to freeze when it is under heavy I/O load.
+To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load.
## Efficient I/O
@@ -43,7 +43,7 @@ We wanted to see if this theory helds in practise: we simulated a low latency bu
As planned, Garage v0.7 that does not support block streaming features TTFB between 1.6s and 2s, which correspond to the computed time to transfer the full block. On the other side of the plot, we see Garage v0.8 with very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our 2 implementations: we suppose that it does some form of batching, but on less than 1MB.
-**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted that logic we had to better control the resource usage and detect errors (semaphore, timeouts) were artificially limiting performances: we made them less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged yet, just to assess the impact of `fsync`, we refer to it as `no-fsync` in the following plot.
+**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted that logic we had to better control the resource usage and detect errors (semaphore, timeouts) were artificially limiting performances: we made them less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot.
To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload.
@@ -52,7 +52,7 @@ To assess performance improvements, we used the benchmark tool [minio/warp](http
Minio, our ground truth, features the best performances in this test. Considering Garage, we observe that each improvement we made has a visible impact on its performances. We also note that we have a progress margin in term of performances compared to Minio: additional benchmarks, test and monitoring could help better understand the remaining difference.
-## Million of objects
+## A myriad of objects
**Testing metadata engines** - With Garage, we chose to not store metadata directly on the filesystem, like Minio for example, but in an on-disk fancy B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: lmdb and sqlite. Keep in mind that they are both experimental: contrarily to sled, we have never run them in production.
@@ -66,7 +66,7 @@ To
![](1million.png)
-## Topology versatility
+## An unpredictable world
- low bandwidth