From d574b395ffb0c5b4e672dfef3b47971b4ec82470 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Mon, 26 Sep 2022 11:12:46 +0200 Subject: Skeleton of our perf article --- content/blog/2022-perf/1million.png | Bin 0 -> 250100 bytes content/blog/2022-perf/amplification.png | Bin 0 -> 145685 bytes content/blog/2022-perf/complexity.png | Bin 0 -> 198406 bytes content/blog/2022-perf/db_engine.png | Bin 0 -> 170092 bytes content/blog/2022-perf/index.md | 75 +++++++++++++++++++++++++++++++ content/blog/2022-perf/io.png | Bin 0 -> 193869 bytes content/blog/2022-perf/ttfb.png | Bin 0 -> 131548 bytes 7 files changed, 75 insertions(+) create mode 100644 content/blog/2022-perf/1million.png create mode 100644 content/blog/2022-perf/amplification.png create mode 100644 content/blog/2022-perf/complexity.png create mode 100644 content/blog/2022-perf/db_engine.png create mode 100644 content/blog/2022-perf/index.md create mode 100644 content/blog/2022-perf/io.png create mode 100644 content/blog/2022-perf/ttfb.png (limited to 'content/blog') diff --git a/content/blog/2022-perf/1million.png b/content/blog/2022-perf/1million.png new file mode 100644 index 0000000..c7ca528 Binary files /dev/null and b/content/blog/2022-perf/1million.png differ diff --git a/content/blog/2022-perf/amplification.png b/content/blog/2022-perf/amplification.png new file mode 100644 index 0000000..8962b65 Binary files /dev/null and b/content/blog/2022-perf/amplification.png differ diff --git a/content/blog/2022-perf/complexity.png b/content/blog/2022-perf/complexity.png new file mode 100644 index 0000000..a5cf631 Binary files /dev/null and b/content/blog/2022-perf/complexity.png differ diff --git a/content/blog/2022-perf/db_engine.png b/content/blog/2022-perf/db_engine.png new file mode 100644 index 0000000..0f22d6d Binary files /dev/null and b/content/blog/2022-perf/db_engine.png differ diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md new file mode 100644 index 0000000..4abcf8d --- /dev/null +++ b/content/blog/2022-perf/index.md @@ -0,0 +1,75 @@ ++++ +title="Bringing theoretical design and real word performances face to face" +date=2022-09-26 ++++ + + +*We* + + + +--- + +## ⚠️ Disclaimer +The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them in the following. + +Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. + +For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage. +It must also be noted that Gare and Minio are systems with different feature set, *eg.* Minio supports erasure coding for better data density while Garage doesn't. + +Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, etc.), some of these configurations could favor one configuration/software over another. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative. + +In any case, we are not making a business or technical recommendation, we only share bits of our development process. +Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html) and make your own tests if you need to take a decision! + +## About our testing environment + +- Grid 5k. + +- My own computer. + +## Efficient I/O + +- streaming + +![](ttfb.png) + +- fsync, semaphore, timeouts, etc. + +![](io.png) + +## Million of objects + +- metadata engine + +![](db_engine.png) + +- storing metadata at scale + +![](1million.png) + +## Topology versatility + +- low bandwidth + +![]() + +- high network latency. phenomenon we name amplification + +![](amplification.png) + +- complexity (constant time) + +![](complexity.png) + + +## Future work + +- srpt + +- better analysis of the fsync / data reliability impact + +- analysis and comparison of Garage at scale + +- try to better understand ecosystem (riak cs, minio, ceph, swift) -> some knowledge to get diff --git a/content/blog/2022-perf/io.png b/content/blog/2022-perf/io.png new file mode 100644 index 0000000..f581a22 Binary files /dev/null and b/content/blog/2022-perf/io.png differ diff --git a/content/blog/2022-perf/ttfb.png b/content/blog/2022-perf/ttfb.png new file mode 100644 index 0000000..c0335bd Binary files /dev/null and b/content/blog/2022-perf/ttfb.png differ -- cgit v1.2.3 From be6aaffa0ccf175428a784d0ac0c6cf7470fcf77 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Mon, 26 Sep 2022 12:53:43 +0200 Subject: Introduce testing environment --- content/blog/2022-perf/index.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 4abcf8d..da78a76 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -4,30 +4,30 @@ date=2022-09-26 +++ -*We* +*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, being it on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance improvement to make Garage more versatile and better understand what we can expect from it.* --- ## ⚠️ Disclaimer -The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them in the following. +The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them in this section, some limitations might be missing. Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. -For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage. -It must also be noted that Gare and Minio are systems with different feature set, *eg.* Minio supports erasure coding for better data density while Garage doesn't. +For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage, which can lead to underrated performance measurements for Minio. +It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such feature have necessarily a cost that you must keep in mind when reading plots. -Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, etc.), some of these configurations could favor one configuration/software over another. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative. +Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative. -In any case, we are not making a business or technical recommendation, we only share bits of our development process. -Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html) and make your own tests if you need to take a decision! +When reading this post, please keep in mind that **we are not making any business or technical recommendation here**, we only share bits of our development process. +Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), make your own tests if you need to take a decision, and remain supportive and caring with your peers... ## About our testing environment -- Grid 5k. +We started a batch of tests on [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible testbed for experiment-driven research in all areas of computer science, under the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. During our tests, we used part of the following clusters: [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a geo-distributed topology. We used the Grid5000 testbed only during our preliminary tests to identify issues when running Garage on many powerful servers, issues that we then reproduced in a controlled environment; don't be surprised then if Grid5000 is not mentioned often on our plots. -- My own computer. +To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduce to `2` and `1` respectively as, otherwise, the system tends to freeze when it is under heavy I/O load. ## Efficient I/O @@ -73,3 +73,5 @@ Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html) a - analysis and comparison of Garage at scale - try to better understand ecosystem (riak cs, minio, ceph, swift) -> some knowledge to get + +[^1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know that it is also way more versatile. -- cgit v1.2.3 From 87e8780f9496f832cf9ae1cb42b61796ffa63e68 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Mon, 26 Sep 2022 16:36:58 +0200 Subject: Write TTFB --- content/blog/2022-perf/index.md | 19 ++++++++++++++----- content/blog/2022-perf/schema-streaming.png | Bin 0 -> 51925 bytes 2 files changed, 14 insertions(+), 5 deletions(-) create mode 100644 content/blog/2022-perf/schema-streaming.png (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index da78a76..3f597a8 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -1,10 +1,10 @@ +++ -title="Bringing theoretical design and real word performances face to face" +title="Bringing theoretical design and observed performances face to face" date=2022-09-26 +++ -*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, being it on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance improvement to make Garage more versatile and better understand what we can expect from it.* +*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, being it on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance measurement to see how Garage behaves compared to our expectations.* @@ -20,7 +20,7 @@ It must also be noted that Garage and Minio are systems with different feature s Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative. -When reading this post, please keep in mind that **we are not making any business or technical recommendation here**, we only share bits of our development process. +When reading this post, please keep in mind that **we are not making any business or technical recommendation here, this is not a scientific paper either**; we only share bits of our development process. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), make your own tests if you need to take a decision, and remain supportive and caring with your peers... ## About our testing environment @@ -31,10 +31,19 @@ To reproduce some environments locally, we have a small set of Python scripts na ## Efficient I/O -- streaming +**Time To First Byte** - One specificity of Garage is that we implemented S3 web endpoints, with the idea to make it the platform of choice to publish your static website. When publishing a website, one metric you observe is Time To First Byte (TTFB), as it will impact the perceived reactivity of your wbesite. On Garage, time to first byte was a bit high, especially for objects of 1MB and more. This is not surprising as, until now, the smallest level of granularity was the block level, which are set to at most 1MB by default. Hence, when you were sending a GET request, the block had to be fully retrieved by the gateway node from the storage node before starting sending any data to the client. With Garage v0.8, we integrated a block streaming logic which allows the gateway to send the beginning of a block without having to wait for the full block from the storage node. We can visually represent the difference as follow: -![](ttfb.png) +![A schema depicting how streaming improves the delivery of a block](schema-streaming.png) +As our default block size is only 1MB, the difference will be very small on fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, on a very slow network (or a very congested link with many parallel requests handled), the impact can be much more important: at 5Mbps, it takes 1.6 second to transfer our 1MB block, and streaming could heavily improve user experience. + +We wanted to see if this theory helds in practise: we simulated a low latency but slow network on mknet and did some request with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference. + +![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png) + +As planned, Garage v0.7 that does not support block streaming features TTFB between 1.6s and 2s, which correspond to the computed time to transfer the full block. On the other side of the plot, we see Garage v0.8 with very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our 2 implementations: we suppose that it does some form of batching, but on less than 1MB. + +**Read/write throughput** - - fsync, semaphore, timeouts, etc. ![](io.png) diff --git a/content/blog/2022-perf/schema-streaming.png b/content/blog/2022-perf/schema-streaming.png new file mode 100644 index 0000000..7d4c51c Binary files /dev/null and b/content/blog/2022-perf/schema-streaming.png differ -- cgit v1.2.3 From 373ccb9db08c21b5e49b09815068a7dd9fc1d112 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Mon, 26 Sep 2022 18:06:09 +0200 Subject: Finished I/O section, start Metadata Engine --- content/blog/2022-perf/1million-both.png | Bin 0 -> 322381 bytes content/blog/2022-perf/index.md | 22 +++++++++++++++------- content/blog/2022-perf/schema-streaming.png | Bin 51925 -> 50437 bytes 3 files changed, 15 insertions(+), 7 deletions(-) create mode 100644 content/blog/2022-perf/1million-both.png (limited to 'content/blog') diff --git a/content/blog/2022-perf/1million-both.png b/content/blog/2022-perf/1million-both.png new file mode 100644 index 0000000..33b9968 Binary files /dev/null and b/content/blog/2022-perf/1million-both.png differ diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 3f597a8..e8f10a3 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -16,7 +16,7 @@ The following results must be taken with a critical grain of salt due to some li Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage, which can lead to underrated performance measurements for Minio. -It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such feature have necessarily a cost that you must keep in mind when reading plots. +It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such feature have necessarily a cost that you must keep in mind when reading plots. You should consider Minio results as a way to contextualize our results, to check that our improvements are not artificials compared to existing object storage implementations. Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative. @@ -43,18 +43,26 @@ We wanted to see if this theory helds in practise: we simulated a low latency bu As planned, Garage v0.7 that does not support block streaming features TTFB between 1.6s and 2s, which correspond to the computed time to transfer the full block. On the other side of the plot, we see Garage v0.8 with very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our 2 implementations: we suppose that it does some form of batching, but on less than 1MB. -**Read/write throughput** - -- fsync, semaphore, timeouts, etc. +**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted that logic we had to better control the resource usage and detect errors (semaphore, timeouts) were artificially limiting performances: we made them less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged yet, just to assess the impact of `fsync`, we refer to it as `no-fsync` in the following plot. -![](io.png) +To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload. + +![Plot showing IO perf of Garage configs and Minio](io.png) + +Minio, our ground truth, features the best performances in this test. Considering Garage, we observe that each improvement we made has a visible impact on its performances. We also note that we have a progress margin in term of performances compared to Minio: additional benchmarks, test and monitoring could help better understand the remaining difference. + ## Million of objects -- metadata engine +**Testing metadata engines** - With Garage, we chose to not store metadata directly on the filesystem, like Minio for example, but in an on-disk fancy B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: lmdb and sqlite. Keep in mind that they are both experimental: contrarily to sled, we have never run them in production. + +To + +![Plot of our metadata engines comparison with Warp](db_engine.png) -![](db_engine.png) +**Storing million of objects** - -- storing metadata at scale +![](1million-both.png) ![](1million.png) diff --git a/content/blog/2022-perf/schema-streaming.png b/content/blog/2022-perf/schema-streaming.png index 7d4c51c..f006484 100644 Binary files a/content/blog/2022-perf/schema-streaming.png and b/content/blog/2022-perf/schema-streaming.png differ -- cgit v1.2.3 From da075684e8f75fc5e933f0a7d89fc8efde648ed9 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Mon, 26 Sep 2022 18:22:49 +0200 Subject: Proof reading existing text --- content/blog/2022-perf/index.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index e8f10a3..599afda 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -11,23 +11,23 @@ date=2022-09-26 --- ## ⚠️ Disclaimer -The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them in this section, some limitations might be missing. +The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as exhaustively as possible in this section, but others limitation might exist. -Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. +Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. Furthermore, we only benchmarked some very specific aspects of Garage, on which we are currently working on: our results are thus not an overview of the whole software performances. For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage, which can lead to underrated performance measurements for Minio. It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such feature have necessarily a cost that you must keep in mind when reading plots. You should consider Minio results as a way to contextualize our results, to check that our improvements are not artificials compared to existing object storage implementations. Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative. -When reading this post, please keep in mind that **we are not making any business or technical recommendation here, this is not a scientific paper either**; we only share bits of our development process. +When reading this post, please keep in mind that **we are not making any business or technical recommendation here, this is not a scientific paper either**; we only share bits of our development process as honestly as possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), make your own tests if you need to take a decision, and remain supportive and caring with your peers... ## About our testing environment We started a batch of tests on [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible testbed for experiment-driven research in all areas of computer science, under the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. During our tests, we used part of the following clusters: [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a geo-distributed topology. We used the Grid5000 testbed only during our preliminary tests to identify issues when running Garage on many powerful servers, issues that we then reproduced in a controlled environment; don't be surprised then if Grid5000 is not mentioned often on our plots. -To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduce to `2` and `1` respectively as, otherwise, the system tends to freeze when it is under heavy I/O load. +To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load. ## Efficient I/O @@ -43,7 +43,7 @@ We wanted to see if this theory helds in practise: we simulated a low latency bu As planned, Garage v0.7 that does not support block streaming features TTFB between 1.6s and 2s, which correspond to the computed time to transfer the full block. On the other side of the plot, we see Garage v0.8 with very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our 2 implementations: we suppose that it does some form of batching, but on less than 1MB. -**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted that logic we had to better control the resource usage and detect errors (semaphore, timeouts) were artificially limiting performances: we made them less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged yet, just to assess the impact of `fsync`, we refer to it as `no-fsync` in the following plot. +**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted that logic we had to better control the resource usage and detect errors (semaphore, timeouts) were artificially limiting performances: we made them less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload. @@ -52,7 +52,7 @@ To assess performance improvements, we used the benchmark tool [minio/warp](http Minio, our ground truth, features the best performances in this test. Considering Garage, we observe that each improvement we made has a visible impact on its performances. We also note that we have a progress margin in term of performances compared to Minio: additional benchmarks, test and monitoring could help better understand the remaining difference. -## Million of objects +## A myriad of objects **Testing metadata engines** - With Garage, we chose to not store metadata directly on the filesystem, like Minio for example, but in an on-disk fancy B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: lmdb and sqlite. Keep in mind that they are both experimental: contrarily to sled, we have never run them in production. @@ -66,7 +66,7 @@ To ![](1million.png) -## Topology versatility +## An unpredictable world - low bandwidth -- cgit v1.2.3 From 6d646201d1212fdb1cd034a63e7fe471e477b531 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Tue, 27 Sep 2022 12:09:56 +0200 Subject: Part about metadata engine --- content/blog/2022-perf/db_engine.png | Bin 170092 -> 181046 bytes content/blog/2022-perf/index.md | 21 +++++++++++++++++++-- 2 files changed, 19 insertions(+), 2 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/db_engine.png b/content/blog/2022-perf/db_engine.png index 0f22d6d..b1124b0 100644 Binary files a/content/blog/2022-perf/db_engine.png and b/content/blog/2022-perf/db_engine.png differ diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 599afda..44df617 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -45,6 +45,8 @@ As planned, Garage v0.7 that does not support block streaming features TTFB betw **Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted that logic we had to better control the resource usage and detect errors (semaphore, timeouts) were artificially limiting performances: we made them less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. +*A note about fsync: for performance reasons, OS often do not write directly to the disk when you create or update a file in your filesystem: your write will be kept in memory, and flush later in a batch with other writes. If a power loss occures before the OS has time to flush the writes on the disk, they will be lost. To ensure that a write is effectively written on disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it will block until your file or directory has been written from your volatile RAM memory to your persisting storage device. Additionaly, the exact semantic of fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even on battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that on Garage, we are currently working on our "fsync" policy and thus, you should expect limited data durability as we are aware of some inconsistency on this point (which we describe in the following).* + To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload. ![Plot showing IO perf of Garage configs and Minio](io.png) @@ -54,12 +56,27 @@ Minio, our ground truth, features the best performances in this test. Considerin ## A myriad of objects -**Testing metadata engines** - With Garage, we chose to not store metadata directly on the filesystem, like Minio for example, but in an on-disk fancy B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: lmdb and sqlite. Keep in mind that they are both experimental: contrarily to sled, we have never run them in production. +**Testing metadata engines** - With Garage, we chose to not store metadata directly on the filesystem, like Minio for example, but in an on-disk fancy B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: lmdb ([heed](https://github.com/meilisearch/heed)) and sqlite ([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they are both experimental: contrarily to sled, we have never run them in production for a long time.** + +Similarly to the impact of fsync on block writing, each database engine we use has its own policy with fsync. Sled flushes its write every 2 seconds by default, this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). lmdb by default does an `fsync` on each write, on early tests it lead to very slow resynchronizations between nodes. We added 2 flags: [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) and [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16) that basically deactivate fsync. On sqlite, it is also possible to deactivate fsync with `pragma synchronous = off;`, but we did not start any optimization work on it: our sqlite implementation fsync all the data on the disk. Additionaly, we are using these engine through a Rust binding that had to do some tradeoff on the concurrency part. **Our comparison will not reflect the raw performances of these database engine, but instead, our integration choices.** + +Still, we think it makes sense to evaluate our implementations in their current state in Garage. We designed a benchmark that is intensive on the metadata part of the software, ie. handling tiny files. We chose again minio/warp but we configure it with the smallest possible object size supported by warp, 256 bytes, to put some pressure on the metadata engine. We evaluate sled twice: with its default configuration, and with a configuration where we set a flush interval of 10 minutes to disable fsync. -To +*Note that S3 has not been designed for such small objects; a regular database, like Cassandra, would be more appropriate for such workloads. This test has only be designed to stress our metadata engine, it is not indicative of real world performances.* ![Plot of our metadata engines comparison with Warp](db_engine.png) +Unsurprinsingly, we observe abysall performances for sqlite, the engine we have the less tested and kept fsync for each write. +lmdb performs twice better than default sled and 60% better than no fsync sled in our benchmark. +Additionaly, and not depicted on these plots, LMDB uses way less disk storage and RAM; we would like to quantify that in the future. +As we are only at the very beginning of our work on metadata engine, it is hard to draw strong conclusions. +Still, we can say that sqlite is not ready for production workloads, +LMDB looks very promising both in term of performances and resource usage, +it is a very good candidate for Garage's default metadata engine in the future, +and we need to define a data policy for Garage that would help us arbitrate between performances and durability. + +*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occure, what is our policy in this case? For storage durability, we are already supposing that we never loose the storage of more than 2 nodes, should we also expect that we don't loose more than 2 nodes at the same time? What to think about people hosting all their nodes at the same place without an UPS?* + **Storing million of objects** - ![](1million-both.png) -- cgit v1.2.3 From af589aacd612745cde21ee040db44c7117c988c9 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Tue, 27 Sep 2022 15:35:35 +0200 Subject: Myriad of object part --- content/blog/2022-perf/index.md | 52 ++++++++++++++++++++++++++++++++++------- 1 file changed, 44 insertions(+), 8 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 44df617..5238c1a 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -11,7 +11,7 @@ date=2022-09-26 --- ## ⚠️ Disclaimer -The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as exhaustively as possible in this section, but others limitation might exist. +The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as exhaustively as possible in this section, but other limitations might exist. Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. Furthermore, we only benchmarked some very specific aspects of Garage, on which we are currently working on: our results are thus not an overview of the whole software performances. @@ -31,7 +31,11 @@ To reproduce some environments locally, we have a small set of Python scripts na ## Efficient I/O -**Time To First Byte** - One specificity of Garage is that we implemented S3 web endpoints, with the idea to make it the platform of choice to publish your static website. When publishing a website, one metric you observe is Time To First Byte (TTFB), as it will impact the perceived reactivity of your wbesite. On Garage, time to first byte was a bit high, especially for objects of 1MB and more. This is not surprising as, until now, the smallest level of granularity was the block level, which are set to at most 1MB by default. Hence, when you were sending a GET request, the block had to be fully retrieved by the gateway node from the storage node before starting sending any data to the client. With Garage v0.8, we integrated a block streaming logic which allows the gateway to send the beginning of a block without having to wait for the full block from the storage node. We can visually represent the difference as follow: +**Time To First Byte** - One specificity of Garage is that we implemented S3 web endpoints, with the idea to make it the platform of choice to publish your static website. When publishing a website, one metric you observe is Time To First Byte (TTFB), as it will impact the perceived reactivity of your website. On Garage, time to first byte was a bit high. + +This is not surprising as, until now, the smallest level of granularity was blocks. Blocks are 1MB chunks (this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)) of a given object. For example, a 4.5MB object will be split in 4 blocks of 1MB and 1 block of 0.5MB. With this design, when you were sending a GET request, the first block had to be fully retrieved by the gateway node from the storage node before starting sending any data to the client. + +With Garage v0.8, we integrated a block streaming logic which allows the gateway to send the beginning of a block without having to wait for the full block from the storage node. We can visually represent the difference as follow: ![A schema depicting how streaming improves the delivery of a block](schema-streaming.png) @@ -43,9 +47,9 @@ We wanted to see if this theory helds in practise: we simulated a low latency bu As planned, Garage v0.7 that does not support block streaming features TTFB between 1.6s and 2s, which correspond to the computed time to transfer the full block. On the other side of the plot, we see Garage v0.8 with very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our 2 implementations: we suppose that it does some form of batching, but on less than 1MB. -**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted that logic we had to better control the resource usage and detect errors (semaphore, timeouts) were artificially limiting performances: we made them less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. +**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted logic we wrote (to better control resources usage and detect errors, like semaphore or timeouts) were artificially limiting performances. In another iteration, we made this logic less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. -*A note about fsync: for performance reasons, OS often do not write directly to the disk when you create or update a file in your filesystem: your write will be kept in memory, and flush later in a batch with other writes. If a power loss occures before the OS has time to flush the writes on the disk, they will be lost. To ensure that a write is effectively written on disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it will block until your file or directory has been written from your volatile RAM memory to your persisting storage device. Additionaly, the exact semantic of fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even on battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that on Garage, we are currently working on our "fsync" policy and thus, you should expect limited data durability as we are aware of some inconsistency on this point (which we describe in the following).* +*A note about fsync: for performance reasons, operating systems often do not write directly to the disk when a process creates or updates a file in your filesystem, instead, the write is kept in memory, and flushed later in a batch with other writes. If a power loss occures before the OS has time to flush the writes on the disk, data will be lost. To ensure that a write is effectively written on disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it will block until your file or directory has been written from your volatile RAM memory to your persisting storage device. Additionaly, the exact semantic of fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even on battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that on Garage, we are currently working on our "fsync" policy and thus, for now, you should expect limited data durability in case of power loss, as we are aware of some inconsistency on this point (which we describe in the following and plan to solve).* To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload. @@ -77,13 +81,45 @@ and we need to define a data policy for Garage that would help us arbitrate betw *To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occure, what is our policy in this case? For storage durability, we are already supposing that we never loose the storage of more than 2 nodes, should we also expect that we don't loose more than 2 nodes at the same time? What to think about people hosting all their nodes at the same place without an UPS?* -**Storing million of objects** - +**Storing million of objects** - Object storage systems are designed not only for data durability and availability, but also for scalability. +Following this observation, some people asked us how scalable Garage is. If answering this question is out of scope of this study, we wanted to +be sure that our metadata engine would be able to scale to million of objects. To put this target in context, it remains small compared to other industrial solutions: +Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), which is 4 order of magnitude more than our current target. Of course, their benchmarking setup has nothing in common with ours, and their tests are way more exhaustive. + +We wrote our own [benchmarking tool](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion) for this test. It concurrently sends a defined number of very tiny object (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of time (128 by default) to effectively create a certain number of objects on the target cluster (1M by default). +On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. On the following plot, we show how many times it took to Garage and Minio to handle each batch. + +Before looking at the plot, **you must keep in mind some important points about Minio and Garage internals**. + +Minio has no metadata engine, it stores its objects directly on the filesystem. Sending 1 million objects on Minio results in creating one million inodes on the storage node in our current setup. So the performance of your filesystem will probably impact a lot the results you will observe; we know the filesystem we used is not adapted at all for Minio (encryption layer, fixed number of inodes, etc.). Additionaly, we mentioned earlier that we deactivated fsync for our metadata engine, minio might have some fsync logic here slowing down the creation of objects. Finally, object storage is designed for big objects: this cost is negligible with bigger objects. In the end, again, we use Minio as a reference to understand what are our performance budget for each part of our software. + +Conversely, Garage has a metadata engine with a special optimization for small objects. Below 3KB, a block is not created on the filesystem but the object is directly stored inline in the metadata engine. +In the future, we plan to evaluate how Garage behaves with 3KB+ objects at scale, probably way closer to Minio, as it will have to create an inode for each object. +For now, we limit ourselves at evaluating our metadata engine, and thus focus only on 16-byte objects. + +![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png) + +It appears that performances of our metadata engine are acceptables, as we have a comfortable margin compared to Minio (Minio is between 3x and 4x times slower per batch). +We also note that, past 200k objects, Minio batch completion time is constant as Garage's one remains linear: it could be interesting to know if Garage batch's completion time +would cross Minio's one for a very large number of objects. +If we reason per object, both Minio and Garage performances remains very good: it takes respectively around 20ms and 5ms to create an object. +At 100 Mbps, if you upload a 10MB file, the upload will take 800ms, for a 100MB files, it goes up to 8sec; in both cases handling the object metadata is only a fraction of the upload time. +The only cases where you could notice it would be if you upload lot of very small files at once, which again, is an unsual usage of the S3 API. + +Next, we focus on Garage's data only to better see its specific behavior: + +![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png) -![](1million-both.png) +Two effects are now more visibles: 1. batch completion time is linear with the number of objects in the bucket and 2. measurements are dispersed, at least more than Minio. +We discussed the first point previously but not the second one on measurement dispersion. +This instability could be an issue as it could be a symptom of what we saw with some other experiments in this machine: sometime it freezes under heavy I/O operations. +Such freezes could lead to request timeouts and failures. If it occures on our testing computer, it will occure on other servers too: it could be interesting to better understand this issue, +document how to avoid it or change how we handle our I/O. +At the same time, this was a very stressful test that will probably not be encountered in many setups: we were adding 273 object per seconds for 30 minutes! -![](1million.png) +As a conclusion, Garage can ingest 1 million tiny objects in 30 minutes in a very restricted environment. As a comparison, our production cluster at [deuxfleurs.fr](https://deuxfleurs) manages a bucket with 116k objects. This bucket contains real data as it is used by our Matrix instance to store people's media files (profile picture, shared pictures, videos, audios, documents...). Thanks to this benchmark, we have identified two points of vigilance: putting object duration seems linear with the number of existing objects in the cluster, and we have some volatility in our measured data that could be a symptom of our system freezing under the load. Despite these 2 points, we are confident that Garage could scale way above 1M+ objects, but it remains to be proved! -## An unpredictable world +## In an unpredictable world, stay resilient - low bandwidth -- cgit v1.2.3 From d99406fe633cdb753c25c1eb93c6ae5237923154 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Tue, 27 Sep 2022 16:20:00 +0200 Subject: Add latency amplification part --- content/blog/2022-perf/amplification.png | Bin 145685 -> 147625 bytes content/blog/2022-perf/index.md | 22 ++++++++++++++++------ 2 files changed, 16 insertions(+), 6 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/amplification.png b/content/blog/2022-perf/amplification.png index 8962b65..92eac3f 100644 Binary files a/content/blog/2022-perf/amplification.png and b/content/blog/2022-perf/amplification.png differ diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 5238c1a..eb54b3a 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -117,19 +117,29 @@ Such freezes could lead to request timeouts and failures. If it occures on our t document how to avoid it or change how we handle our I/O. At the same time, this was a very stressful test that will probably not be encountered in many setups: we were adding 273 object per seconds for 30 minutes! -As a conclusion, Garage can ingest 1 million tiny objects in 30 minutes in a very restricted environment. As a comparison, our production cluster at [deuxfleurs.fr](https://deuxfleurs) manages a bucket with 116k objects. This bucket contains real data as it is used by our Matrix instance to store people's media files (profile picture, shared pictures, videos, audios, documents...). Thanks to this benchmark, we have identified two points of vigilance: putting object duration seems linear with the number of existing objects in the cluster, and we have some volatility in our measured data that could be a symptom of our system freezing under the load. Despite these 2 points, we are confident that Garage could scale way above 1M+ objects, but it remains to be proved! +As a conclusion, Garage can ingest 1 million tiny objects in 30 minutes in a very restricted environment. As a comparison, our production cluster at [deuxfleurs.fr](https://deuxfleurs) manages a bucket with 116k objects. This bucket contains real data as it is used by our Matrix instance to store people's media files (profile picture, shared pictures, videos, audios, documents...). Thanks to this benchmark, we have identified two points of vigilance: putting object duration seems linear with the number of existing objects in the cluster, and we have some volatility in our measured data that could be a symptom of our system freezing under the load. Despite these two points, we are confident that Garage could scale way above 1M+ objects, but it remains to be proved! ## In an unpredictable world, stay resilient -- low bandwidth +**Latency amplification** - We designed Garage with low-tech geo-distributed setups in mind. For example, our production cluster is hosted [on old Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) behind consumer-grade fiber links across France and Belgium. With these kind of networks, the observed latency is in the 50ms range between nodes. -![]() +When latency is not negligible, you will observe that your requests completion time is a factor of your observed latency. That's expected: in many cases, the node of the cluster you are contacting can not directly answer your request, it needs to reach other nodes of the cluster to get your information. Each sequential request it does add to the final request duration, which can quickly become expensive. +This ratio between request duration and network latency is what we refer as *latency amplification*. -- high network latency. phenomenon we name amplification +For example, on Garage, a GetObject request does two sequential calls: first, it asks for the descriptor of the requested object containing the block list of the requested object, then it retrieves its blocks. +We can expect that the request duration of a small GetObject request will be close to twice the network latency. -![](amplification.png) +On the following graph, we test experimentally some standard endpoints, including GetObject: -- complexity (constant time) +![Latency amplification](amplification.png) + +As Garage has been optimized for this use case from the beginning, we don't see any significant evolution from one version to another (garage v0.7.3 and garage v0.8.0 beta here). +Compared to Minio, these values are either similar (for ListObjects and ListBuckets) or way better (for GetObject, PutObject, and RemoveObject). +It is understandable: Minio has not been designed for environment with high latencies, you are expected to build your clusters in the same datacenter, and then possibly connect them with their asynchronous [Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) feature. + +*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) but it is even more sensitive to latency: "Multi-site replication has increased latency sensitivity, as MinIO does not consider an object as replicated until it has synchronized to all configured remote targets. Replication latency is therefore dictated by the slowest link in the replication mesh."* + +**Node count amplification** - ![](complexity.png) -- cgit v1.2.3 From 26d396ba851d53eb5a3280644357ed696411f662 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Tue, 27 Sep 2022 17:42:08 +0200 Subject: Part unpredictable --- content/blog/2022-perf/index.md | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index eb54b3a..025bb3a 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -27,7 +27,7 @@ Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), We started a batch of tests on [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible testbed for experiment-driven research in all areas of computer science, under the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. During our tests, we used part of the following clusters: [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a geo-distributed topology. We used the Grid5000 testbed only during our preliminary tests to identify issues when running Garage on many powerful servers, issues that we then reproduced in a controlled environment; don't be surprised then if Grid5000 is not mentioned often on our plots. -To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load. +To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^ref1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load. ## Efficient I/O @@ -41,11 +41,11 @@ With Garage v0.8, we integrated a block streaming logic which allows the gateway As our default block size is only 1MB, the difference will be very small on fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, on a very slow network (or a very congested link with many parallel requests handled), the impact can be much more important: at 5Mbps, it takes 1.6 second to transfer our 1MB block, and streaming could heavily improve user experience. -We wanted to see if this theory helds in practise: we simulated a low latency but slow network on mknet and did some request with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference. +We wanted to see if this theory helds in practise: we simulated a low latency but slow network on mknet and did some request with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference. To benchmark this behavior, we wrote a small test named [s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb), its results are depicted on the following figure. ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png) -As planned, Garage v0.7 that does not support block streaming features TTFB between 1.6s and 2s, which correspond to the computed time to transfer the full block. On the other side of the plot, we see Garage v0.8 with very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our 2 implementations: we suppose that it does some form of batching, but on less than 1MB. +Garage v0.7, that does not support block streaming, features TTFB between 1.6s and 2s, which correspond to the theoretical time to transfer the full block. On the other side of the plot, Garage v0.8 has very low TTFB thanks to the streaming feature (the lowest value is 43 ms). Minio sits between the two Garage versions: we suppose that it does some form of batching, but smaller than 1MB. **Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted logic we wrote (to better control resources usage and detect errors, like semaphore or timeouts) were artificially limiting performances. In another iteration, we made this logic less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. @@ -86,7 +86,7 @@ Following this observation, some people asked us how scalable Garage is. If answ be sure that our metadata engine would be able to scale to million of objects. To put this target in context, it remains small compared to other industrial solutions: Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), which is 4 order of magnitude more than our current target. Of course, their benchmarking setup has nothing in common with ours, and their tests are way more exhaustive. -We wrote our own [benchmarking tool](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion) for this test. It concurrently sends a defined number of very tiny object (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of time (128 by default) to effectively create a certain number of objects on the target cluster (1M by default). +We wrote our own benchmarking tool [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2] for this test. It concurrently sends a defined number of very tiny object (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of time (128 by default) to effectively create a certain number of objects on the target cluster (1M by default). On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. On the following plot, we show how many times it took to Garage and Minio to handle each batch. Before looking at the plot, **you must keep in mind some important points about Minio and Garage internals**. @@ -129,7 +129,8 @@ This ratio between request duration and network latency is what we refer as *lat For example, on Garage, a GetObject request does two sequential calls: first, it asks for the descriptor of the requested object containing the block list of the requested object, then it retrieves its blocks. We can expect that the request duration of a small GetObject request will be close to twice the network latency. -On the following graph, we test experimentally some standard endpoints, including GetObject: +We tested this theory with another benchmark of our own named [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) that does a single request at a time on an endpoint and measure its response time. As we are not interested in bandwidth but latency, all our requests involving an object are made on a tiny file of around 16 bytes. Our benchmark tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and RemoveObject. Its results are plotted here: + ![Latency amplification](amplification.png) @@ -139,9 +140,16 @@ It is understandable: Minio has not been designed for environment with high late *Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) but it is even more sensitive to latency: "Multi-site replication has increased latency sensitivity, as MinIO does not consider an object as replicated until it has synchronized to all configured remote targets. Replication latency is therefore dictated by the slowest link in the replication mesh."* -**Node count amplification** - +**A cluster with many nodes** - Whether you already have many compute nodes with unused storage, need to store lot of data, or experiment with unusual system architecture, you might want to deploy hundredth of Garage nodes. However, on some distributed systems, the number of nodes in the cluster will impact performances. Theoretically, our protocol inspired by distributed hashtables (DHT) should scale fairly well but we never took the time to test it with hundredth of nodes before. + +This time, we did our test directly on Grid5000 with 6 physical servers spread in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up to 65 instances of Garage simultaneously (for a total of 390 nodes). The network between the physical server is the dedicated network provided by Grid5000 operators. Nodes on the same physical machine communicate directly through the Linux network stack without any limitation: we are aware this is a weakness of this test. We still think that this test can be relevant as, at each step in the test, each instance of Garage has 83% (5/6) of its connections that are made over a real network. To benchmark each cluster size, we used [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) again: + + +![Impact of response time with bigger clusters](complexity.png) -![](complexity.png) +Up to 250 nodes observed response times remain constant. After this threshold, results become very noisy. +By looking at the server resource usage, we saw that their load started to become non negligible: it seems that we are not hitting a limit at the protocol side but we have simply exhausted the ressource of our testing nodes. In the future, we would like to run this experiment again, but on way more physical nodes, to confirm our hypothesis. +For now, we are confident that a Garage cluster with 100+ nodes should definitely work. ## Future work @@ -154,4 +162,12 @@ It is understandable: Minio has not been designed for environment with high late - try to better understand ecosystem (riak cs, minio, ceph, swift) -> some knowledge to get -[^1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know that it is also way more versatile. +## Notes + +[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know that it is also way more versatile. + +[^ref2]: The program name contains the word "billion" and we only tested Garage up to 1 "million" object, this is not a typo, we were just a little bit too enthusiast when we wrote it. + + -- cgit v1.2.3 From b6d01f81b2b1b0284f261ccfc249ec9d6bc5b35a Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Tue, 27 Sep 2022 17:47:08 +0200 Subject: Reword intro, notes for conclusion --- content/blog/2022-perf/index.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 025bb3a..d2f1d61 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -4,7 +4,7 @@ date=2022-09-26 +++ -*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, being it on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance measurement to see how Garage behaves compared to our expectations.* +*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, being it on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance measurement to see how Garage behaves compared to our expectations. We split them in 3 categories: "efficient I/O", "myriads of objects" and "resiliency" to reflect the high level properties we are seeking.* @@ -152,16 +152,18 @@ By looking at the server resource usage, we saw that their load started to becom For now, we are confident that a Garage cluster with 100+ nodes should definitely work. -## Future work +## Conclusion and Future work -- srpt +Identified some sensitive points: fsync, metadata engine, raw i/o. +At the same time, validated important performance improvements (ttfb, minio warp, metadata engine - including less resource usage) while keeping our versatility (network/nodes). +- srpt - better analysis of the fsync / data reliability impact - - analysis and comparison of Garage at scale - - try to better understand ecosystem (riak cs, minio, ceph, swift) -> some knowledge to get + + ## Notes [^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know that it is also way more versatile. -- cgit v1.2.3 From 026402cae346491c4578e6ee0a1f1073cb3c5934 Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 28 Sep 2022 09:56:48 +0200 Subject: Added the conclusion --- content/blog/2022-perf/index.md | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index d2f1d61..75aec67 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -79,7 +79,7 @@ LMDB looks very promising both in term of performances and resource usage, it is a very good candidate for Garage's default metadata engine in the future, and we need to define a data policy for Garage that would help us arbitrate between performances and durability. -*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occure, what is our policy in this case? For storage durability, we are already supposing that we never loose the storage of more than 2 nodes, should we also expect that we don't loose more than 2 nodes at the same time? What to think about people hosting all their nodes at the same place without an UPS?* +*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occure, what is our policy in this case? For storage durability, we are already supposing that we never loose the storage of more than 2 nodes, should we also expect that we don't loose power on more than 2 nodes at the same time? What should we think about people hosting all their nodes at the same place without an UPS? Historically, it seems that Minio developers also accepted some compromises on this side ([#3536](https://github.com/minio/minio/issues/3536), [HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that check only data and not metadata are persisted on disk - in combination with `O_DIRECT` for direct I/O ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* **Storing million of objects** - Object storage systems are designed not only for data durability and availability, but also for scalability. Following this observation, some people asked us how scalable Garage is. If answering this question is out of scope of this study, we wanted to @@ -154,14 +154,17 @@ For now, we are confident that a Garage cluster with 100+ nodes should definitel ## Conclusion and Future work -Identified some sensitive points: fsync, metadata engine, raw i/o. -At the same time, validated important performance improvements (ttfb, minio warp, metadata engine - including less resource usage) while keeping our versatility (network/nodes). +During this work, we identified some sensitive points on Garage we will continue working on: our data durability target and interaction with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across our components, our new metadata engines (lmdb, sqlite) still need some testing and tuning, and we know that raw I/O (GetObject, PutObject) have a small improvement margin. -- srpt -- better analysis of the fsync / data reliability impact -- analysis and comparison of Garage at scale -- try to better understand ecosystem (riak cs, minio, ceph, swift) -> some knowledge to get +At the same time, Garage has never been better: its next version (v0.8) will see drastic improvements on term of performances and reliability. +We are confident that it is already be able to cover a wide range of deployment needs, up to hundredth of nodes, millions of objects, and so on. +In the future, on the performance aspect, we would like to evaluate the impact of introducing an SRPT scheduler ([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), +define a data durability policy and implement it, make a deeper and larger review of the state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to learn from them, +and finally, benchmark Garage at scale with possibly multiple terabytes of data on a long lasting experiments. + +In the mean time, stay tuned: we have released [a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1), we are working +on proving and explaining our layout algorithm ([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are working on a Python SDK for Garage's administration API ([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will soon introduce officially and explain why we created and published as a technical preview a new API named K2V ([see K2V on our doc](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). ## Notes -- cgit v1.2.3 From e0ab07baee2bcd77b0e70b98d7930e92a6ebbf0a Mon Sep 17 00:00:00 2001 From: Quentin Dufour Date: Wed, 28 Sep 2022 11:03:53 +0200 Subject: Grammarly proof check --- content/blog/2022-perf/index.md | 120 +++++++++++++++++++++------------------- 1 file changed, 64 insertions(+), 56 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 75aec67..c406890 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -4,7 +4,7 @@ date=2022-09-26 +++ -*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, being it on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance measurement to see how Garage behaves compared to our expectations. We split them in 3 categories: "efficient I/O", "myriads of objects" and "resiliency" to reflect the high level properties we are seeking.* +*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, especially on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance measurements to see how Garage behaves compared to our expectations. We split them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency" to reflect the high-level properties we are seeking.* @@ -13,134 +13,142 @@ date=2022-09-26 ## ⚠️ Disclaimer The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as exhaustively as possible in this section, but other limitations might exist. -Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could possibly be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. Furthermore, we only benchmarked some very specific aspects of Garage, on which we are currently working on: our results are thus not an overview of the whole software performances. +Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. Furthermore, we only benchmarked some very specific aspects of Garage: our results are thus not an overview of the whole software performance. For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage, which can lead to underrated performance measurements for Minio. -It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such feature have necessarily a cost that you must keep in mind when reading plots. You should consider Minio results as a way to contextualize our results, to check that our improvements are not artificials compared to existing object storage implementations. +It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such features have necessarily a cost that you must keep in mind when reading plots. You should consider Minio results as a way to contextualize our results, to check that our improvements are not artificials compared to existing object storage implementations. -Impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significative. +The impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significant. -When reading this post, please keep in mind that **we are not making any business or technical recommendation here, this is not a scientific paper either**; we only share bits of our development process as honestly as possible. +When reading this post, please keep in mind that **we are not making any business or technical recommendations here, this is not a scientific paper either**; we only share bits of our development process as honestly as possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), make your own tests if you need to take a decision, and remain supportive and caring with your peers... ## About our testing environment We started a batch of tests on [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible testbed for experiment-driven research in all areas of computer science, under the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. During our tests, we used part of the following clusters: [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a geo-distributed topology. We used the Grid5000 testbed only during our preliminary tests to identify issues when running Garage on many powerful servers, issues that we then reproduced in a controlled environment; don't be surprised then if Grid5000 is not mentioned often on our plots. -To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^ref1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load. +To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^ref1]. Most of the following tests were thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load. ## Efficient I/O +The main goal of an object storage system is to store or retrieve an object across the network, and the faster, the better. +For this analysis, we focus on 2 aspects: time to first byte, as many applications can start processing a file before receiving it completely, +and generic throughput, to understand how well Garage can leverage the underlying machine performances. + + **Time To First Byte** - One specificity of Garage is that we implemented S3 web endpoints, with the idea to make it the platform of choice to publish your static website. When publishing a website, one metric you observe is Time To First Byte (TTFB), as it will impact the perceived reactivity of your website. On Garage, time to first byte was a bit high. -This is not surprising as, until now, the smallest level of granularity was blocks. Blocks are 1MB chunks (this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)) of a given object. For example, a 4.5MB object will be split in 4 blocks of 1MB and 1 block of 0.5MB. With this design, when you were sending a GET request, the first block had to be fully retrieved by the gateway node from the storage node before starting sending any data to the client. +This is not surprising as, until now, the smallest level of granularity internally was handling full blocks. Blocks are 1MB chunks (this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)) of a given object. For example, a 4.5MB object will be split into 4 blocks of 1MB and 1 block of 0.5MB. With this design, when you were sending a GET request, the first block had to be fully retrieved by the gateway node from the storage node before starting to send any data to the client. -With Garage v0.8, we integrated a block streaming logic which allows the gateway to send the beginning of a block without having to wait for the full block from the storage node. We can visually represent the difference as follow: +With Garage v0.8, we integrated a block streaming logic that allows the gateway to send the beginning of a block without having to wait for the full block from the storage node. We can visually represent the difference as follow: ![A schema depicting how streaming improves the delivery of a block](schema-streaming.png) -As our default block size is only 1MB, the difference will be very small on fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, on a very slow network (or a very congested link with many parallel requests handled), the impact can be much more important: at 5Mbps, it takes 1.6 second to transfer our 1MB block, and streaming could heavily improve user experience. +As our default block size is only 1MB, the difference will be very small on fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, on a very slow network (or a very congested link with many parallel requests handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds to transfer our 1MB block, and streaming could heavily improve user experience. -We wanted to see if this theory helds in practise: we simulated a low latency but slow network on mknet and did some request with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference. To benchmark this behavior, we wrote a small test named [s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb), its results are depicted on the following figure. +We wanted to see if this theory holds in practice: we simulated a low latency but slow network on mknet and did some requests with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference. To benchmark this behavior, we wrote a small test named [s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb), its results are depicted in the following figure. ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png) -Garage v0.7, that does not support block streaming, features TTFB between 1.6s and 2s, which correspond to the theoretical time to transfer the full block. On the other side of the plot, Garage v0.8 has very low TTFB thanks to the streaming feature (the lowest value is 43 ms). Minio sits between the two Garage versions: we suppose that it does some form of batching, but smaller than 1MB. +Garage v0.7, which does not support block streaming, features TTFB between 1.6s and 2s, which corresponds to the theoretical time to transfer the full block. On the other side of the plot, Garage v0.8 has a very low TTFB thanks to the streaming feature (the lowest value is 43 ms). Minio sits between the two Garage versions: we suppose that it does some form of batching, but smaller than 1MB. -**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted logic we wrote (to better control resources usage and detect errors, like semaphore or timeouts) were artificially limiting performances. In another iteration, we made this logic less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. +**Throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted logic we wrote (to better control resource usage and detect errors, like semaphores or timeouts) was artificially limiting performances. In another iteration, we made this logic less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. -*A note about fsync: for performance reasons, operating systems often do not write directly to the disk when a process creates or updates a file in your filesystem, instead, the write is kept in memory, and flushed later in a batch with other writes. If a power loss occures before the OS has time to flush the writes on the disk, data will be lost. To ensure that a write is effectively written on disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it will block until your file or directory has been written from your volatile RAM memory to your persisting storage device. Additionaly, the exact semantic of fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even on battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that on Garage, we are currently working on our "fsync" policy and thus, for now, you should expect limited data durability in case of power loss, as we are aware of some inconsistency on this point (which we describe in the following and plan to solve).* +*A note about fsync: for performance reasons, operating systems often do not write directly to the disk when a process creates or updates a file in your filesystem, instead, the write is kept in memory, and flushed later in a batch with other writes. If a power loss occurs before the OS has time to flush the writes on the disk, data will be lost. To ensure that a write is effectively written on disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it will block until your file or directory has been written from your volatile memory to your persisting storage device. Additionally, the exact semantic of fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even on battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that on Garage, we are currently working on our "fsync" policy and thus, for now, you should expect limited data durability in case of power loss, as we are aware of some inconsistency on this point (which we describe in the following and plan to solve).* -To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload. +To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small-scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload. ![Plot showing IO perf of Garage configs and Minio](io.png) -Minio, our ground truth, features the best performances in this test. Considering Garage, we observe that each improvement we made has a visible impact on its performances. We also note that we have a progress margin in term of performances compared to Minio: additional benchmarks, test and monitoring could help better understand the remaining difference. +Minio, our ground truth, features the best performances in this test. Considering Garage, we observe that each improvement we made has a visible impact on its performances. We also note that we have a progress margin in terms of performances compared to Minio: additional benchmarks, tests, and monitoring could help better understand the remaining difference. ## A myriad of objects +Object storage systems do not handle a single object but a myriad of them: Amazon claims to handle trillions of objects on their platform, and Red Hat communicates about Ceph being able to handle 10 billion objects. All these objects must be tracked efficiently in the system to be fetched, listed, removed, etc. In Garage, we use a "metadata engine" component to track them. For this analysis, we compare different metadata engines in Garage and see how well the best one scale to a million objects. + **Testing metadata engines** - With Garage, we chose to not store metadata directly on the filesystem, like Minio for example, but in an on-disk fancy B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: lmdb ([heed](https://github.com/meilisearch/heed)) and sqlite ([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they are both experimental: contrarily to sled, we have never run them in production for a long time.** -Similarly to the impact of fsync on block writing, each database engine we use has its own policy with fsync. Sled flushes its write every 2 seconds by default, this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). lmdb by default does an `fsync` on each write, on early tests it lead to very slow resynchronizations between nodes. We added 2 flags: [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) and [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16) that basically deactivate fsync. On sqlite, it is also possible to deactivate fsync with `pragma synchronous = off;`, but we did not start any optimization work on it: our sqlite implementation fsync all the data on the disk. Additionaly, we are using these engine through a Rust binding that had to do some tradeoff on the concurrency part. **Our comparison will not reflect the raw performances of these database engine, but instead, our integration choices.** +Similarly to the impact of fsync on block writing, each database engine we use has its own policy with fsync. Sled flushes its write every 2 seconds by default, this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). lmdb by default does an `fsync` on each write, on early tests it led to very slow resynchronizations between nodes. We added 2 flags: [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) and [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16) which deactivate fsync. On sqlite, it is also possible to deactivate fsync with `pragma synchronous = off;`, but we did not start any optimization work on it: our sqlite implementation fsync all the data on the disk. Additionally, we are using these engines through a Rust binding that had to do some tradeoff on the concurrency part. **Our comparison will not reflect the raw performances of these database engines, but instead, our integration choices.** Still, we think it makes sense to evaluate our implementations in their current state in Garage. We designed a benchmark that is intensive on the metadata part of the software, ie. handling tiny files. We chose again minio/warp but we configure it with the smallest possible object size supported by warp, 256 bytes, to put some pressure on the metadata engine. We evaluate sled twice: with its default configuration, and with a configuration where we set a flush interval of 10 minutes to disable fsync. -*Note that S3 has not been designed for such small objects; a regular database, like Cassandra, would be more appropriate for such workloads. This test has only be designed to stress our metadata engine, it is not indicative of real world performances.* +*Note that S3 has not been designed for such small objects; a regular database, like Cassandra, would be more appropriate for such workloads. This test has only been designed to stress our metadata engine, it is not indicative of real-world performances.* ![Plot of our metadata engines comparison with Warp](db_engine.png) -Unsurprinsingly, we observe abysall performances for sqlite, the engine we have the less tested and kept fsync for each write. -lmdb performs twice better than default sled and 60% better than no fsync sled in our benchmark. -Additionaly, and not depicted on these plots, LMDB uses way less disk storage and RAM; we would like to quantify that in the future. -As we are only at the very beginning of our work on metadata engine, it is hard to draw strong conclusions. +Unsurprisingly, we observe abysmal performances for sqlite, the engine we have the less tested and kept fsync for each write. +lmdb performs twice better than sled in its default version and 60% better than the "no fsync" version in our benchmark. +Furthermore, and not depicted on these plots, LMDB uses way less disk storage and RAM; we would like to quantify that in the future. +As we are only at the very beginning of our work on metadata engines, it is hard to draw strong conclusions. Still, we can say that sqlite is not ready for production workloads, -LMDB looks very promising both in term of performances and resource usage, +LMDB looks very promising both in terms of performances and resource usage, it is a very good candidate for Garage's default metadata engine in the future, and we need to define a data policy for Garage that would help us arbitrate between performances and durability. -*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occure, what is our policy in this case? For storage durability, we are already supposing that we never loose the storage of more than 2 nodes, should we also expect that we don't loose power on more than 2 nodes at the same time? What should we think about people hosting all their nodes at the same place without an UPS? Historically, it seems that Minio developers also accepted some compromises on this side ([#3536](https://github.com/minio/minio/issues/3536), [HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that check only data and not metadata are persisted on disk - in combination with `O_DIRECT` for direct I/O ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* +*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occur, what is our policy in this case? For storage durability, we are already supposing that we never lose the storage of more than 2 nodes, should we also expect that we don't lose power on more than 2 nodes at the same time? What should we think about people hosting all their nodes at the same place without a UPS? Historically, it seems that Minio developers also accepted some compromises on this side ([#3536](https://github.com/minio/minio/issues/3536), [HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures only data and not metadata are persisted on disk - in combination with `O_DIRECT` for direct I/O ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* -**Storing million of objects** - Object storage systems are designed not only for data durability and availability, but also for scalability. -Following this observation, some people asked us how scalable Garage is. If answering this question is out of scope of this study, we wanted to -be sure that our metadata engine would be able to scale to million of objects. To put this target in context, it remains small compared to other industrial solutions: -Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), which is 4 order of magnitude more than our current target. Of course, their benchmarking setup has nothing in common with ours, and their tests are way more exhaustive. +**Storing a million objects** - Object storage systems are designed not only for data durability and availability but also for scalability. +Following this observation, some people asked us how scalable Garage is. If answering this question is out of the scope of this study, we wanted to +be sure that our metadata engine would be able to scale to a million objects. To put this target in context, it remains small compared to other industrial solutions: +Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), which is 4 orders of magnitude more than our current target. Of course, their benchmarking setup has nothing in common with ours, and their tests are way more exhaustive. -We wrote our own benchmarking tool [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2] for this test. It concurrently sends a defined number of very tiny object (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of time (128 by default) to effectively create a certain number of objects on the target cluster (1M by default). -On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. On the following plot, we show how many times it took to Garage and Minio to handle each batch. +We wrote our own benchmarking tool for this test, [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2]. It concurrently sends a defined number of very tiny objects (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of times (128 by default) to effectively create a certain number of objects on the target cluster (1M by default). +On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. In the following plot, we show how many times it took to Garage and Minio to handle each batch. Before looking at the plot, **you must keep in mind some important points about Minio and Garage internals**. -Minio has no metadata engine, it stores its objects directly on the filesystem. Sending 1 million objects on Minio results in creating one million inodes on the storage node in our current setup. So the performance of your filesystem will probably impact a lot the results you will observe; we know the filesystem we used is not adapted at all for Minio (encryption layer, fixed number of inodes, etc.). Additionaly, we mentioned earlier that we deactivated fsync for our metadata engine, minio might have some fsync logic here slowing down the creation of objects. Finally, object storage is designed for big objects: this cost is negligible with bigger objects. In the end, again, we use Minio as a reference to understand what are our performance budget for each part of our software. +Minio has no metadata engine, it stores its objects directly on the filesystem. Sending 1 million objects on Minio results in creating one million inodes on the storage node in our current setup. So the performance of your filesystem will probably substantially impact the results you will observe; we know the filesystem we used is not adapted at all for Minio (encryption layer, fixed number of inodes, etc.). Additionally, we mentioned earlier that we deactivated fsync for our metadata engine, minio has some fsync logic here slowing down the creation of objects. Finally, object storage is designed for big objects: this cost is negligible with bigger objects. In the end, again, we use Minio as a reference to understand what is our performance budget for each part of our software. -Conversely, Garage has a metadata engine with a special optimization for small objects. Below 3KB, a block is not created on the filesystem but the object is directly stored inline in the metadata engine. +Conversely, Garage has an optimization for small objects. Below 3KB, a block is not created on the filesystem but the object is directly stored inline in the metadata engine. In the future, we plan to evaluate how Garage behaves with 3KB+ objects at scale, probably way closer to Minio, as it will have to create an inode for each object. -For now, we limit ourselves at evaluating our metadata engine, and thus focus only on 16-byte objects. +For now, we limit ourselves to evaluating our metadata engine and thus focus only on 16-byte objects. ![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png) -It appears that performances of our metadata engine are acceptables, as we have a comfortable margin compared to Minio (Minio is between 3x and 4x times slower per batch). +It appears that the performances of our metadata engine are acceptable, as we have a comfortable margin compared to Minio (Minio is between 3x and 4x times slower per batch). We also note that, past 200k objects, Minio batch completion time is constant as Garage's one remains linear: it could be interesting to know if Garage batch's completion time would cross Minio's one for a very large number of objects. -If we reason per object, both Minio and Garage performances remains very good: it takes respectively around 20ms and 5ms to create an object. -At 100 Mbps, if you upload a 10MB file, the upload will take 800ms, for a 100MB files, it goes up to 8sec; in both cases handling the object metadata is only a fraction of the upload time. -The only cases where you could notice it would be if you upload lot of very small files at once, which again, is an unsual usage of the S3 API. +If we reason per object, both Minio and Garage performances remain very good: it takes respectively around 20ms and 5ms to create an object. +At 100 Mbps, if you upload a 10MB file, the upload will take 800ms, for a 100MB file, it goes up to 8sec; in both cases handling the object metadata is only a fraction of the upload time. +The only cases where you could notice it would be if you upload a lot of very small files at once, which again, is an unusual usage of the S3 API. Next, we focus on Garage's data only to better see its specific behavior: ![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png) -Two effects are now more visibles: 1. batch completion time is linear with the number of objects in the bucket and 2. measurements are dispersed, at least more than Minio. +Two effects are now more visible: 1. batch completion time is linear with the number of objects in the bucket and 2. measurements are dispersed, at least more than Minio. We discussed the first point previously but not the second one on measurement dispersion. -This instability could be an issue as it could be a symptom of what we saw with some other experiments in this machine: sometime it freezes under heavy I/O operations. -Such freezes could lead to request timeouts and failures. If it occures on our testing computer, it will occure on other servers too: it could be interesting to better understand this issue, -document how to avoid it or change how we handle our I/O. -At the same time, this was a very stressful test that will probably not be encountered in many setups: we were adding 273 object per seconds for 30 minutes! +This instability could be an issue as it could be a symptom of what we saw with some other experiments in this machine: sometimes it freezes under heavy I/O operations. +Such freezes could lead to request timeouts and failures. If it occurs on our testing computer, it will occur on other servers too: it could be interesting to better understand this issue, +document how to avoid it, or change how we handle our I/O. +At the same time, this was a very stressful test that will probably not be encountered in many setups: we were adding 273 objects per second for 30 minutes! -As a conclusion, Garage can ingest 1 million tiny objects in 30 minutes in a very restricted environment. As a comparison, our production cluster at [deuxfleurs.fr](https://deuxfleurs) manages a bucket with 116k objects. This bucket contains real data as it is used by our Matrix instance to store people's media files (profile picture, shared pictures, videos, audios, documents...). Thanks to this benchmark, we have identified two points of vigilance: putting object duration seems linear with the number of existing objects in the cluster, and we have some volatility in our measured data that could be a symptom of our system freezing under the load. Despite these two points, we are confident that Garage could scale way above 1M+ objects, but it remains to be proved! +To conclude this part, Garage can ingest 1 million tiny objects while remaining usable on our local setup. To put this result in perspective, our production cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with 116k objects. This bucket contains real data: it is used by our Matrix instance to store people's media files (profile pictures, shared pictures, videos, audios, documents...). Thanks to this benchmark, we have identified two points of vigilance: putting object duration seems linear with the number of existing objects in the cluster, and we have some volatility in our measured data that could be a symptom of our system freezing under the load. Despite these two points, we are confident that Garage could scale way above 1M+ objects, but it remains to be proved! ## In an unpredictable world, stay resilient -**Latency amplification** - We designed Garage with low-tech geo-distributed setups in mind. For example, our production cluster is hosted [on old Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) behind consumer-grade fiber links across France and Belgium. With these kind of networks, the observed latency is in the 50ms range between nodes. +Supporting a variety of network properties and computers, especially ones that were not designed for software-defined storage or even server purposes, is the core value proposition of Garage. +For example, our production cluster is hosted [on refurbished Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) behind consumer-grade fiber links across France and Belgium - if you are reading this, congratulation, you fetched this webpage from it! That's why we are very careful that our internal protocol (named RPC protocol in our documentation) remains as lightweight as possible. For this analysis, we quantify how network latency and the number of nodes in the cluster impact S3 main requests duration. -When latency is not negligible, you will observe that your requests completion time is a factor of your observed latency. That's expected: in many cases, the node of the cluster you are contacting can not directly answer your request, it needs to reach other nodes of the cluster to get your information. Each sequential request it does add to the final request duration, which can quickly become expensive. -This ratio between request duration and network latency is what we refer as *latency amplification*. +**Latency amplification** - With the kind of networks we use (consumer-grade fiber links across the EU), the observed latency is in the 50ms range between nodes. When latency is not negligible, you will observe that request completion time is a factor of the observed latency. That's expected: in many cases, the node of the cluster you are contacting can not directly answer your request, it needs to reach other nodes of the cluster to get your information. Each sequential RPC adds to the final S3 request duration, which can quickly become expensive. +This ratio between request duration and network latency is what we refer to as *latency amplification*. For example, on Garage, a GetObject request does two sequential calls: first, it asks for the descriptor of the requested object containing the block list of the requested object, then it retrieves its blocks. We can expect that the request duration of a small GetObject request will be close to twice the network latency. -We tested this theory with another benchmark of our own named [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) that does a single request at a time on an endpoint and measure its response time. As we are not interested in bandwidth but latency, all our requests involving an object are made on a tiny file of around 16 bytes. Our benchmark tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and RemoveObject. Its results are plotted here: +We tested this theory with another benchmark of our own named [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) which does a single request at a time on an endpoint and measures its response time. As we are not interested in bandwidth but latency, all our requests involving an object are made on a tiny file of around 16 bytes. Our benchmark tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and RemoveObject. Its results are plotted here: ![Latency amplification](amplification.png) As Garage has been optimized for this use case from the beginning, we don't see any significant evolution from one version to another (garage v0.7.3 and garage v0.8.0 beta here). Compared to Minio, these values are either similar (for ListObjects and ListBuckets) or way better (for GetObject, PutObject, and RemoveObject). -It is understandable: Minio has not been designed for environment with high latencies, you are expected to build your clusters in the same datacenter, and then possibly connect them with their asynchronous [Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) feature. +It is understandable: Minio has not been designed for environments with high latencies, you are expected to build your clusters in the same datacenter, and then possibly connect them with their asynchronous [Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) feature. *Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) but it is even more sensitive to latency: "Multi-site replication has increased latency sensitivity, as MinIO does not consider an object as replicated until it has synchronized to all configured remote targets. Replication latency is therefore dictated by the slowest link in the replication mesh."* -**A cluster with many nodes** - Whether you already have many compute nodes with unused storage, need to store lot of data, or experiment with unusual system architecture, you might want to deploy hundredth of Garage nodes. However, on some distributed systems, the number of nodes in the cluster will impact performances. Theoretically, our protocol inspired by distributed hashtables (DHT) should scale fairly well but we never took the time to test it with hundredth of nodes before. +**A cluster with many nodes** - Whether you already have many compute nodes with unused storage, need to store a lot of data, or experiment with unusual system architecture, you might want to deploy a hundredth of Garage nodes. However, in some distributed systems, the number of nodes in the cluster will impact performance. Theoretically, our protocol inspired by distributed hashtables (DHT) should scale fairly well but we never took the time to test it with a hundredth of nodes before. This time, we did our test directly on Grid5000 with 6 physical servers spread in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up to 65 instances of Garage simultaneously (for a total of 390 nodes). The network between the physical server is the dedicated network provided by Grid5000 operators. Nodes on the same physical machine communicate directly through the Linux network stack without any limitation: we are aware this is a weakness of this test. We still think that this test can be relevant as, at each step in the test, each instance of Garage has 83% (5/6) of its connections that are made over a real network. To benchmark each cluster size, we used [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) again: @@ -148,23 +156,23 @@ This time, we did our test directly on Grid5000 with 6 physical servers spread i ![Impact of response time with bigger clusters](complexity.png) Up to 250 nodes observed response times remain constant. After this threshold, results become very noisy. -By looking at the server resource usage, we saw that their load started to become non negligible: it seems that we are not hitting a limit at the protocol side but we have simply exhausted the ressource of our testing nodes. In the future, we would like to run this experiment again, but on way more physical nodes, to confirm our hypothesis. -For now, we are confident that a Garage cluster with 100+ nodes should definitely work. +By looking at the server resource usage, we saw that their load started to become non-negligible: it seems that we are not hitting a limit on the protocol side but we have simply exhausted the resource of our testing nodes. In the future, we would like to run this experiment again, but on way more physical nodes, to confirm our hypothesis. +For now, we are confident that a Garage cluster with 100+ nodes should work. ## Conclusion and Future work During this work, we identified some sensitive points on Garage we will continue working on: our data durability target and interaction with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across our components, our new metadata engines (lmdb, sqlite) still need some testing and tuning, and we know that raw I/O (GetObject, PutObject) have a small improvement margin. -At the same time, Garage has never been better: its next version (v0.8) will see drastic improvements on term of performances and reliability. -We are confident that it is already be able to cover a wide range of deployment needs, up to hundredth of nodes, millions of objects, and so on. +At the same time, Garage has never been better: its next version (v0.8) will see drastic improvements in terms of performance and reliability. +We are confident that it is already able to cover a wide range of deployment needs, up to a hundredth of nodes and millions of objects. In the future, on the performance aspect, we would like to evaluate the impact of introducing an SRPT scheduler ([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), -define a data durability policy and implement it, make a deeper and larger review of the state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to learn from them, -and finally, benchmark Garage at scale with possibly multiple terabytes of data on a long lasting experiments. +define a data durability policy and implement it, and make a deeper and larger review of the state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to learn from them, +and finally, benchmark Garage at scale with possibly multiple terabytes of data and billions of objects on long-lasting experiments. -In the mean time, stay tuned: we have released [a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1), we are working -on proving and explaining our layout algorithm ([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are working on a Python SDK for Garage's administration API ([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will soon introduce officially and explain why we created and published as a technical preview a new API named K2V ([see K2V on our doc](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). +In the meantime, stay tuned: we have released [a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1), and we are working +on proving and explaining our layout algorithm ([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also working on a Python SDK for Garage's administration API ([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will soon introduce officially a new API (as a technical preview) named K2V ([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). ## Notes -- cgit v1.2.3 From 6ca4943ec04735b5e6988f7129472adee4142041 Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Wed, 28 Sep 2022 14:42:47 +0200 Subject: Wrap text to make future diffs more readable (no content changed) --- content/blog/2022-perf/index.md | 534 +++++++++++++++++++++++++++++++--------- 1 file changed, 420 insertions(+), 114 deletions(-) (limited to 'content/blog') diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index c406890..b3caf31 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -4,182 +4,488 @@ date=2022-09-26 +++ -*For the past years, we have extensively analyzed possible design decisions and their theoretical tradeoffs on Garage, especially on the network, data structure, or scheduling side. And it worked well enough for our production cluster at Deuxfleurs, but we also knew that people started discovering some unexpected behaviors. We thus started a round of benchmark and performance measurements to see how Garage behaves compared to our expectations. We split them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency" to reflect the high-level properties we are seeking.* +*For the past years, we have extensively analyzed possible design decisions and +their theoretical tradeoffs on Garage, especially on the network, data +structure, or scheduling side. And it worked well enough for our production +cluster at Deuxfleurs, but we also knew that people started discovering some +unexpected behaviors. We thus started a round of benchmark and performance +measurements to see how Garage behaves compared to our expectations. We split +them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency" +to reflect the high-level properties we are seeking.* --- ## ⚠️ Disclaimer -The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as exhaustively as possible in this section, but other limitations might exist. -Most of our tests are done on simulated networks that can not represent all the diversity of real networks (dynamic drop, jitter, latency, all of them could be correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. Furthermore, we only benchmarked some very specific aspects of Garage: our results are thus not an overview of the whole software performance. - -For some benchmarks, we used Minio as a reference. It must be noted that we did not try to optimize its configuration as we have done on Garage, and more generally, we have way less knowledge on Minio than on Garage, which can lead to underrated performance measurements for Minio. -It must also be noted that Garage and Minio are systems with different feature sets, *eg.* Minio supports erasure coding for better data density while Garage doesn't, Minio implements way more S3 endpoints than Garage, etc. Such features have necessarily a cost that you must keep in mind when reading plots. You should consider Minio results as a way to contextualize our results, to check that our improvements are not artificials compared to existing object storage implementations. - -The impact of the testing environment is also not evaluated (kernel patches, configuration, parameters, filesystem, hardware configuration, etc.), some of these configurations could favor one configuration/software over another. Especially, it must be noted that most of the tests were done on a consumer-grade computer and SSD only, which will be different from most production setups. Finally, our results are also provided without statistical tests to check their significance, and thus might be statistically not significant. - -When reading this post, please keep in mind that **we are not making any business or technical recommendations here, this is not a scientific paper either**; we only share bits of our development process as honestly as possible. -Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), make your own tests if you need to take a decision, and remain supportive and caring with your peers... +The following results must be taken with a critical grain of salt due to some +limitations that are inherent to any benchmark. We try to reference them as +exhaustively as possible in this section, but other limitations might exist. + +Most of our tests are done on simulated networks that can not represent all the +diversity of real networks (dynamic drop, jitter, latency, all of them could be +correlated with throughput or any other external event). We also limited +ourselves to very small workloads that are not representative of a production +cluster. Furthermore, we only benchmarked some very specific aspects of Garage: +our results are thus not an overview of the whole software performance. + +For some benchmarks, we used Minio as a reference. It must be noted that we did +not try to optimize its configuration as we have done on Garage, and more +generally, we have way less knowledge on Minio than on Garage, which can lead +to underrated performance measurements for Minio. It must also be noted that +Garage and Minio are systems with different feature sets, *eg.* Minio supports +erasure coding for better data density while Garage doesn't, Minio implements +way more S3 endpoints than Garage, etc. Such features have necessarily a cost +that you must keep in mind when reading plots. You should consider Minio +results as a way to contextualize our results, to check that our improvements +are not artificials compared to existing object storage implementations. + +The impact of the testing environment is also not evaluated (kernel patches, +configuration, parameters, filesystem, hardware configuration, etc.), some of +these configurations could favor one configuration/software over another. +Especially, it must be noted that most of the tests were done on a +consumer-grade computer and SSD only, which will be different from most +production setups. Finally, our results are also provided without statistical +tests to check their significance, and thus might be statistically not +significant. + +When reading this post, please keep in mind that **we are not making any +business or technical recommendations here, this is not a scientific paper +either**; we only share bits of our development process as honestly as +possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), +make your own +tests if you need to take a decision, and remain supportive and caring with +your peers... ## About our testing environment -We started a batch of tests on [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible testbed for experiment-driven research in all areas of computer science, under the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. During our tests, we used part of the following clusters: [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a geo-distributed topology. We used the Grid5000 testbed only during our preliminary tests to identify issues when running Garage on many powerful servers, issues that we then reproduced in a controlled environment; don't be surprised then if Grid5000 is not mentioned often on our plots. - -To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^ref1]. Most of the following tests were thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load. +We started a batch of tests on +[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible +testbed for experiment-driven research in all areas of computer science, under +the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. +During our tests, we used part of the following clusters: +[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), +[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and +[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a +geo-distributed topology. We used the Grid5000 testbed only during our +preliminary tests to identify issues when running Garage on many powerful +servers, issues that we then reproduced in a controlled environment; don't be +surprised then if Grid5000 is not mentioned often on our plots. + +To reproduce some environments locally, we have a small set of Python scripts +named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our +needs[^ref1]. Most of the following tests were thus run locally with mknet on a +single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of +RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is +used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and +`vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default +values, the system tends to freeze when it is under heavy I/O load. ## Efficient I/O -The main goal of an object storage system is to store or retrieve an object across the network, and the faster, the better. -For this analysis, we focus on 2 aspects: time to first byte, as many applications can start processing a file before receiving it completely, -and generic throughput, to understand how well Garage can leverage the underlying machine performances. +The main goal of an object storage system is to store or retrieve an object +across the network, and the faster, the better. For this analysis, we focus on +2 aspects: time to first byte, as many applications can start processing a file +before receiving it completely, and generic throughput, to understand how well +Garage can leverage the underlying machine performances. -**Time To First Byte** - One specificity of Garage is that we implemented S3 web endpoints, with the idea to make it the platform of choice to publish your static website. When publishing a website, one metric you observe is Time To First Byte (TTFB), as it will impact the perceived reactivity of your website. On Garage, time to first byte was a bit high. +**Time To First Byte** - One specificity of Garage is that we implemented S3 +web endpoints, with the idea to make it the platform of choice to publish your +static website. When publishing a website, one metric you observe is Time To +First Byte (TTFB), as it will impact the perceived reactivity of your website. +On Garage, time to first byte was a bit high. -This is not surprising as, until now, the smallest level of granularity internally was handling full blocks. Blocks are 1MB chunks (this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)) of a given object. For example, a 4.5MB object will be split into 4 blocks of 1MB and 1 block of 0.5MB. With this design, when you were sending a GET request, the first block had to be fully retrieved by the gateway node from the storage node before starting to send any data to the client. +This is not surprising as, until now, the smallest level of granularity +internally was handling full blocks. Blocks are 1MB chunks (this is +[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)) +of a given object. For example, a 4.5MB object will be split into 4 blocks of +1MB and 1 block of 0.5MB. With this design, when you were sending a GET +request, the first block had to be fully retrieved by the gateway node from the +storage node before starting to send any data to the client. -With Garage v0.8, we integrated a block streaming logic that allows the gateway to send the beginning of a block without having to wait for the full block from the storage node. We can visually represent the difference as follow: +With Garage v0.8, we integrated a block streaming logic that allows the gateway +to send the beginning of a block without having to wait for the full block from +the storage node. We can visually represent the difference as follow: ![A schema depicting how streaming improves the delivery of a block](schema-streaming.png) -As our default block size is only 1MB, the difference will be very small on fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, on a very slow network (or a very congested link with many parallel requests handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds to transfer our 1MB block, and streaming could heavily improve user experience. +As our default block size is only 1MB, the difference will be very small on +fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, +on a very slow network (or a very congested link with many parallel requests +handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds +to transfer our 1MB block, and streaming could heavily improve user experience. -We wanted to see if this theory holds in practice: we simulated a low latency but slow network on mknet and did some requests with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference. To benchmark this behavior, we wrote a small test named [s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb), its results are depicted in the following figure. +We wanted to see if this theory holds in practice: we simulated a low latency +but slow network on mknet and did some requests with (garage v0.8 beta) and +without (garage v0.7) block streaming. We also added Minio as a reference. To +benchmark this behavior, we wrote a small test named +[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb), +its results are depicted in the following figure. ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png) -Garage v0.7, which does not support block streaming, features TTFB between 1.6s and 2s, which corresponds to the theoretical time to transfer the full block. On the other side of the plot, Garage v0.8 has a very low TTFB thanks to the streaming feature (the lowest value is 43 ms). Minio sits between the two Garage versions: we suppose that it does some form of batching, but smaller than 1MB. - -**Throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted logic we wrote (to better control resource usage and detect errors, like semaphores or timeouts) was artificially limiting performances. In another iteration, we made this logic less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot. - -*A note about fsync: for performance reasons, operating systems often do not write directly to the disk when a process creates or updates a file in your filesystem, instead, the write is kept in memory, and flushed later in a batch with other writes. If a power loss occurs before the OS has time to flush the writes on the disk, data will be lost. To ensure that a write is effectively written on disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it will block until your file or directory has been written from your volatile memory to your persisting storage device. Additionally, the exact semantic of fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even on battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that on Garage, we are currently working on our "fsync" policy and thus, for now, you should expect limited data durability in case of power loss, as we are aware of some inconsistency on this point (which we describe in the following and plan to solve).* - -To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small-scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload. +Garage v0.7, which does not support block streaming, features TTFB between 1.6s +and 2s, which corresponds to the theoretical time to transfer the full block. +On the other side of the plot, Garage v0.8 has a very low TTFB thanks to the +streaming feature (the lowest value is 43 ms). Minio sits between the two +Garage versions: we suppose that it does some form of batching, but smaller +than 1MB. + +**Throughput** - As soon as we publicly released Garage, people started +benchmarking it, comparing its performances to writing directly on the +filesystem, and observed that Garage was slower (eg. +[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the +situation, we put costly processing like hashing on a dedicated thread and did +many compute optimization +([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), +[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to +`v0.8 beta 1`. We also noted logic we wrote (to better control resource usage +and detect errors, like semaphores or timeouts) was artificially limiting +performances. In another iteration, we made this logic less restrictive at the +cost of higher resource consumption under load +([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in +`v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we +write a block. We know that this is expensive and did a test build without any +`fsync` call ([see the +commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) +that will not be merged, just to assess the impact of `fsync`. We refer to it +as `no-fsync` in the following plot. + +*A note about fsync: for performance reasons, operating systems often do not +write directly to the disk when a process creates or updates a file in your +filesystem, instead, the write is kept in memory, and flushed later in a batch +with other writes. If a power loss occurs before the OS has time to flush the +writes on the disk, data will be lost. To ensure that a write is effectively +written on disk, you must use the +[fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it +will block until your file or directory has been written from your volatile +memory to your persisting storage device. Additionally, the exact semantic of +fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) +and, even on battle-tested software like Postgres, +[they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). +Note that on Garage, we are currently working on our "fsync" policy and thus, for +now, you should expect limited data durability in case of power loss, as we are +aware of some inconsistency on this point (which we describe in the following +and plan to solve).* + +To assess performance improvements, we used the benchmark tool +[minio/warp](https://github.com/minio/warp) in a non-standard configuration, +adapted for small-scale tests, and we kept only the aggregated result named +"cluster total". The goal of this experiment is to get an idea of the cluster +performance with a standardized and mixed workload. ![Plot showing IO perf of Garage configs and Minio](io.png) -Minio, our ground truth, features the best performances in this test. Considering Garage, we observe that each improvement we made has a visible impact on its performances. We also note that we have a progress margin in terms of performances compared to Minio: additional benchmarks, tests, and monitoring could help better understand the remaining difference. +Minio, our ground truth, features the best performances in this test. +Considering Garage, we observe that each improvement we made has a visible +impact on its performances. We also note that we have a progress margin in +terms of performances compared to Minio: additional benchmarks, tests, and +monitoring could help better understand the remaining difference. ## A myriad of objects -Object storage systems do not handle a single object but a myriad of them: Amazon claims to handle trillions of objects on their platform, and Red Hat communicates about Ceph being able to handle 10 billion objects. All these objects must be tracked efficiently in the system to be fetched, listed, removed, etc. In Garage, we use a "metadata engine" component to track them. For this analysis, we compare different metadata engines in Garage and see how well the best one scale to a million objects. - -**Testing metadata engines** - With Garage, we chose to not store metadata directly on the filesystem, like Minio for example, but in an on-disk fancy B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: lmdb ([heed](https://github.com/meilisearch/heed)) and sqlite ([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they are both experimental: contrarily to sled, we have never run them in production for a long time.** - -Similarly to the impact of fsync on block writing, each database engine we use has its own policy with fsync. Sled flushes its write every 2 seconds by default, this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). lmdb by default does an `fsync` on each write, on early tests it led to very slow resynchronizations between nodes. We added 2 flags: [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) and [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16) which deactivate fsync. On sqlite, it is also possible to deactivate fsync with `pragma synchronous = off;`, but we did not start any optimization work on it: our sqlite implementation fsync all the data on the disk. Additionally, we are using these engines through a Rust binding that had to do some tradeoff on the concurrency part. **Our comparison will not reflect the raw performances of these database engines, but instead, our integration choices.** - -Still, we think it makes sense to evaluate our implementations in their current state in Garage. We designed a benchmark that is intensive on the metadata part of the software, ie. handling tiny files. We chose again minio/warp but we configure it with the smallest possible object size supported by warp, 256 bytes, to put some pressure on the metadata engine. We evaluate sled twice: with its default configuration, and with a configuration where we set a flush interval of 10 minutes to disable fsync. - -*Note that S3 has not been designed for such small objects; a regular database, like Cassandra, would be more appropriate for such workloads. This test has only been designed to stress our metadata engine, it is not indicative of real-world performances.* +Object storage systems do not handle a single object but a myriad of them: +Amazon claims to handle trillions of objects on their platform, and Red Hat +communicates about Ceph being able to handle 10 billion objects. All these +objects must be tracked efficiently in the system to be fetched, listed, +removed, etc. In Garage, we use a "metadata engine" component to track them. +For this analysis, we compare different metadata engines in Garage and see how +well the best one scale to a million objects. + +**Testing metadata engines** - With Garage, we chose to not store metadata +directly on the filesystem, like Minio for example, but in an on-disk fancy +B-Tree structure, in other words, in an embedded database engine. Until now, +the only available option was [sled](https://sled.rs/), but we started having +serious issues with it, and we were not alone +([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage +v0.8, we introduce an abstraction semantic over the features we expect from our +database, allowing us to switch from one backend to another without touching +the rest of our codebase. We added two additional backends: lmdb +([heed](https://github.com/meilisearch/heed)) and sqlite +([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they +are both experimental: contrarily to sled, we have never run them in production +for a long time.** + +Similarly to the impact of fsync on block writing, each database engine we use +has its own policy with fsync. Sled flushes its write every 2 seconds by +default, this is +[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). +lmdb by default does an `fsync` on each write, on early tests it led to very +slow resynchronizations between nodes. We added 2 flags: +[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) +and +[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16) +which deactivate fsync. On sqlite, it is also possible to deactivate fsync with +`pragma synchronous = off;`, but we did not start any optimization work on it: +our sqlite implementation fsync all the data on the disk. Additionally, we are +using these engines through a Rust binding that had to do some tradeoff on the +concurrency part. **Our comparison will not reflect the raw performances of +these database engines, but instead, our integration choices.** + +Still, we think it makes sense to evaluate our implementations in their current +state in Garage. We designed a benchmark that is intensive on the metadata part +of the software, ie. handling tiny files. We chose again minio/warp but we +configure it with the smallest possible object size supported by warp, 256 +bytes, to put some pressure on the metadata engine. We evaluate sled twice: +with its default configuration, and with a configuration where we set a flush +interval of 10 minutes to disable fsync. + +*Note that S3 has not been designed for such small objects; a regular database, +like Cassandra, would be more appropriate for such workloads. This test has +only been designed to stress our metadata engine, it is not indicative of +real-world performances.* ![Plot of our metadata engines comparison with Warp](db_engine.png) -Unsurprisingly, we observe abysmal performances for sqlite, the engine we have the less tested and kept fsync for each write. -lmdb performs twice better than sled in its default version and 60% better than the "no fsync" version in our benchmark. -Furthermore, and not depicted on these plots, LMDB uses way less disk storage and RAM; we would like to quantify that in the future. -As we are only at the very beginning of our work on metadata engines, it is hard to draw strong conclusions. -Still, we can say that sqlite is not ready for production workloads, -LMDB looks very promising both in terms of performances and resource usage, -it is a very good candidate for Garage's default metadata engine in the future, -and we need to define a data policy for Garage that would help us arbitrate between performances and durability. - -*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occur, what is our policy in this case? For storage durability, we are already supposing that we never lose the storage of more than 2 nodes, should we also expect that we don't lose power on more than 2 nodes at the same time? What should we think about people hosting all their nodes at the same place without a UPS? Historically, it seems that Minio developers also accepted some compromises on this side ([#3536](https://github.com/minio/minio/issues/3536), [HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures only data and not metadata are persisted on disk - in combination with `O_DIRECT` for direct I/O ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* - -**Storing a million objects** - Object storage systems are designed not only for data durability and availability but also for scalability. -Following this observation, some people asked us how scalable Garage is. If answering this question is out of the scope of this study, we wanted to -be sure that our metadata engine would be able to scale to a million objects. To put this target in context, it remains small compared to other industrial solutions: -Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), which is 4 orders of magnitude more than our current target. Of course, their benchmarking setup has nothing in common with ours, and their tests are way more exhaustive. - -We wrote our own benchmarking tool for this test, [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2]. It concurrently sends a defined number of very tiny objects (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of times (128 by default) to effectively create a certain number of objects on the target cluster (1M by default). -On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. In the following plot, we show how many times it took to Garage and Minio to handle each batch. - -Before looking at the plot, **you must keep in mind some important points about Minio and Garage internals**. - -Minio has no metadata engine, it stores its objects directly on the filesystem. Sending 1 million objects on Minio results in creating one million inodes on the storage node in our current setup. So the performance of your filesystem will probably substantially impact the results you will observe; we know the filesystem we used is not adapted at all for Minio (encryption layer, fixed number of inodes, etc.). Additionally, we mentioned earlier that we deactivated fsync for our metadata engine, minio has some fsync logic here slowing down the creation of objects. Finally, object storage is designed for big objects: this cost is negligible with bigger objects. In the end, again, we use Minio as a reference to understand what is our performance budget for each part of our software. - -Conversely, Garage has an optimization for small objects. Below 3KB, a block is not created on the filesystem but the object is directly stored inline in the metadata engine. -In the future, we plan to evaluate how Garage behaves with 3KB+ objects at scale, probably way closer to Minio, as it will have to create an inode for each object. -For now, we limit ourselves to evaluating our metadata engine and thus focus only on 16-byte objects. +Unsurprisingly, we observe abysmal performances for sqlite, the engine we have +the less tested and kept fsync for each write. lmdb performs twice better than +sled in its default version and 60% better than the "no fsync" version in our +benchmark. Furthermore, and not depicted on these plots, LMDB uses way less +disk storage and RAM; we would like to quantify that in the future. As we are +only at the very beginning of our work on metadata engines, it is hard to draw +strong conclusions. Still, we can say that sqlite is not ready for production +workloads, LMDB looks very promising both in terms of performances and resource +usage, it is a very good candidate for Garage's default metadata engine in the +future, and we need to define a data policy for Garage that would help us +arbitrate between performances and durability. + +*To fsync or not to fsync? Performance is nothing without reliability, so we +need to better assess the impact of validating a write and then losing it. +Because Garage is a distributed system, even if a node loses its write due to a +power loss, it will fetch it back from the 2 other nodes storing it. But rare +situations where 1 node is down and the 2 others validated the write and then +lost power can occur, what is our policy in this case? For storage durability, +we are already supposing that we never lose the storage of more than 2 nodes, +should we also expect that we don't lose power on more than 2 nodes at the same +time? What should we think about people hosting all their nodes at the same +place without a UPS? Historically, it seems that Minio developers also accepted +some compromises on this side +([#3536](https://github.com/minio/minio/issues/3536), +[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to +use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures +only data and not metadata are persisted on disk - in combination with +`O_DIRECT` for direct I/O +([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), +[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* + +**Storing a million objects** - Object storage systems are designed not only +for data durability and availability but also for scalability. Following this +observation, some people asked us how scalable Garage is. If answering this +question is out of the scope of this study, we wanted to be sure that our +metadata engine would be able to scale to a million objects. To put this +target in context, it remains small compared to other industrial solutions: +Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), +which is 4 orders of magnitude more than our current target. Of course, their +benchmarking setup has nothing in common with ours, and their tests are way +more exhaustive. + +We wrote our own benchmarking tool for this test, +[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2]. +It concurrently sends a defined number of very tiny objects (8192 objects of 16 +bytes by default) and measures the time it took. It repeats this step a given +number of times (128 by default) to effectively create a certain number of +objects on the target cluster (1M by default). On our local setup with 3 +nodes, both Minio and Garage with LMDB were able to achieve this target. In the +following plot, we show how many times it took to Garage and Minio to handle +each batch. + +Before looking at the plot, **you must keep in mind some important points about +Minio and Garage internals**. + +Minio has no metadata engine, it stores its objects directly on the filesystem. +Sending 1 million objects on Minio results in creating one million inodes on +the storage node in our current setup. So the performance of your filesystem +will probably substantially impact the results you will observe; we know the +filesystem we used is not adapted at all for Minio (encryption layer, fixed +number of inodes, etc.). Additionally, we mentioned earlier that we deactivated +fsync for our metadata engine, minio has some fsync logic here slowing down the +creation of objects. Finally, object storage is designed for big objects: this +cost is negligible with bigger objects. In the end, again, we use Minio as a +reference to understand what is our performance budget for each part of our +software. + +Conversely, Garage has an optimization for small objects. Below 3KB, a block is +not created on the filesystem but the object is directly stored inline in the +metadata engine. In the future, we plan to evaluate how Garage behaves with +3KB+ objects at scale, probably way closer to Minio, as it will have to create +an inode for each object. For now, we limit ourselves to evaluating our +metadata engine and thus focus only on 16-byte objects. ![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png) -It appears that the performances of our metadata engine are acceptable, as we have a comfortable margin compared to Minio (Minio is between 3x and 4x times slower per batch). -We also note that, past 200k objects, Minio batch completion time is constant as Garage's one remains linear: it could be interesting to know if Garage batch's completion time -would cross Minio's one for a very large number of objects. -If we reason per object, both Minio and Garage performances remain very good: it takes respectively around 20ms and 5ms to create an object. -At 100 Mbps, if you upload a 10MB file, the upload will take 800ms, for a 100MB file, it goes up to 8sec; in both cases handling the object metadata is only a fraction of the upload time. -The only cases where you could notice it would be if you upload a lot of very small files at once, which again, is an unusual usage of the S3 API. +It appears that the performances of our metadata engine are acceptable, as we +have a comfortable margin compared to Minio (Minio is between 3x and 4x times +slower per batch). We also note that, past 200k objects, Minio batch +completion time is constant as Garage's one remains linear: it could be +interesting to know if Garage batch's completion time would cross Minio's one +for a very large number of objects. If we reason per object, both Minio and +Garage performances remain very good: it takes respectively around 20ms and +5ms to create an object. At 100 Mbps, if you upload a 10MB file, the +upload will take 800ms, for a 100MB file, it goes up to 8sec; in both cases +handling the object metadata is only a fraction of the upload time. The +only cases where you could notice it would be if you upload a lot of very +small files at once, which again, is an unusual usage of the S3 API. Next, we focus on Garage's data only to better see its specific behavior: ![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png) -Two effects are now more visible: 1. batch completion time is linear with the number of objects in the bucket and 2. measurements are dispersed, at least more than Minio. -We discussed the first point previously but not the second one on measurement dispersion. -This instability could be an issue as it could be a symptom of what we saw with some other experiments in this machine: sometimes it freezes under heavy I/O operations. -Such freezes could lead to request timeouts and failures. If it occurs on our testing computer, it will occur on other servers too: it could be interesting to better understand this issue, -document how to avoid it, or change how we handle our I/O. -At the same time, this was a very stressful test that will probably not be encountered in many setups: we were adding 273 objects per second for 30 minutes! - -To conclude this part, Garage can ingest 1 million tiny objects while remaining usable on our local setup. To put this result in perspective, our production cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with 116k objects. This bucket contains real data: it is used by our Matrix instance to store people's media files (profile pictures, shared pictures, videos, audios, documents...). Thanks to this benchmark, we have identified two points of vigilance: putting object duration seems linear with the number of existing objects in the cluster, and we have some volatility in our measured data that could be a symptom of our system freezing under the load. Despite these two points, we are confident that Garage could scale way above 1M+ objects, but it remains to be proved! +Two effects are now more visible: 1. batch completion time is linear with the +number of objects in the bucket and 2. measurements are dispersed, at least +more than Minio. We discussed the first point previously but not the second +one on measurement dispersion. This instability could be an issue as it could +be a symptom of what we saw with some other experiments in this machine: +sometimes it freezes under heavy I/O operations. Such freezes could lead to +request timeouts and failures. If it occurs on our testing computer, it will +occur on other servers too: it could be interesting to better understand this +issue, document how to avoid it, or change how we handle our I/O. At the same +time, this was a very stressful test that will probably not be encountered in +many setups: we were adding 273 objects per second for 30 minutes! + +To conclude this part, Garage can ingest 1 million tiny objects while remaining +usable on our local setup. To put this result in perspective, our production +cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with +116k objects. This bucket contains real data: it is used by our Matrix instance +to store people's media files (profile pictures, shared pictures, videos, +audios, documents...). Thanks to this benchmark, we have identified two points +of vigilance: putting object duration seems linear with the number of existing +objects in the cluster, and we have some volatility in our measured data that +could be a symptom of our system freezing under the load. Despite these two +points, we are confident that Garage could scale way above 1M+ objects, but it +remains to be proved! ## In an unpredictable world, stay resilient -Supporting a variety of network properties and computers, especially ones that were not designed for software-defined storage or even server purposes, is the core value proposition of Garage. -For example, our production cluster is hosted [on refurbished Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) behind consumer-grade fiber links across France and Belgium - if you are reading this, congratulation, you fetched this webpage from it! That's why we are very careful that our internal protocol (named RPC protocol in our documentation) remains as lightweight as possible. For this analysis, we quantify how network latency and the number of nodes in the cluster impact S3 main requests duration. - -**Latency amplification** - With the kind of networks we use (consumer-grade fiber links across the EU), the observed latency is in the 50ms range between nodes. When latency is not negligible, you will observe that request completion time is a factor of the observed latency. That's expected: in many cases, the node of the cluster you are contacting can not directly answer your request, it needs to reach other nodes of the cluster to get your information. Each sequential RPC adds to the final S3 request duration, which can quickly become expensive. -This ratio between request duration and network latency is what we refer to as *latency amplification*. - -For example, on Garage, a GetObject request does two sequential calls: first, it asks for the descriptor of the requested object containing the block list of the requested object, then it retrieves its blocks. -We can expect that the request duration of a small GetObject request will be close to twice the network latency. - -We tested this theory with another benchmark of our own named [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) which does a single request at a time on an endpoint and measures its response time. As we are not interested in bandwidth but latency, all our requests involving an object are made on a tiny file of around 16 bytes. Our benchmark tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and RemoveObject. Its results are plotted here: +Supporting a variety of network properties and computers, especially ones that +were not designed for software-defined storage or even server purposes, is the +core value proposition of Garage. For example, our production cluster is +hosted [on refurbished Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) +behind consumer-grade fiber links across France and Belgium - if you are reading this, +congratulation, you fetched this webpage from it! That's why we are very +careful that our internal protocol (named RPC protocol in our documentation) +remains as lightweight as possible. For this analysis, we quantify how network +latency and the number of nodes in the cluster impact S3 main requests +duration. + +**Latency amplification** - With the kind of networks we use (consumer-grade +fiber links across the EU), the observed latency is in the 50ms range between +nodes. When latency is not negligible, you will observe that request completion +time is a factor of the observed latency. That's expected: in many cases, the +node of the cluster you are contacting can not directly answer your request, it +needs to reach other nodes of the cluster to get your information. Each +sequential RPC adds to the final S3 request duration, which can quickly become +expensive. This ratio between request duration and network latency is what we +refer to as *latency amplification*. + +For example, on Garage, a GetObject request does two sequential calls: first, +it asks for the descriptor of the requested object containing the block list of +the requested object, then it retrieves its blocks. We can expect that the +request duration of a small GetObject request will be close to twice the +network latency. + +We tested this theory with another benchmark of our own named +[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) +which does a single request at a time on an endpoint and measures its response +time. As we are not interested in bandwidth but latency, all our requests +involving an object are made on a tiny file of around 16 bytes. Our benchmark +tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and +RemoveObject. Its results are plotted here: ![Latency amplification](amplification.png) -As Garage has been optimized for this use case from the beginning, we don't see any significant evolution from one version to another (garage v0.7.3 and garage v0.8.0 beta here). -Compared to Minio, these values are either similar (for ListObjects and ListBuckets) or way better (for GetObject, PutObject, and RemoveObject). -It is understandable: Minio has not been designed for environments with high latencies, you are expected to build your clusters in the same datacenter, and then possibly connect them with their asynchronous [Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) feature. - -*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) but it is even more sensitive to latency: "Multi-site replication has increased latency sensitivity, as MinIO does not consider an object as replicated until it has synchronized to all configured remote targets. Replication latency is therefore dictated by the slowest link in the replication mesh."* - -**A cluster with many nodes** - Whether you already have many compute nodes with unused storage, need to store a lot of data, or experiment with unusual system architecture, you might want to deploy a hundredth of Garage nodes. However, in some distributed systems, the number of nodes in the cluster will impact performance. Theoretically, our protocol inspired by distributed hashtables (DHT) should scale fairly well but we never took the time to test it with a hundredth of nodes before. - -This time, we did our test directly on Grid5000 with 6 physical servers spread in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up to 65 instances of Garage simultaneously (for a total of 390 nodes). The network between the physical server is the dedicated network provided by Grid5000 operators. Nodes on the same physical machine communicate directly through the Linux network stack without any limitation: we are aware this is a weakness of this test. We still think that this test can be relevant as, at each step in the test, each instance of Garage has 83% (5/6) of its connections that are made over a real network. To benchmark each cluster size, we used [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) again: +As Garage has been optimized for this use case from the beginning, we don't see +any significant evolution from one version to another (garage v0.7.3 and garage +v0.8.0 beta here). Compared to Minio, these values are either similar (for +ListObjects and ListBuckets) or way better (for GetObject, PutObject, and +RemoveObject). It is understandable: Minio has not been designed for +environments with high latencies, you are expected to build your clusters in +the same datacenter, and then possibly connect them with their asynchronous +[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) +feature. + +*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) +but it is even more sensitive to latency: "Multi-site replication has increased +latency sensitivity, as MinIO does not consider an object as replicated until +it has synchronized to all configured remote targets. Replication latency is +therefore dictated by the slowest link in the replication mesh."* + +**A cluster with many nodes** - Whether you already have many compute nodes +with unused storage, need to store a lot of data, or experiment with unusual +system architecture, you might want to deploy a hundredth of Garage nodes. +However, in some distributed systems, the number of nodes in the cluster will +impact performance. Theoretically, our protocol inspired by distributed +hashtables (DHT) should scale fairly well but we never took the time to test it +with a hundredth of nodes before. + +This time, we did our test directly on Grid5000 with 6 physical servers spread +in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up +to 65 instances of Garage simultaneously (for a total of 390 nodes). The +network between the physical server is the dedicated network provided by +Grid5000 operators. Nodes on the same physical machine communicate directly +through the Linux network stack without any limitation: we are aware this is a +weakness of this test. We still think that this test can be relevant as, at +each step in the test, each instance of Garage has 83% (5/6) of its connections +that are made over a real network. To benchmark each cluster size, we used +[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) +again: ![Impact of response time with bigger clusters](complexity.png) -Up to 250 nodes observed response times remain constant. After this threshold, results become very noisy. -By looking at the server resource usage, we saw that their load started to become non-negligible: it seems that we are not hitting a limit on the protocol side but we have simply exhausted the resource of our testing nodes. In the future, we would like to run this experiment again, but on way more physical nodes, to confirm our hypothesis. -For now, we are confident that a Garage cluster with 100+ nodes should work. +Up to 250 nodes observed response times remain constant. After this threshold, +results become very noisy. By looking at the server resource usage, we saw +that their load started to become non-negligible: it seems that we are not +hitting a limit on the protocol side but we have simply exhausted the resource +of our testing nodes. In the future, we would like to run this experiment +again, but on way more physical nodes, to confirm our hypothesis. For now, we +are confident that a Garage cluster with 100+ nodes should work. ## Conclusion and Future work -During this work, we identified some sensitive points on Garage we will continue working on: our data durability target and interaction with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across our components, our new metadata engines (lmdb, sqlite) still need some testing and tuning, and we know that raw I/O (GetObject, PutObject) have a small improvement margin. - -At the same time, Garage has never been better: its next version (v0.8) will see drastic improvements in terms of performance and reliability. -We are confident that it is already able to cover a wide range of deployment needs, up to a hundredth of nodes and millions of objects. - -In the future, on the performance aspect, we would like to evaluate the impact of introducing an SRPT scheduler ([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), -define a data durability policy and implement it, and make a deeper and larger review of the state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to learn from them, -and finally, benchmark Garage at scale with possibly multiple terabytes of data and billions of objects on long-lasting experiments. - -In the meantime, stay tuned: we have released [a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1), and we are working -on proving and explaining our layout algorithm ([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also working on a Python SDK for Garage's administration API ([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will soon introduce officially a new API (as a technical preview) named K2V ([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). +During this work, we identified some sensitive points on Garage we will +continue working on: our data durability target and interaction with the +filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across +our components, our new metadata engines (lmdb, sqlite) still need some testing +and tuning, and we know that raw I/O (GetObject, PutObject) have a small +improvement margin. + +At the same time, Garage has never been better: its next version (v0.8) will +see drastic improvements in terms of performance and reliability. We are +confident that it is already able to cover a wide range of deployment needs, up +to a hundredth of nodes and millions of objects. + +In the future, on the performance aspect, we would like to evaluate the impact +of introducing an SRPT scheduler +([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data +durability policy and implement it, and make a deeper and larger review of the +state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to +learn from them, and finally, benchmark Garage at scale with possibly multiple +terabytes of data and billions of objects on long-lasting experiments. + +In the meantime, stay tuned: we have released +[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1), +and we are working on proving and explaining our layout algorithm +([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also +working on a Python SDK for Garage's administration API +([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will +soon introduce officially a new API (as a technical preview) named K2V +([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). ## Notes -[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know that it is also way more versatile. +[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) +existence. This tool is far more complex than our set of scripts, but we know +that it is also way more versatile. -[^ref2]: The program name contains the word "billion" and we only tested Garage up to 1 "million" object, this is not a typo, we were just a little bit too enthusiast when we wrote it. +[^ref2]: The program name contains the word "billion" and we only tested Garage +up to 1 "million" object, this is not a typo, we were just a little bit too +enthusiast when we wrote it.