+++
title="Confronting theoretical design with observed performances"
date=2022-09-26
+++


*During the past years, we have thought a lot about possible design decisions and
their theoretical trade-offs for Garage. In particular, we pondered the impacts
of data structures, networking methods, and scheduling algorithms.
Garage worked well enough for our production cluster at Deuxfleurs, but we also
knew that people had started to experience some unexpected behaviors, which
motivated us to run a round of benchmarks and performance measurements to see
how Garage behaves compared to our expectations.
This post presents some of our first results, which cover
3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
reflecting the high-level properties we are seeking.*

<!-- more -->

---

## ⚠️ Disclaimer

The results presented in this blog post must be taken with a (critical) grain of salt due to some
limitations that are inherent to any benchmarking endeavor. We try to list them as
exhaustively as possible here, but other limitations might exist.

Most of our tests were made on _simulated_ networks, which by definition cannot represent all the
diversity of _real_ networks (dynamic drop, jitter, latency, all of which could be
correlated with throughput or any other external event). We also limited
ourselves to very small workloads that are not representative of a production
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
our results are not an evaluation of the performance of Garage as a whole.

For some benchmarks, we used Minio as a reference. It must be noted that we did
not try to optimize its configuration as we did for Garage, and more
generally, we have significantly less knowledge of Minio's internals than of Garage's, which could lead
to underrated performance measurements for Minio. It must also be noted that
Garage and Minio are systems with different feature sets: for instance, Minio supports
erasure coding for higher data density while Garage doesn't, and Minio implements
many more S3 endpoints than Garage. Such features necessarily have a cost
that you must keep in mind when reading the plots we will present.
You should consider
Minio's results as a way to contextualize Garage's numbers and to check that our improvements
are not simply artificial in the light of existing object storage implementations.

The impact of the testing environment is also not evaluated (kernel patches,
configuration, parameters, filesystem, hardware configuration, etc.). Some of
these parameters could favor one configuration or software product over another.
In particular, it must be noted that most of the tests were done on a
consumer-grade PC with only an SSD, which is different from most
production setups. Finally, our results are also provided without statistical
tests to validate their significance, and might have insufficient grounds
to be claimed as reliable.

When reading this post, please keep in mind that **we are not making any
business or technical recommendations here, and this is not a scientific paper
either**; we only share bits of our development process as honestly as
possible.
Run your own tests if you need to make a decision,
remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
and to remain supportive and caring with your peers ;)

## About our testing environment

We made a first batch of tests on
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
testbed for experiment-driven research in all areas of computer science,
which has an
[open access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
During our tests, we used part of the following clusters:
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to build a
geo-distributed topology. We used the Grid5000 testbed only during our
preliminary tests to identify issues when running Garage on many powerful
servers. We then reproduced these issues in a controlled environment
outside of Grid5000, so don't be
surprised if Grid5000 is not always mentioned on our plots.

To reproduce some environments locally, we have a small set of Python scripts
called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
needs[^ref1]. Most of the following tests were run locally with `mknet` on a
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
RAM and a 512GB SSD. In terms of software, we used NixOS 22.05 with the 5.15.50
kernel and an encrypted ext4 filesystem. `vm.dirty_background_ratio` and
`vm.dirty_ratio` have been reduced to `2` and `1` respectively: with the default
values, the system tends to freeze under heavy I/O load.

## Efficient I/O

The main purpose of an object storage system is to store and retrieve objects
across the network, and the faster these two functions can be accomplished,
the more efficient the system as a whole will be. For this analysis, we focus on
2 aspects of performance. First, since many applications can start processing a file
before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
on `GetObject` requests, i.e. the duration between the moment a request is sent
and the moment when the first bytes of the returned object are received by the client.
Second, we will evaluate generic throughput, to understand how well
Garage can leverage the underlying machine's performance.
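Both metrics are measured from the client's point of view. As a rough illustration of
what that means in practice (this is not the `s3ttfb` or `warp` tooling used below; the
endpoint, credentials, bucket and key are placeholders), TTFB and throughput can be
approximated with any S3 SDK by timing the first readable byte and then the full
download of a `GetObject` response:

```python
import time
import boto3

# Placeholder endpoint and credentials: adapt to your own cluster.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
)

t0 = time.monotonic()
body = s3.get_object(Bucket="test-bucket", Key="some-object")["Body"]

first = body.read(1)                      # wait for the first byte of the payload
ttfb = time.monotonic() - t0

rest = body.read()                        # then drain the rest of the object
total = time.monotonic() - t0

size = len(first) + len(rest)
print(f"TTFB: {ttfb * 1000:.1f} ms")
print(f"throughput: {size / total / 1e6:.1f} MB/s over {size} bytes")
```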
**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
web endpoints, with the idea of making it a platform of choice for publishing
static websites. When publishing a website, TTFB can be directly observed
by the end user, as it will impact the perceived reactivity of the page being loaded.

Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
This can be explained by the fact that Garage was not able to handle data internally
at a granularity smaller than entire data blocks, which are up to 1MB chunks of a given object
(a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
Let us take the example of a 4.5MB object, which Garage will split by default into four 1MB blocks and one 0.5MB block.
With the old design, when you sent a `GET`
request, the first block had to be _fully_ retrieved by the gateway node from the
storage node before it could start sending any data to the client.

With Garage v0.8, we added a data streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block to be received from
the storage node. We can visually represent the difference as follows:

<center>
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
</center>

As our default block size is only 1MB, the difference should be marginal on
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
adding at most 8ms of latency to a `GetObject` request (assuming no other
data transfer is happening in parallel). However,
on a very slow network, or on a very congested link with many parallel requests
being handled, the impact can be much more significant: on a 5Mbps network, it takes at least 1.6 seconds
to transfer our 1MB block, and streaming will heavily improve user experience.

We wanted to see if this theory holds in practice: we simulated a low-latency
but slow network using `mknet` and issued some requests with block streaming (Garage v0.8 beta) and
without (Garage v0.7.3). We also added Minio as a reference. To
benchmark this behavior, we wrote a small test named
[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
whose results are shown in the following figure:

![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)

Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
and 2s, which matches the full-block transfer time we calculated above.
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
streaming feature (the lowest value is 43ms). Minio sits between the two
Garage versions: we suppose that it does some form of batching, but smaller
than our 1MB default.
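These orders of magnitude are easy to re-derive: the non-streaming lower bound is
simply the time needed to push one full block over the link. A quick sanity check of
the 8ms and 1.6s figures above (assuming an ideal link with no protocol overhead):

```python
def block_transfer_time(block_size_bytes: int, bandwidth_bps: float) -> float:
    """Ideal time to transfer one block, ignoring protocol overhead."""
    return block_size_bytes * 8 / bandwidth_bps

BLOCK = 1_000_000  # 1MB default block size

print(f"1 Gbps: {block_transfer_time(BLOCK, 1e9) * 1000:.0f} ms")  # ~8 ms
print(f"5 Mbps: {block_transfer_time(BLOCK, 5e6):.1f} s")          # ~1.6 s
```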
**Throughput** - As soon as we publicly released Garage, people started
benchmarking it, comparing its performance to writing directly to the
filesystem, and observed that Garage was slower (e.g.
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
situation, we made some optimizations, such as moving costly processing like hashing to a dedicated thread,
and many others
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
to better control resource usage
and detect errors, including semaphores and timeouts, was artificially limiting
performance. In another iteration, we made this logic less restrictive at the
cost of higher resource consumption under load
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time we
write a block. We know that this is expensive and did a test build without any
`fsync` call ([see the
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
that will not be merged, only to assess the impact of `fsync`. We refer to it
as `no-fsync` in the following plot.

*A note about `fsync`: for performance reasons, operating systems often do not
write directly to the disk when a process creates or updates a file in your
filesystem. Instead, the write is kept in memory, and flushed later in a batch
with other writes. If a power loss occurs before the OS has had time to flush
data to disk, some writes will be lost. To ensure that a write is effectively
written to disk, the
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
which blocks until the file or directory on which it is called has been flushed from volatile
memory to the persistent storage device. Additionally, the exact semantics of
`fsync` [differ from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
and, even on battle-tested software like Postgres, it was
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
Note that in Garage, we are still working on our `fsync` policy and thus, for
now, you should expect limited data durability in case of power loss, as we are
aware of some inconsistencies on this point (which we describe in the following
and plan to solve).*
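To make the trade-off concrete, here is a minimal sketch (in Python, purely as an
illustration, not Garage's actual Rust write path) of what a durable block write
involves; without the two `fsync` calls, the data and the new directory entry may sit
in the page cache and be lost on power failure:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write a file and only return once it has reached stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)      # block until the kernel has flushed this file to disk
    finally:
        os.close(fd)

    # For a newly created file, the directory entry must be made durable too,
    # hence a second fsync on the parent directory.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

durable_write("block.bin", b"some block data")
```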
To assess performance improvements, we used the benchmark tool
[minio/warp](https://github.com/minio/warp) in a non-standard configuration,
adapted for small-scale tests, and we kept only the aggregated result named
"cluster total". The goal of this experiment is to get an idea of the cluster
performance with a standardized and mixed workload.

![Plot showing IO performances of Garage configurations and Minio](io.png)

Minio, our reference point, gives us the best performance in this test.
Looking at Garage, we observe that each improvement we made had a visible
impact on performance. We also note that we still have a margin for improvement
compared to Minio: additional benchmarks, tests, and
monitoring could help us better understand the remaining gap.


## A myriad of objects

Object storage systems do not handle a single object but huge numbers of them:
Amazon claims to handle trillions of objects on their platform, and Red Hat
touts Ceph as being able to handle 10 billion objects. All these
objects must be tracked efficiently in the system to be fetched, listed,
removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how
well the best one scales to a million objects.

**Testing metadata engines** - With Garage, we chose not to store metadata
directly on the filesystem, like Minio for example, but in a specialized on-disk
B-tree data structure; in other words, in an embedded database engine. Until now,
the only supported option was [sled](https://sled.rs/), but we started having
serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction layer over the features we expect from our
database, allowing us to switch from one metadata back-end to another without touching
the rest of our codebase. We added two additional back-ends: LMDB
(through [heed](https://github.com/meilisearch/heed)) and SQLite
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: contrarily to sled, we have yet to run them in production
for a significant time.**

Similarly to the impact of `fsync` on block writing, each database engine we use
has its own `fsync` policy. Sled flushes its writes every 2 seconds by
default (this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
LMDB defaults to an `fsync` on each write, which in early tests led to
abysmal performance. We thus added 2 flags,
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
to deactivate `fsync` entirely. On SQLite, it is also possible to deactivate `fsync` with
`pragma synchronous = off`, but we have not started any optimization work on it yet:
our SQLite implementation currently still calls `fsync` for all write operations. Additionally, we are
using these engines through Rust bindings that do not support async Rust,
with which Garage is built, which has an impact on performance as well.
**Our comparison will therefore not reflect the raw performance of
these database engines, but rather our integration choices.**
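Garage drives LMDB through the `heed` Rust crate, but the durability knobs mentioned
above are easy to demonstrate with any binding. The sketch below (using the Python
`lmdb` and `sqlite3` modules, purely as an illustration and not Garage code) shows the
settings that trade durability for write throughput:

```python
import sqlite3
import lmdb  # pip install lmdb

# LMDB: sync=False / metasync=False map to MDB_NOSYNC / MDB_NOMETASYNC,
# i.e. no fsync on commit -- much faster, but a power loss can drop recent writes.
env = lmdb.open("meta.lmdb", map_size=1 << 30, sync=False, metasync=False)
with env.begin(write=True) as txn:
    txn.put(b"bucket/object/key", b"object metadata goes here")

# SQLite: the equivalent trade-off is PRAGMA synchronous = OFF.
db = sqlite3.connect("meta.sqlite")
db.execute("PRAGMA synchronous = OFF")
db.execute("CREATE TABLE IF NOT EXISTS meta (k TEXT PRIMARY KEY, v BLOB)")
db.execute("INSERT OR REPLACE INTO meta VALUES (?, ?)",
           ("bucket/object/key", b"object metadata goes here"))
db.commit()
```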
Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part
of the software, i.e. that handles large numbers of tiny files. We chose again
`minio/warp` as a benchmark tool, but we
configured it with the smallest object size it supports, 256
bytes, to put pressure on the metadata engine. We evaluated sled twice:
with its default configuration, and with a configuration where we set a flush
interval of 10 minutes (longer than the test) to disable `fsync`.

*Note that S3 has not been designed for workloads that store huge numbers of small objects;
a regular database, like Cassandra, would be more appropriate. This test has
only been designed to stress our metadata engine and is not indicative of
real-world performance.*

![Plot of our metadata engines comparison with Warp](db_engine.png)

Unsurprisingly, we observe abysmal performance with SQLite, the engine we have
not worked on yet and which still does an `fsync` for each write. Garage with the
`fsync`-disabled LMDB backend performs twice as well as with sled in its default
configuration, and 60% better than the "no `fsync`" sled version in our
benchmark. Furthermore, and not depicted on these plots, LMDB uses far less
disk storage and RAM; we would like to quantify that in the future. As we are
only at the very beginning of our work on metadata engines, it is hard to draw
strong conclusions. Still, we can say that SQLite is not ready for production
workloads, and that LMDB looks very promising both in terms of performance and resource
usage; it is a very good candidate for becoming Garage's default metadata engine in
future releases, once we figure out the proper `fsync` tuning. In the future, we will need to define a data durability policy for Garage to help us
arbitrate between performance and durability.

*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
need to better assess the impact of possibly losing a write after it has been validated.
Because Garage is a distributed system, even if a node loses its write due to a
power loss, it will fetch it back from the 2 other nodes that store it. But rare
situations can occur where 1 node is down and the 2 others validate the write and then
lose power before having had time to flush to disk. What is our policy in this case? For storage durability,
we already assume that we never lose the storage of more than 2 nodes,
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
time? What should we do about people hosting all of their nodes in the same
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
some compromises on this side
([#3536](https://github.com/minio/minio/issues/3536),
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
use a combination of `O_DSYNC` and `fdatasync(3p)` - a variant of `fsync` that ensures
only data, and not metadata, is persisted to disk - in combination with
`O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in Minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*

**Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability, so naturally,
some people asked us how scalable Garage is. While giving a definitive answer to this
question is out of the scope of this study, we wanted to be sure that our
metadata engine would be able to scale to a million objects. To put this
target in context, it remains small compared to other industrial solutions:
Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview),
which is 4 orders of magnitude more than our current target. Of course, their
benchmarking setup has nothing in common with ours, and their tests are way
more exhaustive.

We wrote our own benchmarking tool for this test,
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
The benchmark procedure consists in
concurrently sending a defined number of tiny objects (8192 objects of 16
bytes by default) and measuring the wall-clock time until the last object upload completes. This step is then repeated a given
number of times (128 by default) to effectively create a target number of
objects on the cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to reach this target. In the
following plot, we show how much time it took Garage and Minio to handle
each batch.
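In spirit, the procedure boils down to the following simplified sketch (boto3 and a
thread pool, with placeholder endpoint and credentials; the actual `s3billion` tool
differs in its details):

```python
import time
from concurrent.futures import ThreadPoolExecutor
import boto3

OBJ_PER_BATCH = 8192
BATCHES = 128
PAYLOAD = b"0123456789abcdef"   # 16-byte objects

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",   # placeholder endpoint and credentials
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
)

def put_one(key: str) -> None:
    s3.put_object(Bucket="bench", Key=key, Body=PAYLOAD)

with ThreadPoolExecutor(max_workers=64) as pool:
    for batch in range(BATCHES):
        t0 = time.monotonic()
        keys = (f"batch{batch:03d}/obj{i:05d}" for i in range(OBJ_PER_BATCH))
        list(pool.map(put_one, keys))   # send the batch and wait for the last upload
        print(f"batch {batch}: {time.monotonic() - t0:.1f}s")
```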
Before looking at the plot, **you must keep in mind some important points regarding
the internals of both Minio and Garage**.

Minio has no metadata engine: it stores its objects directly on the filesystem.
Sending 1 million objects to Minio results in creating one million inodes on
the storage server in our current setup. So the performance of the filesystem
probably has a substantial impact on the observed results.
In our precise setup, we know that the
filesystem we used is not adapted at all to Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here, slowing down the
creation of objects. Finally, object storage is designed for big objects, for which the
costs measured here are negligible. In the end, again, we use Minio only as a
reference point to understand what performance budget we have for each part of our
software.

Conversely, Garage has an optimization for small objects: below 3KB, a separate file is
not created on the filesystem, but the object is directly stored inline in the
metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
objects above 3KB, where we expect results way closer to Minio's, as Garage will then have to create
at least one inode per object. For now, we limit ourselves to evaluating our
metadata engine and focus only on 16-byte objects.

![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)

It appears that the performance of our metadata engine is acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3x and 4x
slower per batch). We also note that, past the 200k-object mark, Minio's
time to complete a batch of inserts is constant, while on Garage it still increases over the observed range.
It would be interesting to know whether Garage's batch completion time would cross Minio's
for a very large number of objects. If we reason per object, both Minio's and
Garage's performance remains very good: it takes respectively around 20ms and
5ms to create an object. In a real-world scenario, at 100 Mbps, the upload of a 10MB file takes
800ms, and goes up to 8s for a 100MB file: in both cases,
handling the object metadata represents only a fraction of the upload time. The
only case where the difference would be noticeable is when uploading many very
small files at once, which again would be an unusual usage of the S3 API.
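A quick computation makes this "fraction of the upload time" concrete (assuming an
ideal 100 Mbps link and taking the per-object metadata overheads measured above):

```python
def upload_time_s(size_bytes: int, bandwidth_bps: float = 100e6) -> float:
    return size_bytes * 8 / bandwidth_bps

for size, label in [(10_000_000, "10MB"), (100_000_000, "100MB")]:
    t = upload_time_s(size)
    for engine, overhead in [("Minio", 0.020), ("Garage", 0.005)]:
        share = overhead / (t + overhead)
        print(f"{label} upload: {t:.1f}s, {engine} metadata overhead: {share:.1%}")
```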
Let us now focus on Garage's metrics only to better see its specific behavior:

![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)

Two effects are now more visible: 1. batch completion time increases with the
number of objects in the bucket, and 2. measurements are scattered, at least
more than for Minio. We expected this increase in batch completion time to be logarithmic,
but we don't have enough data points to conclude confidently that it is the case: additional
measurements are needed. Concerning the observed instability, it could
be a symptom of what we saw in some other experiments on this setup,
which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If this occurs on our testing computer, it might
occur on other servers as well: it would be interesting to better understand this
issue, document how to avoid it, and potentially change how we handle I/O
internally in Garage. But still, this was a very heavy test that will probably not be encountered in
many setups: we were adding 273 objects per second for 30 minutes straight!

To conclude this part, Garage can ingest 1 million tiny objects while remaining
usable on our local setup. To put this result in perspective, our production
cluster at [deuxfleurs.fr](https://deuxfleurs.fr) smoothly manages a bucket with
116k objects. This bucket contains real-world production data: it is used by our Matrix instance
to store people's media files (profile pictures, shared pictures, videos,
audio files, documents...). Thanks to this benchmark, we have identified two points
of vigilance: the increase of batch insert time with the number of existing
objects in the cluster over the observed range, and the volatility of our measurements, which
could be a symptom of our system freezing under load. Despite these two
points, we are confident that Garage could scale way above 1M objects, although
that remains to be proven.

## In an unpredictable world, stay resilient

Supporting a variety of real-world networks and computers, especially ones that
were not designed for software-defined storage or even for server purposes, is the
core value proposition of Garage. For example, our production cluster is
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
behind consumer-grade fiber links across France and Belgium (if you are reading this,
congratulations, you fetched this webpage from it!). That's why we are very
careful that our internal protocol (referred to as the "RPC protocol" in our documentation)
remains as lightweight as possible. For this analysis, we quantify how network
latency and the number of nodes in the cluster impact the duration of the most
important kinds of S3 requests.

**Latency amplification** - With the kind of networks we use (consumer-grade
fiber links across the EU), the observed latency between nodes is in the 50ms range.
When latency is not negligible, you will observe that request completion
time becomes a multiple of the observed latency. That's to be expected: in many cases, the
node of the cluster you are contacting cannot directly answer your request, and
has to reach other nodes of the cluster to get the data. Each
of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become
expensive. This ratio between request duration and network latency is what we
refer to as *latency amplification*.

For example, on Garage, a `GetObject` request does two sequential calls: first,
it fetches the descriptor of the requested object from the metadata engine, which contains a reference
to the first block of data, and only then, in a second step, can it start retrieving data blocks
from storage nodes. We can therefore expect the
duration of a small `GetObject` request to be close to twice the
network latency.
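Measuring this amplification boils down to timing single, sequential requests against
each endpoint. A minimal sketch of that idea (placeholder endpoint and credentials; not
our actual tooling, which is introduced just below):

```python
import time
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",   # placeholder endpoint and credentials
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
)

endpoints = {
    "ListBuckets":  lambda: s3.list_buckets(),
    "ListObjects":  lambda: s3.list_objects_v2(Bucket="bench"),
    "PutObject":    lambda: s3.put_object(Bucket="bench", Key="tiny", Body=b"x" * 16),
    "GetObject":    lambda: s3.get_object(Bucket="bench", Key="tiny")["Body"].read(),
    "RemoveObject": lambda: s3.delete_object(Bucket="bench", Key="tiny"),
}

for name, call in endpoints.items():
    t0 = time.monotonic()
    call()
    print(f"{name:>12}: {(time.monotonic() - t0) * 1000:.0f} ms")
```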
We tested the latency amplification theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat),
which does a single request at a time on an endpoint and measures its response
time. As we are interested in latency rather than bandwidth, all our requests
involving objects are made on tiny files of around 16 bytes. Our benchmark
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and
RemoveObject. Here are the results:

![Latency amplification](amplification.png)

As Garage has been optimized for this use case from the very beginning, we don't see
any significant evolution from one version to another (Garage v0.7.3 and Garage
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or significantly better (for GetObject, PutObject, and
RemoveObject). This is easily explained by the fact that Minio has not been designed for
environments with high latencies. Instead, it is expected to run on clusters that are built
within a single data center. In a multi-DC setup, different clusters could then be interconnected with their asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.

*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/),
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as Minio does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*


**A cluster with many nodes** - Whether you already have many compute nodes
with unused storage, need to store a lot of data, or are experimenting with unusual
system architectures, you might be interested in deploying over a hundred Garage nodes.
However, in some distributed systems, the number of nodes in the cluster has
an impact on performance. Theoretically, our protocol, which is inspired by distributed
hash tables (DHT), should scale fairly well, but until now, we never took the time to test it
with a hundred nodes or more.

This test was run directly on Grid5000 with 6 physical servers spread
across 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
to 65 instances of Garage simultaneously, for a total of 390 nodes. The
network between physical servers is the dedicated network provided by
the Grid5000 community. Nodes on the same physical machine communicate directly
through the Linux network stack without any limitation. We are aware that this is a
weakness of this test, but we still think it is relevant: at
each step of the test, each instance of Garage makes 83% (5/6) of its connections
over a real network. To measure performance for each cluster size, we used
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
again:


![Impact of response time with bigger clusters](complexity.png)

Up to 250 nodes, the observed response times remain constant. After this threshold,
results become very noisy. Looking at the servers' resource usage, we saw
that their load started to become non-negligible: it seems that we are not
hitting a limit on the protocol side, but have simply exhausted the resources
of our testing nodes. In the future, we would like to run this experiment
again, but on many more physical nodes, to confirm our hypothesis. For now, we
are confident that a Garage cluster with 100+ nodes should work.
## Conclusion and Future work

During this work, we identified some sensitive points in Garage
on which we will have to continue working: our data durability target and our interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet homogeneous across
our components; our new metadata engines (LMDB, SQLite) still need some testing
and tuning; and we know that raw I/O performance (GetObject and PutObject for large objects) still has some
margin for improvement.

At the same time, Garage has never been in better shape: its next version (version 0.8) will
bring drastic improvements in terms of performance and reliability. We are
confident that Garage is already able to cover a wide range of deployment needs, up
to over a hundred nodes and millions of objects.

In the future, on the performance side, we would like to evaluate the impact
of introducing an SRPT scheduler
([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data
durability policy and implement it, make a deeper and broader review of the
state of the art (Minio, Ceph, Swift, OpenIO, Riak CS, SeaweedFS, etc.) to
learn from them and, lastly, benchmark Garage at scale with possibly multiple
terabytes of data and billions of objects in long-lasting experiments.

In the meantime, stay tuned: we have released
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
and are already working on several features for the next version.
For instance, we are working on a new layout that will have enhanced optimality properties,
as well as a theoretical proof of correctness
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
working on a Python SDK for Garage's administration API
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
soon officially introduce a new API (as a technical preview) named K2V
([see K2V in our documentation for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).


## Notes

[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)'s
existence. Jepsen is far more complex than our set of scripts, but
it is also way more versatile.

[^ref2]: The program name contains the word "billion", although we only tested Garage
up to 1 million objects: this is not a typo, we were just a little bit too
enthusiastic when we wrote it ;)

<style>
.footnote-definition p { display: inline; }
</style>