From 7193a1cce9f9f10de13ad4c5847d7751c7ceb07f Mon Sep 17 00:00:00 2001
From: Alex Auvolat <alex@adnab.me>
Date: Mon, 20 Jun 2022 13:51:05 +0200
Subject: Write on IPFS vs. S3

---
 content/blog/2022-ipfs/index.md | 61 +++++++++++++++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 12 deletions(-)

(limited to 'content/blog/2022-ipfs')

diff --git a/content/blog/2022-ipfs/index.md b/content/blog/2022-ipfs/index.md
index 152e0a5..ec1649a 100644
--- a/content/blog/2022-ipfs/index.md
+++ b/content/blog/2022-ipfs/index.md
@@ -7,10 +7,12 @@ date=2022-06-09
 *Once you have spawned your Garage cluster, you might be interested in finding ways to share efficiently your content with the rest of the world,
 such as by joining federated platforms.
 In this blog post, we experiment with interconnecting the InterPlanetary File System (IPFS) daemon with Garage.
-We discuss the different bottlenecks and limitations of the software stack as it is currently available.*
+We discuss the different bottlenecks and limitations of the software stack in its current state.*
 
 <!-- more -->
 
+---
+
 
 <!--Garage has been designed to be operated inside the same "administrative area", ie. operated by a single organization made of members that fully trust each other.
 It is an intended design decision: trusting each other enables Garage to spread data over the machines instead of duplicating it.
@@ -36,19 +38,19 @@ And if nobody makes a copy of your content, you will loose it as soon as your no
 Furthermore, if you need multiple nodes to store your content, IPFS is not able to automatically place content on your nodes,
 enforce a given replication amount, check the integrity of your content, and so on.-->
 
-However, you would probably not rely on BitTorrent to durably store your encrypted holiday pictures you shared with your friends,
+However, you would probably not rely on BitTorrent to durably store the encrypted holiday pictures you shared with your friends,
 as content on the BitTorrent tends to vanish when no one in the network has a copy of it anymore. The same applies to IPFS.
-If at some time, everyone has a copy of the pictures on their hard disk, people might delete these copies after a while without you knowing it.
-You also can't easily collaborate to share this common treasure. For example, there is no automatic way to say that Alice and Bob
+Even if at some time everyone has a copy of the pictures on their hard disk, people might delete these copies after a while without you knowing it.
+You also can't easily collaborate on storing this common treasure. For example, there is no automatic way to say that Alice and Bob
 are in charge of storing the first half of the archive while Charlie and Eve are in charge of the second half.
 
 ➡️ **IPFS is designed to deliver content.**
 
-*Note: the IPFS project has another project named [IPFS Cluster](https://cluster.ipfs.io/) that allow servers to collaborate on hosting IPFS content.
+*Note: the IPFS project has another project named [IPFS Cluster](https://cluster.ipfs.io/) that allows servers to collaborate on hosting IPFS content.
 [Resilio](https://www.resilio.com/individuals/) and [Syncthing](https://syncthing.net/) both feature protocols inspired by BitTorrent to synchronize a tree of your file system between multiple computers.
 Reviewing these solutions is out of the scope of this article, feel free to try them by yourself!*
 
-Garage, on the contrary, is designed to spread automatically your content over all your available nodes, in a manner that makes the best possible use of your storage space.
+Garage, on the contrary, is designed to automatically spread your content over all your available nodes, in a manner that makes the best possible use of your storage space.
 At the same time, it ensures that your content is always replicated exactly 3 times across the cluster (or less if you change a configuration parameter),
 on different geographical zones when possible.
 <!--To access this content, you must have an API key, and have a correctly configured machine available over the network (including DNS/IP address/etc.). If the amount of traffic you receive is way larger than what your cluster can handle, your cluster will become simply unresponsive. Sharing content across people that do not trust each other, ie. who operate independant clusters, is not a feature of Garage: you have to rely on external software.-->
@@ -119,7 +121,8 @@ As a comparison, this whole webpage, with its pictures, triggers around 10 reque
 
 I think we can conclude that this first try was a failure.
 The S3 storage plugin for IPFS does too many request and would need some important work to be optimized.
-However, we should not give up too fast, because the people behind Peergos are known to run their software based on IPFS in production with an S3 backend.
+However, the people behind Peergos are known to run their IPFS-based software in production with an S3 backend,
+so we should not give up too fast.
 
 ## Try #2: Peergos over Garage
 
@@ -141,7 +144,7 @@ I was able to upload my file, see it in the interface, create a link to share it
 ![A screenshot of the Peergos interface](./upload.png)
 
 At the same time, the fans of my computer started to become a bit loud!
-A quick look at Grafana shows that Garage is still very busy:
+A quick look at Grafana again showed a very busy Garage:
 
 ![Screenshot of a grafana plot showing requests per second over time](./grafa.png) 
 <center><i>Legend: y axis = requests per 10 seconds on log(10) scale, x axis = time</i></center><p></p>
@@ -156,7 +159,7 @@ The `OPTIONS` HTTP verb is here because we use the direct access feature of Peer
 meaning that our browser is talking directly to Garage and has to use CORS to validate requests for security.
 
 Internally, IPFS splits files in blocks of less than 256 kB. My picture is thus split in 2 blocks, requiring 2 requests over Garage to fetch it.
-But even by knowing that IPFS split files in small blocks, I can't explain why we have so many `GetObject` requests.
+But even knowing that IPFS splits files in small blocks, I can't explain why we have so many `GetObject` requests.
 
 ## Try #3: Optimizing IPFS
 
@@ -165,7 +168,7 @@ Routing = dhtclient
 ![](./grafa2.png)
 -->
 
-We have seen in our 2 previous tries that the main source of load was the federation, and more especially, the DHT server.
+We have seen in our 2 previous tries that the main source of load was the federation, and in particular the DHT server.
 In this section, we'd like to artificially remove this problem from the equation by preventing our IPFS node from federating
 and see what pressure is put by Peergos alone on our local cluster.
 
@@ -198,9 +201,43 @@ From a theoretical perspective, it is still higher than the optimal number of re
 On S3, storing a file, downloading a file and listing available files are all actions that can be done in a single request.
 Even if all requests don't have the same cost on the cluster, processing a request has a non-negligible fixed cost.
 
-## S3 and IPFS are incompatible?
+## Are S3 and IPFS incompatible?
+
+Tweaking IPFS in order to try and make it work on an S3 backend is all well and good,
+but in some sense, the assumptions made by IPFS are fundamentally incompatible with using S3 as block storage.
+
+First, data on IPFS is split in relatively small chunks: all IPFS blocks must be less than 1 MB, with most being 256 KB or less.
+This means that large files or complex directory hierarchies will need thousands of blocks to be stored,
+each of which is mapped to a single object in the S3 storage back-end.
+On the other hand, S3 implementations such as Garage are made to handle very large objects efficiently,
+and they also provide their own primitives for rapidly listing all the objects present in a bucket or a directory.
+There is thus a huge loss in performance when data is stored in IPFS's block format, because this format does not
+take advantage of the optimizations provided by S3 back-ends in their standard usage scenarios. Instead, it
+requires storing and retrieving thousands of small S3 objects even for very simple operations such
+as retrieving a file or listing a directory, incurring a fixed overhead each time.
+
+This problem is compounded by the design of the IPFS data exchange protocol,
+in which a node may request any data block from any other node in the network
+in its quest to answer a user's request (like retrieving a file, etc.).
+When a node is missing a file or a directory it wants to read, it has to make as many requests to other nodes
+as there are IPFS blocks in the object to be read.
+On the receiving end, this means that any fully-fledged IPFS node has to answer large numbers
+of requests for blocks required by users everywhere on the network, which is what we observed in our experiment above.
+We were however surprised to observe that many requests coming from the IPFS network were for blocks
+of which our node had no local copy: this means that somewhere in the IPFS protocol, an overly optimistic
+assumption is made about where data could be found in the network, and this ends up translating into many requests
+between nodes that return negative results.
+When IPFS blocks are stored on a local filesystem, answering these requests fast might be possible.
+However, when using an S3 server as a storage back-end, this becomes prohibitively costly.
+
+If one wanted to design a distributed storage system for IPFS data blocks, they would probably need to start at a lower level.
+Garage itself makes use of a block storage mechanism that allows small-sized blocks to be stored on a cluster and accessed
+rapidly by nodes that need to access them.
+However, passing through the entire abstraction that provides an S3 API is wasteful and redundant, as this API is
+designed to provide advanced functionality such as mutating objects, associating metadata with objects, listing objects, etc.
+Plugging the IPFS daemon directly into a lower-level distributed block storage like
+Garage's might yield way better results by bypassing all of this complexity.
 
-*Text by Alex*
 
 ## Conclusion
 
-- 
cgit v1.2.3