Diffstat (limited to 'src/Technique/Développement/Garage.md')
-rw-r--r-- src/Technique/Développement/Garage.md | 36
1 files changed, 11 insertions, 25 deletions
diff --git a/src/Technique/Développement/Garage.md b/src/Technique/Développement/Garage.md
index 16fa635..6297ad3 100644
--- a/src/Technique/Développement/Garage.md
+++ b/src/Technique/Développement/Garage.md
@@ -2,7 +2,6 @@
Store pile of bytes in your garage.
-
## Context
Data storage is critical: it can lead to data loss if done badly and/or on hardware failure.
@@ -11,13 +10,13 @@ Moreover, it put a hard limit on scalability. Often this limit can be pushed bac
But here we consider non-specialized, off-the-shelf machines that can be as low-powered and failure-prone as a Raspberry Pi.
Distributed storage may help solve both the availability and scalability problems on these machines.
-Many solutions have been proposed; they can be categorized as block storage, file storage, and object storage, depending on the abstraction they provide.
+Many solutions have been proposed; they can be categorized as block storage, file storage, and object storage, depending on the abstraction they provide.
## Related work
-Block storage is the lowest-level one; it is like exposing your raw hard drive over the network.
-It requires very low latency and a stable network, which is often dedicated.
-However, it provides disk devices that the operating system can manipulate with the fewest constraints: they can be partitioned and formatted with any filesystem, so even the most exotic features are supported.
+Block storage is the lowest-level one; it is like exposing your raw hard drive over the network.
+It requires very low latency and a stable network, which is often dedicated.
+However, it provides disk devices that the operating system can manipulate with the fewest constraints: they can be partitioned and formatted with any filesystem, so even the most exotic features are supported.
We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel).
OpenStack Cinder proxies the previous solutions to provide a uniform API.
@@ -28,7 +27,7 @@ We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/201
OpenStack Manila proxies the previous solutions to provide a uniform API.
Finally, object storage provides the highest-level abstraction.
-It is evidence that the POSIX filesystem API is ill-suited to distributed filesystems.
+It is evidence that the POSIX filesystem API is ill-suited to distributed filesystems.
In particular, strong consistency has been dropped in favor of eventual consistency, which is far more convenient and powerful in the presence of high latency and unreliability.
S3, which pioneered the concept, is often described as a filesystem for the WAN.
Applications must be adapted to work with the desired object storage service.
@@ -43,7 +42,6 @@ There was many attempts in research too. I am only thinking to [LBFS](https://pd
- Cassandra (ScyllaDB) for metadata
- Own system using consistent hashing for data chunks
-
**Quentin:**
- no erasure coding, but checksums stored next to the files (or in the metadata)
@@ -67,14 +65,9 @@ _Remark 2_ Seafile idea has been stolen from this article: https://pdos.csail.mi
### Open questions
-
1. Does Cassandra support putting some tables on an SSD and others on a spinning disk?
2. Does Cassandra/ScyllaDB have an on-disk table format that does not fall apart completely when storing large blobs? (The SQLite developers wrote a whole article explaining that, even with their heavily optimized library, they consider that beyond, I believe, 4 kB it is more efficient to store blobs in separate files) - https://www.sqlite.org/intern-v-extern-blob.html
- 3. What block size? The idea is that we have WAN links whose bandwidth is not necessarily great, and it would be good for replicating one block to take on the order of a second at most, so that retries stay cheap, progress can be monitored better, etc. Exoscale uses 16 MB. LX proposes 1 MB.
-
-
-
-
+ 3. What block size? The idea is that we have WAN links whose bandwidth is not necessarily great, and it would be good for replicating one block to take on the order of a second at most, so that retries stay cheap, progress can be monitored better, etc. Exoscale uses 16 MB. LX proposes 1 MB.
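To make the block-size question more concrete, here is a rough back-of-the-envelope sketch comparing the two sizes mentioned above (16 MB and 1 MB). The 10 Mbit/s link speed is purely an assumption for illustration and does not come from the notes; protocol overhead and latency are ignored.

```python
def transfer_seconds(block_bytes: int, link_bits_per_second: float) -> float:
    """Ideal time to push one block over a link, ignoring overhead and latency."""
    return block_bytes * 8 / link_bits_per_second


MB = 10**6  # decimal megabyte

# Assumed WAN uplink speed (hypothetical, for illustration only).
ASSUMED_LINK_BPS = 10 * 10**6  # 10 Mbit/s

if __name__ == "__main__":
    for size in (1 * MB, 16 * MB):
        seconds = transfer_seconds(size, ASSUMED_LINK_BPS)
        print(f"{size // MB:>2} MB block: {seconds:.1f} s")
```

Under that assumed link speed, only the smaller block size stays within the "about one second" target; a faster link shifts the trade-off accordingly.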
#### Modules
@@ -85,7 +78,6 @@ _Remark 2_ Seafile idea has been stolen from this article: https://pdos.csail.mi
- `api/`: S3 API
- `web/`: web management interface
-
#### Metadata tables
**Objects:**
@@ -133,7 +125,6 @@ Workflow for PUT:
6. Return success to the user
7. Launch a background job to check and delete older versions
-
Workflow for DELETE:
1. Check write permission (LDAP)
@@ -148,12 +139,10 @@ To delete a version:
3. Delete all of the blocks from the blocks table
4. Finally, delete the version from the objects table
-
Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read.
("Let P be a problem; 'we don't care' is a solution to this problem")
-
#### Block storage on disk
**Blocks themselves:**
@@ -161,7 +150,7 @@ Known issue: if someone is reading from a version that we want to delete and the
- file path = /blobs/(first 3 hex digits of hash)/(rest of hash)
**Reverse index for GC & other block-level metadata:**
-
+
- file path = /meta/(first 3 hex digits of hash)/(rest of hash)
- map block hash -> set of version UUIDs where it is referenced
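The path layout above can be sketched as follows. This is only an illustration: the notes do not say which hash function is used, so SHA-256 is an assumption here, and `block_paths` and its `root` parameter are hypothetical names.

```python
import hashlib
from pathlib import PurePosixPath


def block_paths(data: bytes, root: str = "/") -> tuple[PurePosixPath, PurePosixPath]:
    """Return (blob path, metadata path) for a block of data.

    Assumption: blocks are addressed by the SHA-256 hex digest of their content.
    """
    digest = hashlib.sha256(data).hexdigest()
    # The first 3 hex digits name the directory, the rest names the file.
    prefix, rest = digest[:3], digest[3:]
    base = PurePosixPath(root)
    return base / "blobs" / prefix / rest, base / "meta" / prefix / rest


if __name__ == "__main__":
    blob, meta = block_paths(b"hello garage")
    print(blob)
    print(meta)
```

Using 3 hex digits as the directory name spreads blocks over at most 16^3 = 4096 subdirectories, which keeps any single directory from growing too large.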
@@ -194,7 +183,6 @@ Rebalancing: takes as argument the list of newly added nodes.
Only one rebalancing can run at a time. It can be restarted from the beginning with new parameters.
-
#### Membership management
Two sets of nodes:
@@ -231,11 +219,9 @@ Number K of tokens per node: decided by the operator & stored in the operator's
- Ping interval: 10 seconds
- ??
-
#### Links
- - CDC: <https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf>
- - Erasure coding: <http://web.eecs.utk.edu/~jplank/plank/papers/CS-08-627.html>
- - [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html)
- - [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf)
-
+- CDC: <https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf>
+- Erasure coding: <http://web.eecs.utk.edu/~jplank/plank/papers/CS-08-627.html>
+- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html)
+- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf)