aboutsummaryrefslogtreecommitdiff
path: root/content/documentation/working_documents
diff options
context:
space:
mode:
authorsptaule <lecas83@gmail.com>2022-01-24 20:58:36 +0100
committersptaule <lecas83@gmail.com>2022-01-24 20:58:36 +0100
commitdc932f973593747697b98aa612a308df889741ab (patch)
treed1b88306af8373d650161e845c26690a52e32d29 /content/documentation/working_documents
parent336fd3f7566c7872c17b30f7cd150e1d16899205 (diff)
downloadgaragehq.deuxfleurs.fr-dc932f973593747697b98aa612a308df889741ab.tar.gz
garagehq.deuxfleurs.fr-dc932f973593747697b98aa612a308df889741ab.zip
Added doc book
Diffstat (limited to 'content/documentation/working_documents')
-rw-r--r--content/documentation/working_documents/compatibility_target.md105
-rw-r--r--content/documentation/working_documents/design_draft.md162
-rw-r--r--content/documentation/working_documents/index.md8
-rw-r--r--content/documentation/working_documents/load_balancing.md199
-rw-r--r--content/documentation/working_documents/migration_04.md105
-rw-r--r--content/documentation/working_documents/migration_06.md46
6 files changed, 625 insertions, 0 deletions
diff --git a/content/documentation/working_documents/compatibility_target.md b/content/documentation/working_documents/compatibility_target.md
new file mode 100644
index 0000000..8830101
--- /dev/null
+++ b/content/documentation/working_documents/compatibility_target.md
@@ -0,0 +1,105 @@
+# S3 compatibility target
+
+If there is a specific S3 functionnality you have a need for, feel free to open
+a PR to put the corresponding endpoints higher in the list. Please explain
+your motivations for doing so in the PR message.
+
+| Priority | Endpoints |
+| -------------------------- | --------- |
+| **S-tier** (high priority) | |
+| | HeadBucket |
+| | GetBucketLocation |
+| | CreateBucket |
+| | DeleteBucket |
+| | ListBuckets |
+| | ListObjects |
+| | ListObjectsV2 |
+| | HeadObject |
+| | GetObject |
+| | PutObject |
+| | CopyObject |
+| | DeleteObject |
+| | DeleteObjects |
+| | CreateMultipartUpload |
+| | CompleteMultipartUpload |
+| | AbortMultipartUpload |
+| | UploadPart |
+| | [*ListMultipartUploads*](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/103) |
+| | [*ListParts*](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/103) |
+| **A-tier** (will implement) | |
+| | [*GetBucketCors*](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/138) |
+| | [*PutBucketCors*](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/138) |
+| | [*DeleteBucketCors*](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/138) |
+| | UploadPartCopy |
+| | GetBucketWebsite |
+| | PutBucketWebsite |
+| | DeleteBucketWebsite |
+| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
+| **B-tier** | |
+| | GetBucketAcl |
+| | PutBucketAcl |
+| | GetObjectLockConfiguration |
+| | PutObjectLockConfiguration |
+| | GetObjectRetention |
+| | PutObjectRetention |
+| | GetObjectLegalHold |
+| | PutObjectLegalHold |
+| **C-tier** | |
+| | GetBucketVersioning |
+| | PutBucketVersioning |
+| | ListObjectVersions |
+| | GetObjectAcl |
+| | PutObjectAcl |
+| | GetBucketLifecycleConfiguration |
+| | PutBucketLifecycleConfiguration |
+| | DeleteBucketLifecycle |
+| **garbage-tier** | |
+| | DeleteBucketEncryption |
+| | DeleteBucketAnalyticsConfiguration |
+| | DeleteBucketIntelligentTieringConfiguration |
+| | DeleteBucketInventoryConfiguration |
+| | DeleteBucketMetricsConfiguration |
+| | DeleteBucketOwnershipControls |
+| | DeleteBucketPolicy |
+| | DeleteBucketReplication |
+| | DeleteBucketTagging |
+| | DeleteObjectTagging |
+| | DeletePublicAccessBlock |
+| | GetBucketAccelerateConfiguration |
+| | GetBucketAnalyticsConfiguration |
+| | GetBucketEncryption |
+| | GetBucketIntelligentTieringConfiguration |
+| | GetBucketInventoryConfiguration |
+| | GetBucketLogging |
+| | GetBucketMetricsConfiguration |
+| | GetBucketNotificationConfiguration |
+| | GetBucketOwnershipControls |
+| | GetBucketPolicy |
+| | GetBucketPolicyStatus |
+| | GetBucketReplication |
+| | GetBucketRequestPayment |
+| | GetBucketTagging |
+| | GetObjectTagging |
+| | GetObjectTorrent |
+| | GetPublicAccessBlock |
+| | ListBucketAnalyticsConfigurations |
+| | ListBucketIntelligentTieringConfigurations |
+| | ListBucketInventoryConfigurations |
+| | ListBucketMetricsConfigurations |
+| | PutBucketAccelerateConfiguration |
+| | PutBucketAnalyticsConfiguration |
+| | PutBucketEncryption |
+| | PutBucketIntelligentTieringConfiguration |
+| | PutBucketInventoryConfiguration |
+| | PutBucketLogging |
+| | PutBucketMetricsConfiguration |
+| | PutBucketNotificationConfiguration |
+| | PutBucketOwnershipControls |
+| | PutBucketPolicy |
+| | PutBucketReplication |
+| | PutBucketRequestPayment |
+| | PutBucketTagging |
+| | PutObjectTagging |
+| | PutPublicAccessBlock |
+| | RestoreObject |
+| | SelectObjectContent |
diff --git a/content/documentation/working_documents/design_draft.md b/content/documentation/working_documents/design_draft.md
new file mode 100644
index 0000000..06ed46b
--- /dev/null
+++ b/content/documentation/working_documents/design_draft.md
@@ -0,0 +1,162 @@
+# Design draft
+
+**WARNING: this documentation is a design draft which was written before Garage's actual implementation.
+The general principle are similar, but details have not been updated.**
+
+
+#### Modules
+
+- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate
+- `metadata/`: metadata management
+- `blocks/`: block management, writing, GC and rebalancing
+- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc)
+- `api/`: S3 API
+- `web/`: web management interface
+
+#### Metadata tables
+
+**Objects:**
+
+- *Hash key:* Bucket name (string)
+- *Sort key:* Object key (string)
+- *Sort key:* Version timestamp (int)
+- *Sort key:* Version UUID (string)
+- Complete: bool
+- Inline: bool, true for objects < threshold (say 1024)
+- Object size (int)
+- Mime type (string)
+- Data for inlined objects (blob)
+- Hash of first block otherwise (string)
+
+*Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to rapidly being able to list all bucket entries...*
+
+**Blocks:**
+
+- *Hash key:* Version UUID (string)
+- *Sort key:* Offset of block in total file (int)
+- Hash of data block (string)
+
+A version is defined by the existence of at least one entry in the blocks table for a certain version UUID.
+We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table.
+We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object.
+Important: before deleting an older version from the objects table, we must make sure that we did a successfull delete of the blocks of that version from the blocks table.
+
+Thus, the workflow for reading an object is as follows:
+
+1. Check permissions (LDAP)
+2. Read entry in object table. If data is inline, we have its data, stop here.
+ -> if several versions, take newest one and launch deletion of old ones in background
+3. Read first block from cluster. If size <= 1 block, stop here.
+4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks
+5. Read subsequent blocks from cluster
+
+Workflow for PUT:
+
+1. Check write permission (LDAP)
+2. Select a new version UUID
+3. Write a preliminary entry for the new version in the objects table with complete = false
+4. Send blocks to cluster and write entries in the blocks table
+5. Update the version with complete = true and all of the accurate information (size, etc)
+6. Return success to the user
+7. Launch a background job to check and delete older versions
+
+Workflow for DELETE:
+
+1. Check write permission (LDAP)
+2. Get current version (or versions) in object table
+3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME
+4. Return succes to the user if we were able to delete blocks from the blocks table and entries from the object table
+
+To delete a version:
+
+1. List the blocks from Cassandra
+2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC.
+3. Delete all of the blocks from the blocks table
+4. Finally, delete the version from the objects table
+
+Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read.
+
+("Soit P un problème, on s'en fout est une solution à ce problème")
+
+#### Block storage on disk
+
+**Blocks themselves:**
+
+- file path = /blobs/(first 3 hex digits of hash)/(rest of hash)
+
+**Reverse index for GC & other block-level metadata:**
+
+- file path = /meta/(first 3 hex digits of hash)/(rest of hash)
+- map block hash -> set of version UUIDs where it is referenced
+
+Usefull metadata:
+
+- list of versions that reference this block in the Casandra table, so that we can do GC by checking in Cassandra that the lines still exist
+- list of other nodes that we know have acknowledged a write of this block, usefull in the rebalancing algorithm
+
+Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content).
+
+Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes.
+
+**Internal API:**
+
+- get(block hash) -> ok+data/not found/corrupted
+- put(block hash & data, version uuid + offset) -> ok/error
+- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error
+- delete(block hash, version uuid + offset) -> ok/error
+
+GC: when last ref is deleted, delete block.
+Long GC procedure: check in Cassandra that version UUIDs still exist and references this block.
+
+Rebalancing: takes as argument the list of newly added nodes.
+
+- List all blocks that we have. For each block:
+- If it hits a newly introduced node, send it to them.
+ Use put with no data first to check if it has to be sent to them already or not.
+ Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time).
+- If it doesn't hit us anymore, delete it and its reference list.
+
+Only one balancing can be running at a same time. It can be restarted at the beginning with new parameters.
+
+#### Membership management
+
+Two sets of nodes:
+
+- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing%
+ (eviction from this set after say 30 seconds without ping)
+- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk),
+ is a CRDT using a version number for the value of the whole set
+
+Thus, three states for nodes:
+
+- healthy: in both sets
+- missing: not pingable but part of desired cluster
+- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks
+
+Membership messages between nodes:
+
+- ping with current state + hash of current membership info -> reply with same info
+- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip)
+- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast)
+- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast)
+
+Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable.
+The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K).
+Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with node status information also broadcast disk total size and free space, and propose a default number of tokens equal to 80%Free space / 10Gb. (this is all user interface)
+
+
+#### Constants
+
+- Block size: around 1MB ? --> Exoscale use 16MB chunks
+- Number of tokens in the hash ring: one every 10Gb of allocated storage
+- Threshold for storing data directly in Cassandra objects table: 1kb bytes (maybe up to 4kb?)
+- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds
+- Ping interval: 10 seconds
+- ??
+
+#### Links
+
+- CDC: <https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf>
+- Erasure coding: <http://web.eecs.utk.edu/~jplank/plank/papers/CS-08-627.html>
+- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html)
+- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf)
diff --git a/content/documentation/working_documents/index.md b/content/documentation/working_documents/index.md
new file mode 100644
index 0000000..a9e7f89
--- /dev/null
+++ b/content/documentation/working_documents/index.md
@@ -0,0 +1,8 @@
+# Working Documents
+
+Working documents are documents that reflect the fact that Garage is a software that evolves quickly.
+They are a way to communicate our ideas, our changes, and so on before or while we are implementing them in Garage.
+If you like to live on the edge, it could also serve as a documentation of our next features to be released.
+
+Ideally, once the feature/patch has been merged, the working document should serve as a source to
+update the rest of the documentation and then be removed.
diff --git a/content/documentation/working_documents/load_balancing.md b/content/documentation/working_documents/load_balancing.md
new file mode 100644
index 0000000..99271ad
--- /dev/null
+++ b/content/documentation/working_documents/load_balancing.md
@@ -0,0 +1,199 @@
+# Load Balancing Data (planned for version 0.2)
+
+**This is being yet improved in release 0.5. The working document has not been updated yet, it still only applies to Garage 0.2 through 0.4.**
+
+I have conducted a quick study of different methods to load-balance data over different Garage nodes using consistent hashing.
+
+## Requirements
+
+- *good balancing*: two nodes that have the same announced capacity should receive close to the same number of items
+
+- *multi-datacenter*: the replicas of a partition should be distributed over as many datacenters as possible
+
+- *minimal disruption*: when adding or removing a node, as few partitions as possible should have to move around
+
+- *order-agnostic*: the same set of nodes (each associated with a datacenter name
+ and a capacity) should always return the same distribution of partition
+ replicas, independently of the order in which nodes were added/removed (this
+ is to keep the implementation simple)
+
+## Methods
+
+### Naive multi-DC ring walking strategy
+
+This strategy can be used with any ring-like algorithm to make it aware of the *multi-datacenter* requirement:
+
+In this method, the ring is a list of positions, each associated with a single node in the cluster.
+Partitions contain all the keys between two consecutive items of the ring.
+To find the nodes that store replicas of a given partition:
+
+- select the node for the position of the partition's lower bound
+- go clockwise on the ring, skipping nodes that:
+ - we halve already selected
+ - are in a datacenter of a node we have selected, except if we already have nodes from all possible datacenters
+
+In this way the selected nodes will always be distributed over
+`min(n_datacenters, n_replicas)` different datacenters, which is the best we
+can do.
+
+This method was implemented in the first version of Garage, with the basic
+ring construction from Dynamo DB that consists in associating `n_token` random positions to
+each node (I know it's not optimal, the Dynamo paper already studies this).
+
+### Better rings
+
+The ring construction that selects `n_token` random positions for each nodes gives a ring of positions that
+is not well-balanced: the space between the tokens varies a lot, and some partitions are thus bigger than others.
+This problem was demonstrated in the original Dynamo DB paper.
+
+To solve this, we want to apply a better second method for partitionning our dataset:
+
+1. fix an initially large number of partitions (say 1024) with evenly-spaced delimiters,
+
+2. attribute each partition randomly to a node, with a probability
+ proportionnal to its capacity (which `n_tokens` represented in the first
+ method)
+
+For now we continue using the multi-DC ring walking described above.
+
+I have studied two ways to do the attribution of partitions to nodes, in a way that is deterministic:
+
+- Min-hash: for each partition, select node that minimizes `hash(node, partition_number)`
+- MagLev: see [here](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/)
+
+MagLev provided significantly better balancing, as it guarantees that the exact
+same number of partitions is attributed to all nodes that have the same
+capacity (and that this number is proportionnal to the node's capacity, except
+for large values), however in both cases:
+
+- the distribution is still bad, because we use the naive multi-DC ring walking
+ that behaves strangely due to interactions between consecutive positions on
+ the ring
+
+- the disruption in case of adding/removing a node is not as low as it can be,
+ as we show with the following method.
+
+A quick description of MagLev (backend = node, lookup table = ring):
+
+> The basic idea of Maglev hashing is to assign a preference list of all the
+> lookup table positions to each backend. Then all the backends take turns
+> filling their most-preferred table positions that are still empty, until the
+> lookup table is completely filled in. Hence, Maglev hashing gives an almost
+> equal share of the lookup table to each of the backends. Heterogeneous
+> backend weights can be achieved by altering the relative frequency of the
+> backends’ turns…
+
+Here are some stats (run `scripts/simulate_ring.py` to reproduce):
+
+```
+##### Custom-ring (min-hash) #####
+
+#partitions per node (capacity in parenthesis):
+- datura (8) : 227
+- digitale (8) : 351
+- drosera (8) : 259
+- geant (16) : 476
+- gipsie (16) : 410
+- io (16) : 495
+- isou (8) : 231
+- mini (4) : 149
+- mixi (4) : 188
+- modi (4) : 127
+- moxi (4) : 159
+
+Variance of load distribution for load normalized to intra-class mean
+(a class being the set of nodes with the same announced capacity): 2.18% <-- REALLY BAD
+
+Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
+removing atuin digitale : 63.09% 30.18% 6.64% 0.10%
+removing atuin drosera : 72.36% 23.44% 4.10% 0.10%
+removing atuin datura : 73.24% 21.48% 5.18% 0.10%
+removing jupiter io : 48.34% 38.48% 12.30% 0.88%
+removing jupiter isou : 74.12% 19.73% 6.05% 0.10%
+removing grog mini : 84.47% 12.40% 2.93% 0.20%
+removing grog mixi : 80.76% 16.60% 2.64% 0.00%
+removing grog moxi : 83.59% 14.06% 2.34% 0.00%
+removing grog modi : 87.01% 11.43% 1.46% 0.10%
+removing grisou geant : 48.24% 37.40% 13.67% 0.68%
+removing grisou gipsie : 53.03% 33.59% 13.09% 0.29%
+on average: 69.84% 23.53% 6.40% 0.23% <-- COULD BE BETTER
+
+--------
+
+##### MagLev #####
+
+#partitions per node:
+- datura (8) : 273
+- digitale (8) : 256
+- drosera (8) : 267
+- geant (16) : 452
+- gipsie (16) : 427
+- io (16) : 483
+- isou (8) : 272
+- mini (4) : 184
+- mixi (4) : 160
+- modi (4) : 144
+- moxi (4) : 154
+
+Variance of load distribution: 0.37% <-- Already much better, but not optimal
+
+Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
+removing atuin digitale : 62.60% 29.20% 7.91% 0.29%
+removing atuin drosera : 65.92% 26.56% 7.23% 0.29%
+removing atuin datura : 63.96% 27.83% 7.71% 0.49%
+removing jupiter io : 44.63% 40.33% 14.06% 0.98%
+removing jupiter isou : 63.38% 27.25% 8.98% 0.39%
+removing grog mini : 72.46% 21.00% 6.35% 0.20%
+removing grog mixi : 72.95% 22.46% 4.39% 0.20%
+removing grog moxi : 74.22% 20.61% 4.98% 0.20%
+removing grog modi : 75.98% 18.36% 5.27% 0.39%
+removing grisou geant : 46.97% 36.62% 15.04% 1.37%
+removing grisou gipsie : 49.22% 36.52% 12.79% 1.46%
+on average: 62.94% 27.89% 8.61% 0.57% <-- WORSE THAN PREVIOUSLY
+```
+
+### The magical solution: multi-DC aware MagLev
+
+Suppose we want to select three replicas for each partition (this is what we do in our simulation and in most Garage deployments).
+We apply MagLev three times consecutively, one for each replica selection.
+The first time is pretty much the same as normal MagLev, but for the following times, when a node runs through its preference
+list to select a partition to replicate, we skip partitions for which adding this node would not bring datacenter-diversity.
+More precisely, we skip a partition in the preference list if:
+
+- the node already replicates the partition (from one of the previous rounds of MagLev)
+- the node is in a datacenter where a node already replicates the partition and there are other datacenters available
+
+Refer to `method4` in the simulation script for a formal definition.
+
+```
+##### Multi-DC aware MagLev #####
+
+#partitions per node:
+- datura (8) : 268 <-- NODES WITH THE SAME CAPACITY
+- digitale (8) : 267 HAVE THE SAME NUM OF PARTITIONS
+- drosera (8) : 267 (+- 1)
+- geant (16) : 470
+- gipsie (16) : 472
+- io (16) : 516
+- isou (8) : 268
+- mini (4) : 136
+- mixi (4) : 136
+- modi (4) : 136
+- moxi (4) : 136
+
+Variance of load distribution: 0.06% <-- CAN'T DO BETTER THAN THIS
+
+Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
+removing atuin digitale : 65.72% 33.01% 1.27% 0.00%
+removing atuin drosera : 64.65% 33.89% 1.37% 0.10%
+removing atuin datura : 66.11% 32.62% 1.27% 0.00%
+removing jupiter io : 42.97% 53.42% 3.61% 0.00%
+removing jupiter isou : 66.11% 32.32% 1.56% 0.00%
+removing grog mini : 80.47% 18.85% 0.68% 0.00%
+removing grog mixi : 80.27% 18.85% 0.88% 0.00%
+removing grog moxi : 80.18% 19.04% 0.78% 0.00%
+removing grog modi : 79.69% 19.92% 0.39% 0.00%
+removing grisou geant : 44.63% 52.15% 3.22% 0.00%
+removing grisou gipsie : 43.55% 52.54% 3.91% 0.00%
+on average: 64.94% 33.33% 1.72% 0.01% <-- VERY GOOD (VERY LOW VALUES FOR 2 AND 3 NODES)
+```
diff --git a/content/documentation/working_documents/migration_04.md b/content/documentation/working_documents/migration_04.md
new file mode 100644
index 0000000..05e7eaf
--- /dev/null
+++ b/content/documentation/working_documents/migration_04.md
@@ -0,0 +1,105 @@
+# Migrating from 0.3 to 0.4
+
+**Migrating from 0.3 to 0.4 is unsupported. This document is only intended to
+document the process internally for the Deuxfleurs cluster where we have to do
+it. Do not try it yourself, you will lose your data and we will not help you.**
+
+**Migrating from 0.2 to 0.4 will break everything for sure. Never try it.**
+
+The internal data format of Garage hasn't changed much between 0.3 and 0.4.
+The Sled database is still the same, and the data directory as well.
+
+The following has changed, all in the meta directory:
+
+- `node_id` in 0.3 contains the identifier of the current node. In 0.4, this
+ file does nothing and should be deleted. It is replaced by `node_key` (the
+ secret key) and `node_key.pub` (the associated public key). A node's
+ identifier on the ring is its public key.
+
+- `peer_info` in 0.3 contains the list of peers saved automatically by Garage.
+ The format has changed and it is now stored in `peer_list` (`peer_info`
+ should be deleted).
+
+When migrating, all node identifiers will change. This also means that the
+affectation of data partitions on the ring will change, and lots of data will
+have to be rebalanced.
+
+- If your cluster has only 3 nodes, all nodes store everything, therefore nothing has to be rebalanced.
+
+- If your cluster has only 4 nodes, for any partition there will always be at
+ least 2 nodes that stored data before that still store it after. Therefore
+ the migration should in theory be transparent and Garage should continue to
+ work during the rebalance.
+
+- If your cluster has 5 or more nodes, data will disappear during the
+ migration. Do not migrate (fortunately we don't have this scenario at
+ Deuxfleurs), or if you do, make Garage unavailable until things stabilize
+ (disable web and api access).
+
+
+The migration steps are as follows:
+
+1. Prepare a new configuration file for 0.4. For each node, point to the same
+ meta and data directories as Garage 0.3. Basically, the things that change
+ are the following:
+
+ - No more `rpc_tls` section
+ - You have to generate a shared `rpc_secret` and put it in all config files
+ - `bootstrap_peers` has a different syntax as it has to contain node keys.
+ Leave it empty and use `garage node-id` and `garage node connect` instead (new features of 0.4)
+ - put the publicly accessible RPC address of your node in `rpc_public_addr` if possible (its optional but recommended)
+ - If you are using Consul, change the `consul_service_name` to NOT be the name advertised by Nomad.
+ Now Garage is responsible for advertising its own service itself.
+
+2. Disable api and web access for some time (Garage does not support disabling
+ these endpoints but you can change the port number or stop your reverse
+ proxy for instance).
+
+3. Do `garage repair -a --yes tables` and `garage repair -a --yes blocks`,
+ check the logs and check that all data seems to be synced correctly between
+ nodes.
+
+4. Save somewhere the output of `garage status`. We will need this to remember
+ how to reconfigure nodes in 0.4.
+
+5. Turn off Garage 0.3
+
+6. Backup metadata folders if you can (i.e. if you have space to do it
+ somewhere). Backuping data folders could also be usefull but that's much
+ harder to do. If your filesystem supports snapshots, this could be a good
+ time to use them.
+
+7. Turn on Garage 0.4
+
+8. At this point, running `garage status` should indicate that all nodes of the
+ previous cluster are "unavailable". The nodes have new identifiers that
+ should appear in healthy nodes once they can talk to one another (use
+ `garage node connect` if necessary`). They should have NO ROLE ASSIGNED at
+ the moment.
+
+9. Prepare a script with several `garage node configure` commands that replace
+ each of the v0.3 node ID with the corresponding v0.4 node ID, with the same
+ zone/tag/capacity. For example if your node `drosera` had identifier `c24e`
+ before and now has identifier `789a`, and it was configured with capacity
+ `2` in zone `dc1`, put the following command in your script:
+
+```bash
+garage node configure 789a -z dc1 -c 2 -t drosera --replace c24e
+```
+
+10. Run your reconfiguration script. Check that the new output of `garage
+ status` contains the correct node IDs with the correct values for capacity
+ and zone. Old nodes should no longer be mentioned.
+
+11. If your cluster has 4 nodes or less, and you are feeling adventurous, you
+ can reenable Web and API access now. Things will probably work.
+
+12. Garage might already be resyncing stuff. Issue a `garage repair -a --yes
+ tables` and `garage repair -a --yes blocks` to force it to do so.
+
+13. Wait for resyncing activity to stop in the logs. Do steps 12 and 13 two or
+ three times, until you see that when you issue the repair commands, nothing
+ gets resynced any longer.
+
+14. Your upgraded cluster should be in a working state. Re-enable API and Web
+ access and check that everything went well.
diff --git a/content/documentation/working_documents/migration_06.md b/content/documentation/working_documents/migration_06.md
new file mode 100644
index 0000000..4312e37
--- /dev/null
+++ b/content/documentation/working_documents/migration_06.md
@@ -0,0 +1,46 @@
+# Migrating from 0.5 to 0.6
+
+**This guide explains how to migrate to 0.6 if you have an existing 0.5 cluster.
+We don't recommend trying to migrate directly from 0.4 or older to 0.6.**
+
+**We make no guarantee that this migration will work perfectly:
+back up all your data before attempting it!**
+
+Garage v0.6 (not yet released) introduces a new data model for buckets,
+that allows buckets to have many names (aliases).
+Buckets can also have "private" aliases (called local aliases),
+which are only visible when using a certain access key.
+
+This new data model means that the metadata tables have changed quite a bit in structure,
+and a manual migration step is required.
+
+The migration steps are as follows:
+
+1. Disable api and web access for some time (Garage does not support disabling
+ these endpoints but you can change the port number or stop your reverse
+ proxy for instance).
+
+2. Do `garage repair -a --yes tables` and `garage repair -a --yes blocks`,
+ check the logs and check that all data seems to be synced correctly between
+ nodes.
+
+4. Turn off Garage 0.5
+
+5. **Backup your metadata folders!!**
+
+6. Turn on Garage 0.6
+
+7. At this point, `garage bucket list` should indicate that no buckets are present
+ in the cluster. `garage key list` should show all of the previously existing
+ access key, however these keys should not have any permissions to access buckets.
+
+8. Run `garage migrate buckets050`: this will populate the new bucket table with
+ the buckets that existed previously. This will also give access to API keys
+ as it was before.
+
+9. Check that all your buckets indeed appear in `garage bucket list`, and that
+ keys have the proper access flags set. If that is not the case, revert
+ everything and file a bug!
+
+10. Your upgraded cluster should be in a working state. Re-enable API and Web
+ access and check that everything went well.