author     Alex <alex@adnab.me>  2022-02-07 11:51:12 +0100
committer  Alex <alex@adnab.me>  2022-02-07 11:51:12 +0100
commit     1c0ba930b8d6aa5d97e6942852240861e6ab9bed (patch)
tree       cddc9af5fc2378c76fe5ef6306f807e27648b7a7 /doc/book/working-documents/migration-04.md
parent     45d6d377d2011d8fb4ceb13bb4584df97c458525 (diff)
Reorganize documentation for new website (#213)

This PR should be merged after the new website is deployed.

- [x] Rename files
- [x] Add front matter section to all `.md` files in the book (necessary for Zola)
- [x] Change all internal links to use Zola's linking system that checks broken links
- [x] Some updates to documentation contents and organization

Co-authored-by: Alex Auvolat <alex@adnab.me>
Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/213
Co-authored-by: Alex <alex@adnab.me>
Co-committed-by: Alex <alex@adnab.me>
Diffstat (limited to 'doc/book/working-documents/migration-04.md')
-rw-r--r--  doc/book/working-documents/migration-04.md | 108
1 file changed, 108 insertions, 0 deletions
diff --git a/doc/book/working-documents/migration-04.md b/doc/book/working-documents/migration-04.md
new file mode 100644
index 00000000..d9d3ede1
--- /dev/null
+++ b/doc/book/working-documents/migration-04.md
@@ -0,0 +1,108 @@
++++
+title = "Migrating from 0.3 to 0.4"
+weight = 20
++++
+
+**Migrating from 0.3 to 0.4 is unsupported. This document exists only to
+document the process internally for the Deuxfleurs cluster, where we have to do
+it. Do not try it yourself: you will lose your data and we will not help you.**
+
+**Migrating from 0.2 to 0.4 will break everything for sure. Never try it.**
+
+The internal data format of Garage hasn't changed much between 0.3 and 0.4.
+The Sled database is still the same, and so is the data directory.
+
+The following has changed, all in the meta directory:
+
+- `node_id` in 0.3 contains the identifier of the current node. In 0.4, this
+ file does nothing and should be deleted. It is replaced by `node_key` (the
+ secret key) and `node_key.pub` (the associated public key). A node's
+ identifier on the ring is its public key.
+
+- `peer_info` in 0.3 contains the list of peers saved automatically by Garage.
+  The format has changed and the list is now stored in `peer_list`
+  (`peer_info` should be deleted; see the cleanup sketch after this list).
+
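+A minimal cleanup sketch for these two files, assuming the hypothetical
+metadata directory `/var/lib/garage/meta` (substitute your own `metadata_dir`):
+
+```bash
+cd /var/lib/garage/meta
+rm node_id    # obsolete in 0.4, replaced by node_key / node_key.pub
+rm peer_info  # obsolete in 0.4, replaced by peer_list
+```
+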
+When migrating, all node identifiers will change. This also means that the
+assignment of data partitions on the ring will change, and lots of data will
+have to be rebalanced.
+
+- If your cluster has only 3 nodes, all nodes store everything (with a
+  replication factor of 3), so nothing has to be rebalanced.
+
+- If your cluster has only 4 nodes, each partition is stored on 3 of the 4
+  nodes, so for any partition there will always be at least 2 nodes that
+  stored its data before the migration and still store it afterwards. The
+  migration should therefore in theory be transparent, and Garage should
+  continue to work during the rebalance.
+
+- If your cluster has 5 or more nodes, data will disappear during the
+  migration. Do not migrate (fortunately we don't have this scenario at
+  Deuxfleurs); if you do anyway, make Garage unavailable until things
+  stabilize (disable web and API access).
+
+
+The migration steps are as follows:
+
+1. Prepare a new configuration file for 0.4. For each node, point to the same
+ meta and data directories as Garage 0.3. Basically, the things that change
+ are the following:
+
+   - There is no more `rpc_tls` section.
+   - You have to generate a shared `rpc_secret` and put it in all config files
+     (see the sketch after this list).
+   - `bootstrap_peers` has a different syntax, as it has to contain node keys.
+     Leave it empty and use `garage node-id` and `garage node connect` instead
+     (new features of 0.4).
+   - Put the publicly accessible RPC address of your node in `rpc_public_addr`
+     if possible (it's optional but recommended).
+   - If you are using Consul, change the `consul_service_name` to NOT be the
+     name advertised by Nomad: Garage is now responsible for advertising its
+     own service.
+
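+A minimal sketch for the `rpc_secret` step: generate one 32-byte hex-encoded
+secret and copy the same value into the `rpc_secret` field of every node's
+config file.
+
+```bash
+# Run once, anywhere; reuse the output on all nodes.
+openssl rand -hex 32
+```
+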
+2. Disable API and web access for some time (Garage does not support disabling
+   these endpoints, but you can change the port number or stop your reverse
+   proxy, for instance; see the sketch below).
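+
+A hypothetical way to do this, assuming nginx is the reverse proxy in front of
+the API and web endpoints (any reverse proxy works the same way):
+
+```bash
+# Cuts off S3 API and web traffic without touching Garage itself;
+# start the proxy again when re-enabling access in steps 11/14.
+sudo systemctl stop nginx
+```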
+
+3. Do `garage repair -a --yes tables` and `garage repair -a --yes blocks`,
+   check the logs, and verify that all data seems to be synced correctly
+   between nodes.
+
+4. Save the output of `garage status` somewhere. We will need it to remember
+   how to reconfigure the nodes in 0.4.
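+
+For instance (the output file name is arbitrary):
+
+```bash
+# `tee` prints the status and saves a copy for step 9.
+garage status | tee garage-status-v0.3.txt
+```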
+
+5. Turn off Garage 0.3.
+
+6. Back up metadata folders if you can (i.e. if you have space to do it
+   somewhere). Backing up data folders could also be useful, but that's much
+   harder to do. If your filesystem supports snapshots, this could be a good
+   time to use them.
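+
+A sketch of the metadata backup, assuming the hypothetical metadata directory
+`/var/lib/garage/meta` (substitute your own `metadata_dir`):
+
+```bash
+# Garage 0.3 is already stopped (step 5), so the Sled database is not
+# being written to while we archive it.
+tar czf garage-meta-backup-v0.3.tar.gz -C /var/lib/garage meta
+```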
+
+7. Turn on Garage 0.4.
+
+8. At this point, running `garage status` should indicate that all nodes of
+   the previous cluster are "unavailable". The nodes have new identifiers,
+   which should appear in the list of healthy nodes once they can talk to one
+   another (use `garage node connect` if necessary). They should have NO ROLE
+   ASSIGNED at the moment.
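+
+A hypothetical `garage node connect` invocation; the argument is the
+identifier@address string that `garage node-id` prints on the other node (the
+values below are made up):
+
+```bash
+garage node connect 789a...@10.0.0.2:3901
+```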
+
+9. Prepare a script with several `garage node configure` commands that replace
+   each of the v0.3 node IDs with the corresponding v0.4 node ID, with the
+   same zone/tag/capacity. For example, if your node `drosera` had identifier
+   `c24e` before and now has identifier `789a`, and it was configured with
+   capacity `2` in zone `dc1`, put the following command in your script:
+
+```bash
+garage node configure 789a -z dc1 -c 2 -t drosera --replace c24e
+```
+
+10. Run your reconfiguration script. Check that the new output of `garage
+ status` contains the correct node IDs with the correct values for capacity
+ and zone. Old nodes should no longer be mentioned.
+
+11. If your cluster has 4 nodes or fewer, and you are feeling adventurous, you
+    can re-enable web and API access now. Things will probably work.
+
+12. Garage might already be resyncing stuff. Issue a `garage repair -a --yes
+ tables` and `garage repair -a --yes blocks` to force it to do so.
+
+13. Wait for resyncing activity to stop in the logs. Repeat steps 12 and 13
+    two or three times, until issuing the repair commands no longer causes
+    anything to be resynced.
+
+14. Your upgraded cluster should be in a working state. Re-enable API and web
+    access and check that everything went well.