author Alex Auvolat <alex@adnab.me> 2022-12-22 17:46:19 +0100
committer Alex Auvolat <alex@adnab.me> 2022-12-22 17:46:19 +0100
commit b575b2b4862c4019a4ca5c9240ea5989f7a93b40
tree dfc5889c25a69d8ce3402539484a20d5af732db3 /op_guide/stolon
parent 015c3725326e635d58bd5ee1c30b95560ed45055
Remove all files from op_guide, now migrated to guide.deuxfleurs.fr
Diffstat (limited to 'op_guide/stolon')
-rw-r--r-- op_guide/stolon/README.md | 3
-rw-r--r-- op_guide/stolon/create_database.md | 26
-rw-r--r-- op_guide/stolon/install.md | 87
-rw-r--r-- op_guide/stolon/manual_backup.md | 305
-rw-r--r-- op_guide/stolon/nomad_full_backup.md | 26
5 files changed, 0 insertions, 447 deletions
diff --git a/op_guide/stolon/README.md b/op_guide/stolon/README.md
deleted file mode 100644
index 9e76b0e..0000000
--- a/op_guide/stolon/README.md
+++ /dev/null
@@ -1,3 +0,0 @@
- - [Initialize the cluster](install.md)
- - [Create a database](create_database.md)
- - [Manually back up all the databases](manual_backup.md)
diff --git a/op_guide/stolon/create_database.md b/op_guide/stolon/create_database.md
deleted file mode 100644
index 96999ef..0000000
--- a/op_guide/stolon/create_database.md
+++ /dev/null
@@ -1,26 +0,0 @@
-## 1. Create an LDAP user and assign a password for your service
-
-Go to guichet.deuxfleurs.fr
-
- 1. Everything takes place in `ou=services,ou=users,dc=deuxfleurs,dc=fr`
- 2. Create a new user, like `johny`
- 3. Generate a random password with `openssl rand -base64 32`
- 4. Hash it with `slappasswd`
- 5. Add a `userpassword` entry with the hash
-
-This step can also be done using the automated tool `secretmgr.py` in the app folder.
-
-## 2. Connect to postgres with the admin user
-
-```bash
-# 1. Launch the SSH tunnel given in the README
-# 2. Make sure you have the postgresql client installed locally
-psql -h localhost -U postgres -W postgres
-```
-
-## 3. Create the LDAP-bound user and the database in postgres
-
-```sql
-CREATE USER sogo;
-CREATE DATABASE sogodb WITH OWNER sogo ENCODING 'utf8' LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;
-```
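As an illustration of step 3 of `create_database.md`, here is a minimal sketch that generates the two SQL statements for an arbitrary service. The function name `make_create_db_sql` and the identifier check are assumptions made for this example and are not part of the original guide; the statements themselves mirror the ones shown in that file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original guide): print the SQL needed
# to create a role and its database, following the statements shown above.
make_create_db_sql() {
  local user="$1" db="$2"
  # Accept only plain lowercase identifiers so nothing needs SQL quoting.
  local re='^[a-z_][a-z0-9_]*$'
  if [[ ! "$user" =~ $re || ! "$db" =~ $re ]]; then
    echo "invalid identifier" >&2
    return 1
  fi
  printf 'CREATE USER %s;\n' "$user"
  printf "CREATE DATABASE %s WITH OWNER %s ENCODING 'utf8' LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;\n" "$db" "$user"
}

# Print the statements for review.
make_create_db_sql sogo sogodb
```

The output can be reviewed and then pasted or piped into the `psql` session opened in step 2.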
diff --git a/op_guide/stolon/install.md b/op_guide/stolon/install.md
deleted file mode 100644
index e4791ed..0000000
--- a/op_guide/stolon/install.md
+++ /dev/null
@@ -1,87 +0,0 @@
-Spawn container:
-
-```bash
-docker run \
- -ti --rm \
- --name stolon-config \
- --user root \
- -v /var/lib/consul/pki/:/certs \
- superboum/amd64_postgres:v11
-```
-
-
-Init with:
-
-```
-stolonctl \
- --cluster-name chelidoine \
- --store-backend=consul \
- --store-endpoints https://consul.service.prod.consul:8501 \
- --store-ca-file /certs/consul-ca.crt \
- --store-cert-file /certs/consul2022-client.crt \
- --store-key /certs/consul2022-client.key \
- init \
- '{ "initMode": "new",
- "usePgrewind" : true,
- "proxyTimeout" : "120s",
- "pgHBA": [
- "host all postgres all md5",
- "host replication replicator all md5",
- "host all all all ldap ldapserver=bottin.service.prod.consul ldapbasedn=\"ou=users,dc=deuxfleurs, dc=fr\" ldapbinddn=\"<bind_dn>\" ldapbindpasswd=\"<bind_pwd>\" ldapsearchattribute=\"cn\""
- ]
- }'
-
-```
-
-Then set appropriate permissions on the host:
-
-```
-mkdir -p /mnt/{ssd,storage}/postgres/
-chown -R 999:999 /mnt/{ssd,storage}/postgres/
-```
-
-(999 is the ID of the postgres user used in the Docker image, as shown by the `chown` above.)
-This could be improved by starting as root, fixing ownership in an entrypoint, and finally switching to user 999 before executing the user's command.
-Moreover, it would enable the use of user namespaces, which shift the UIDs.
-
-
-
-## Upgrading the cluster
-
-To retrieve the current stolon config:
-
-```
-stolonctl spec --cluster-name chelidoine --store-backend consul --store-ca-file ... --store-cert-file ... --store-endpoints https://consul.service.prod.consul:8501
-```
-
-The important part for LDAP:
-
-```
-{
- "pgHBA": [
- "host all postgres all md5",
- "host replication replicator all md5",
- "host all all all ldap ldapserver=bottin.service.2.cluster.deuxfleurs.fr ldapbasedn=\"ou=users,dc=deuxfleurs,dc=fr\" ldapbinddn=\"cn=admin,dc=deuxfleurs,dc=fr\" ldapbindpasswd=\"<REDACTED>\" ldapsearchattribute=\"cn\""
- ]
-}
-```
-
-Once a patch is written:
-
-```
-stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch -f /tmp/patch.json
-```
-
-## Log
-
-- 2020-12-18 Activate pg\_rewind in stolon
-
-```
-stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "usePgrewind" : true }'
-```
-
-- 2021-03-14 Increase proxy timeout to cope with consul latency spikes
-
-```
-stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "proxyTimeout" : "120s" }'
-```
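The patch workflow from `install.md` can be sketched as a dry run: write the patch to a file, review it, and print the `stolonctl` command that would apply it. The helper names `write_patch` and `print_update_cmd` are hypothetical; the flags, cluster name and endpoint are copied from the commands in the log above, and nothing here contacts Consul or Stolon.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper (not part of the original guide): write a patch
# file and print the stolonctl command that would apply it, without running it.

write_patch() {
  # $1: destination file. The two settings are the ones recorded in the log above.
  cat > "$1" <<'EOF'
{
  "usePgrewind": true,
  "proxyTimeout": "120s"
}
EOF
}

print_update_cmd() {
  # $1: cluster name, $2: Consul endpoint, $3: patch file
  printf 'stolonctl --cluster-name %s --store-backend consul --store-endpoints %s update --patch -f %s\n' \
    "$1" "$2" "$3"
}

patch_file="$(mktemp)"
write_patch "$patch_file"
cat "$patch_file"    # review the patch before applying it
print_update_cmd pissenlit http://consul.service.2.cluster.deuxfleurs.fr:8500 "$patch_file"
```

Printing the command instead of executing it leaves a chance to check the cluster name and endpoint before touching production.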
diff --git a/op_guide/stolon/manual_backup.md b/op_guide/stolon/manual_backup.md
deleted file mode 100644
index 654d789..0000000
--- a/op_guide/stolon/manual_backup.md
+++ /dev/null
@@ -1,305 +0,0 @@
-## Disclaimer
-
-Do **NOT** use the following backup methods on the Stolon Cluster:
- 1. copying the data directory
- 2. `pg_dump`
- 3. `pg_dumpall`
-
-The first one will lead to corrupted/inconsistent files.
-The second and third ones put too much pressure on the cluster.
-Basically, you will destroy it, in the following ways:
- - Load will increase, requests will time out
- - RAM usage will increase, the daemon will be OOM (Out Of Memory) killed by Linux
- - Potentially, the WAL will grow a lot
-
-
-## A binary backup with `pg_basebackup`
-
-The only acceptable solution is `pg_basebackup` with **some throttling configured**.
-Later, if you want a SQL dump, you can load this binary backup into an ephemeral database that you spawn solely for this purpose on a non-production machine.
-
-First, fetch the credentials of the replication account from Consul.
-Do not use the root account set up in Stolon, it will not work.
-
-Then set up an SSH tunnel on your machine that forwards the PostgreSQL port, e.g.:
-
-```bash
-ssh -L 5432:psql-proxy.service.2.cluster.deuxfleurs.fr:5432 ...
-```
-
-*Later, we will use `/tmp/sql` as our working directory. Depending on your distribution, this
-folder may be a `tmpfs` and thus mounted in RAM. If that is the case, choose another folder that is not a `tmpfs`, otherwise you will fill your RAM
-and your backup will fail. I am using NixOS, where `/tmp` is a regular folder persisted on disk, which explains why I am using it.*
-
-Then export your password in `PGPASSWORD` and launch the backup:
-
-```bash
-export PGPASSWORD=xxx
-
-mkdir -p /tmp/sql
-cd /tmp/sql
-
-pg_basebackup \
- --host=127.0.0.1 \
- --username=replicator \
- --pgdata=/tmp/sql \
- --format=tar \
- --wal-method=stream \
- --gzip \
- --compress=6 \
- --progress \
- --max-rate=5M
-```
-
-*Something you should know: while it seems optional, fetching the WAL is mandatory. At first, I thought it was a way to have a "more recent backup".
-But after some reading, it appears that the base backup on its own is inconsistent because it is not a snapshot at all, but a copy of the postgres folder with no specific state.
-The whole point of the WAL is, in fact, to bring this archive back to a consistent state...*
-
-*Grab a cup of coffee, it will take some time...*
-
-The result I get (the important file is `base.tar.gz`, `41921.tar.gz` will probably be missing as it is a secondary tablespace I will deactivate soon):
-
-```
-[nix-shell:/tmp/sql]$ ls
-41921.tar.gz backup_manifest base.tar.gz pg_wal.tar.gz
-```
-
-From now on, disconnect from production to continue your work.
-You don't need it anymore, and it will prevent a disaster if you mistype a command.
-
-
-## Importing the backup
-
-> The backup taken with `pg_basebackup` is an exact copy of your data directory so, all you need to do to restore from that backup is to point postgres at that directory and start it up.
-
-```bash
-mkdir -p /tmp/sql/pg_data && cd /tmp/sql/pg_data
-tar xzfv ../base.tar.gz
-```
-
-Now you should have something like this:
-
-```
-[nix-shell:/tmp/sql/pg_data]$ ls
-backup_label base pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_wal postgresql.conf tablespace_map
-backup_label.old global pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc PG_VERSION pg_xact stolon-temp-postgresql.conf
-```
-
-Now we will extract the WAL:
-
-```bash
-mkdir -p /tmp/sql/wal && cd /tmp/sql/wal
-tar xzfv ../pg_wal.tar.gz
-```
-
-You should have something like this:
-
-```
-[nix-shell:/tmp/sql/wal]$ ls
-00000003000014AF000000C9 00000003000014AF000000CA 00000003.history archive_status
-```
-
-Before restoring our backup, we want to check it:
-
-```bash
-cd /tmp/sql/pg_data
-cp ../backup_manifest .
-# We do not verify the WAL because it does not seem to work very well
-# See my capdata.fr reference at the bottom
-# pg_verifybackup -w ../wal .
-pg_verifybackup -n .
-
-```
-
-Now, we must edit/read some files before launching our ephemeral server:
- - Set `listen_addresses = '0.0.0.0'` in `postgresql.conf`
- - Add `restore_command = 'cp /mnt/wal/%f %p'` in `postgresql.conf`
- - Check `port` in `postgresql.conf`, in our case it is `5433`.
- - Create an empty file named `recovery.signal`
-
-*Do not create a `recovery.conf` file, it might be written on the internet but this is a deprecated method and your postgres daemon will refuse to boot if it finds one.*
-
-*Currently, we use port 5433 in our postgresql configuration despite 5432 being the default port. Indeed, in production, clients access the cluster transparently through the Stolon Proxy, which listens on port 5432 and redirects requests to the correct PostgreSQL instance, itself listening on port 5433. To export our binary backup as text, we will query our postgres instance directly without going through the proxy, which is why you must note this port.*
-
-Now we will start our postgres container on our machine.
-
-At the time of writing the live version is `superboum/amd64_postgres:v9`.
-We must start by getting the `postgres` user ID. Our containers run as this user by default, so you only need to run:
-
-```bash
-docker run --rm -it superboum/amd64_postgres:v9 id
-```
-
-And we get:
-
-```
-uid=999(postgres) gid=999(postgres) groups=999(postgres),101(ssl-cert)
-```
-
-Now `chown` your `pg_data`:
-
-```bash
-chown 999:999 -R /tmp/sql/{pg_data,wal}
-chmod 700 -R /tmp/sql/{pg_data,wal}
-```
-
-And finally:
-
-```
-docker run \
- --rm \
- -it \
- -p 5433:5433 \
- -v /tmp/sql/:/mnt/ \
- superboum/amd64_postgres:v9 \
- postgres -D /mnt/pg_data
-```
-
-I have the following output:
-
-```
-2022-01-28 14:46:39.750 GMT [1] LOG: skipping missing configuration file "/mnt/pg_data/postgresql.auto.conf"
-2022-01-28 14:46:39.763 UTC [1] LOG: starting PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
-2022-01-28 14:46:39.764 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5433
-2022-01-28 14:46:39.767 UTC [1] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
-2022-01-28 14:46:39.773 UTC [7] LOG: database system was interrupted; last known up at 2022-01-28 14:33:13 UTC
-cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
-2022-01-28 14:46:40.318 UTC [7] LOG: starting archive recovery
-2022-01-28 14:46:40.321 UTC [7] LOG: restored log file "00000003.history" from archive
-2022-01-28 14:46:40.336 UTC [7] LOG: restored log file "00000003000014AF000000C9" from archive
-2022-01-28 14:46:41.426 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
-2022-01-28 14:46:41.445 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
-2022-01-28 14:46:41.457 UTC [7] LOG: redo starts at 14AF/C9000028
-2022-01-28 14:46:41.500 UTC [7] LOG: restored log file "00000003000014AF000000CA" from archive
-2022-01-28 14:46:42.461 UTC [7] LOG: consistent recovery state reached at 14AF/CA369AB0
-2022-01-28 14:46:42.461 UTC [1] LOG: database system is ready to accept read only connections
-cp: cannot stat '/mnt/wal/00000003000014AF000000CB': No such file or directory
-2022-01-28 14:46:42.463 UTC [7] LOG: redo done at 14AF/CA369AB0
-2022-01-28 14:46:42.463 UTC [7] LOG: last completed transaction was at log time 2022-01-28 14:35:04.698438+00
-2022-01-28 14:46:42.480 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
-2022-01-28 14:46:42.493 UTC [7] LOG: restored log file "00000003000014AF000000CA" from archive
-cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
-2022-01-28 14:46:43.462 UTC [7] LOG: selected new timeline ID: 4
-2022-01-28 14:46:44.441 UTC [7] LOG: archive recovery complete
-2022-01-28 14:46:44.444 UTC [7] LOG: restored log file "00000003.history" from archive
-2022-01-28 14:46:45.614 UTC [1] LOG: database system is ready to accept connections
-```
-
-*Notes: the missing tablespace is a legacy tablespace used in the past to debug Matrix. It will be removed soon, so we can safely ignore it. The other `cp` errors seem to be expected, as postgres probably checks how far it can replay the WAL, but I am not 100% sure.*
-
-Your ephemeral instance should work:
-
-```bash
-export PGPASSWORD=xxx # your postgres (admin) account password
-
-psql -h 127.0.0.1 -p 5433 -U postgres postgres
-```
-
-And your databases should appear:
-
-```
-[nix-shell:~/Documents/dev/infrastructure]$ psql -h 127.0.0.1 -p 5433 -U postgres postgres
-psql (13.5, server 13.3 (Debian 13.3-1.pgdg100+1))
-Type "help" for help.
-
-postgres=# \l
- List of databases
- Name | Owner | Encoding | Collate | Ctype | Access privileges
------------+----------+----------+------------+------------+-----------------------
- xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
- xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
- xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
- postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
- xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
- xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
- template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
- | | | | | postgres=CTc/postgres
- template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
- | | | | | postgres=CTc/postgres
-(8 rows)
-```
-
-## Dump your ephemeral database as SQL
-
-Now we can do a SQL export of our ephemeral database.
-We use zstd to compress the output file on the fly.
-We use multiple parameters:
- - `-vv` gives us some idea of the progress
- - `-9` is a quite high value and should compress efficiently. Decrease it if your machine is low-powered
- - `-T0` asks zstd to use all your cores. By default, zstd uses only one core.
-
-```bash
-pg_dumpall -h 127.0.0.1 -p 5433 -U postgres \
- | zstd -vv -9 -T0 --format=zstd > dump-`date --rfc-3339=seconds | sed 's/ /T/'`.sql.zstd
-```
-
-I get the following result:
-
-```
-[nix-shell:/tmp/sql]$ ls -lah dump*
--rw-r--r-- 1 quentin users 749M janv. 28 16:07 dump-2022-01-28T16:06:29+01:00.sql.zstd
-```
-
-Now you can stop your ephemeral server.
-
-## Restore your SQL file
-
-First, start a blank server:
-
-```bash
-docker run \
- --rm -it \
- --name postgres \
- -p 5433:5432 \
- superboum/amd64_postgres:v9 \
- bash -c '
- set -ex
- mkdir /tmp/psql
- initdb -D /tmp/psql --no-locale --encoding=UTF8
- echo "host all postgres 0.0.0.0/0 md5" >> /tmp/psql/pg_hba.conf
- postgres -D /tmp/psql
- '
-```
-
-Then set the same password as your prod for the `postgres` user (it will be required as part of the restore):
-
-```bash
-docker exec -ti postgres bash -c "echo \"ALTER USER postgres WITH PASSWORD '$PGPASSWORD';\" | psql"
-echo '\l' | psql -h 127.0.0.1 -p 5433 -U postgres postgres
-# the database list should have no entry (except `postgres`, `template0` and `template1`), otherwise ABORT EVERYTHING, YOU ARE ON THE WRONG DB
-```
-
-And finally, restore your SQL backup:
-
-```bash
-zstdcat -vv dump-* | \
- grep -P -v '^(CREATE|DROP) ROLE postgres;' | \
- psql -h 127.0.0.1 -p 5433 -U postgres --set ON_ERROR_STOP=on postgres
-```
-
-*Note: we must skip CREATE/DROP ROLE postgres during the restore as the role already exists and the statement would generate an error.
-Because we want to be extra careful, we specifically asked psql to stop on the first error and do not want to change this behavior.
-So, instead, we simply remove any line that matches the regex given in the previous command.*
-
-Check that the backup has been correctly restored.
-For example:
-
-```bash
-docker exec -ti postgres psql
-# then type "\l", "\c db-name", "select ..."
-```
-
-## Finally, store it safely
-
-```bash
-rsync --progress -av /tmp/sql/{*.tar.gz,backup_manifest,dump-*} backup/target
-```
-
-## Ref
-
- - https://philipmcclarence.com/backing-up-and-restoring-postgres-using-pg_basebackup/
- - https://www.cybertec-postgresql.com/en/pg_basebackup-creating-self-sufficient-backups/
- - https://www.postgresql.org/docs/14/continuous-archiving.html
- - https://www.postgresql.org/docs/14/backup-dump.html#BACKUP-DUMP-RESTORE
- - https://dba.stackexchange.com/questions/75033/how-to-restore-everything-including-postgres-role-from-pg-dumpall-backup
- - https://blog.capdata.fr/index.php/postgresql-13-les-nouveautes-interessantes/
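To make the role filter used in the restore step of `manual_backup.md` more concrete, the sketch below applies the same pattern to a few sample lines. The sample SQL is made up for this illustration, and `grep -E` is used in place of `grep -P` on the assumption that this particular pattern behaves identically in both modes.

```bash
#!/usr/bin/env bash
# Illustration only: apply the same filter as the restore step to a few sample
# lines. The sample SQL below is made up; a real pg_dumpall output is much longer.
filter_postgres_role() {
  # grep -E is used here instead of -P; this pattern behaves the same in both.
  grep -E -v '^(CREATE|DROP) ROLE postgres;'
}

sample_dump() {
  cat <<'EOF'
DROP ROLE postgres;
CREATE ROLE postgres;
ALTER ROLE postgres WITH SUPERUSER;
CREATE ROLE sogo;
EOF
}

# Only the CREATE/DROP statements for the postgres role are removed.
sample_dump | filter_postgres_role
```

The `ALTER ROLE postgres` line and the statements for other roles pass through untouched, so the restore still applies the role attributes stored in the dump.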
diff --git a/op_guide/stolon/nomad_full_backup.md b/op_guide/stolon/nomad_full_backup.md
deleted file mode 100644
index 2fb5822..0000000
--- a/op_guide/stolon/nomad_full_backup.md
+++ /dev/null
@@ -1,26 +0,0 @@
-Start by following ../restic
-
-## Garbage collect old backups
-
-```
-mc ilm import deuxfleurs/${BUCKET_NAME} <<EOF
-{
- "Rules": [
- {
- "Expiration": {
- "Days": 62
- },
- "ID": "PurgeOldBackups",
- "Status": "Enabled"
- }
- ]
-}
-EOF
-```
-
-Check that it has been activated:
-
-```
- mc ilm ls deuxfleurs/${BUCKET_NAME}
-```
-
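The lifecycle rule from `nomad_full_backup.md` hard-codes a 62-day expiration. As a minimal sketch, the hypothetical helper `lifecycle_json` below prints the same rule for a configurable retention period; the function name and the input check are assumptions for illustration, and the JSON layout is copied from the rule shown in that file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original guide): print the lifecycle
# rule shown above for a configurable retention period.
lifecycle_json() {
  local days="$1"
  # Reject anything that is not a positive integer.
  if [[ ! "$days" =~ ^[1-9][0-9]*$ ]]; then
    echo "days must be a positive integer" >&2
    return 1
  fi
  cat <<EOF
{
  "Rules": [
    {
      "Expiration": {
        "Days": ${days}
      },
      "ID": "PurgeOldBackups",
      "Status": "Enabled"
    }
  ]
}
EOF
}

# Print the rule with the retention used in the guide.
lifecycle_json 62
```

Since the guide feeds the rule to `mc ilm import deuxfleurs/${BUCKET_NAME}` on standard input, the output of this function could be piped to that same command.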