From 0e1574a82b7067910d5403cfd46e94bcf929327a Mon Sep 17 00:00:00 2001
From: Alex Auvolat <alex@adnab.me>
Date: Thu, 22 Dec 2022 23:44:00 +0100
Subject: More doc reorganization

---
 doc/adding-nodes.md    | 30 +++++++++++++++++++++
 doc/architecture.md    | 19 ++++++-------
 doc/onboarding.md      | 45 +++++++++++++++++++++++++++++++
 doc/quick-start.md     | 73 --------------------------------------------------
 doc/why-not-ansible.md | 37 +++++++++++++++++++++++++
 5 files changed, 122 insertions(+), 82 deletions(-)
 create mode 100644 doc/adding-nodes.md
 create mode 100644 doc/onboarding.md
 delete mode 100644 doc/quick-start.md
 create mode 100644 doc/why-not-ansible.md

(limited to 'doc')

diff --git a/doc/adding-nodes.md b/doc/adding-nodes.md
new file mode 100644
index 0000000..24b409c
--- /dev/null
+++ b/doc/adding-nodes.md
@@ -0,0 +1,30 @@
+# Adding nodes
+
+## How to create files for a new zone
+
+*This documentation is written for the production cluster; the same applies to other clusters.*
+
+Basically:
+  - Create your `site` file in the `cluster/prod/site/` folder
+  - Create your `node` files in the `cluster/prod/node/` folder
+  - Add your WireGuard configuration to `cluster/prod/cluster.nix`
+    - You will have to edit your NAT config manually to bind one public IPv4 port to each node
+    - Nodes' public WireGuard keys are generated during the first run of `deploy_nixos`, see below
+  - Add your nodes to `cluster/prod/ssh_config`; it is used by the various SSH scripts.
+    - If you use `ssh` directly, use `ssh -F ./cluster/prod/ssh_config`
+    - Add `User root` for the first connection, as your user will not yet be declared on the system
+
+## How to deploy a Nix configuration on a fresh node
+
+We assume that the node is named `datura`.
+Start by deploying one node at a time; you will have plenty of time
+in your operator's life to break everything through automation.
+
+Run:
+  - `./deploy_nixos prod datura` - to deploy the Nix configuration file
+    - a new WireGuard key is printed if none had been generated before; it has to be
+      added to `cluster.nix`, which must then be redeployed on all nodes, as the new WireGuard configuration is needed everywhere
+  - `./deploy_passwords prod datura` - to deploy users' passwords
+    - if a user changes their password (using `./passwd`), it must be redeployed on all nodes
+  - `./deploy_pki prod datura` - to deploy Nomad's and Consul's PKI
+
diff --git a/doc/architecture.md b/doc/architecture.md
index 8a9579f..ee83dca 100644
--- a/doc/architecture.md
+++ b/doc/architecture.md
@@ -1,4 +1,4 @@
-# Additional README
+# Overall architecture
 
 ## Configuring the OS
 
@@ -15,6 +15,7 @@ All deployment scripts can use the following parameters passed as environment va
 - `SUDO_PASS`: optionnally, the password for `sudo` on cluster nodes. If not set, it will be asked at the begninning.
 - `SSH_USER`: optionnally, the user to try to login using SSH. If not set, the username from your local machine will be used.
 
+
 ### Assumptions (how to setup your environment)
 
 - you have an SSH access to all of your cluster nodes (listed in `cluster/<cluster_name>/ssh_config`)
@@ -25,6 +26,7 @@ All deployment scripts can use the following parameters passed as environment va
 - you have a clone of the secrets repository in your `pass` password store, for instance at `~/.password-store/deuxfleurs`
   (scripts in this repo will read and write all secrets in `pass` under `deuxfleurs/cluster/<cluster_name>/`)
 
+
 ### Deploying the NixOS configuration
 
 The NixOS configuration makes use of a certain number of files:
@@ -48,12 +50,9 @@ or to deploy only on a single node:
 
 To upgrade NixOS, use the `./upgrade_nixos` script instead (it has the same syntax).
 
-**When adding a node to the cluster:** just do `./deploy_nixos <cluster_name> <name_of_new_node>`
 
 ### Generating and deploying a PKI for Consul and Nomad
 
-This is very similar to how we do for Wesher.
-
 First, if the PKI has not yet been created, create it with:
 
 ```
@@ -66,7 +65,8 @@ Then, deploy the PKI on all nodes with:
 ./deploy_pki <cluster_name>
 ```
 
-**When adding a node to the cluster:** just do `./deploy_pki <cluster_name> <name_of_new_node>`
+Note that certificates are valid for only slightly more than one year: every January, `gen_pki` and `deploy_pki` have to be re-run to generate certificates for the new year.
+
 
 ### Adding administrators and password management
 
@@ -89,6 +89,7 @@ Then, an administrator that already has root access must run the following (afte
 ./deploy_passwords <cluster_name>
 ```
 
+
 ## Deploying stuff on Nomad
 
 ### Connecting to Nomad
@@ -118,12 +119,12 @@ Stuff should be started in this order:
 1. `app/core`
 2. `app/frontend`
 3. `app/telemetry`
-4. `app/garage-staging`
+4. `app/garage`
 5. `app/directory`
 
-Then, other stuff can be started in any order:
+Then, other stuff can be started in any order, e.g.:
 
-- `app/im` (cluster `staging` only)
-- `app/cryptpad` (cluster `prod` only)
+- `app/im`
+- `app/cryptpad`
 - `app/drone-ci`
 
diff --git a/doc/onboarding.md b/doc/onboarding.md
new file mode 100644
index 0000000..b3bd264
--- /dev/null
+++ b/doc/onboarding.md
@@ -0,0 +1,45 @@
+# Onboarding / quick start for new administrators
+
+## How to welcome a new administrator
+
+See: https://guide.deuxfleurs.fr/operations/acces/pass/
+
+Basically:
+  - The new administrator generates a GPG key and publishes it on Gitea
+  - All existing administrators pull their key and sign it
+  - An existing administrator re-encrypts the keystore with this new key and pushes it
+  - The new administrator clones the repo and checks that they can decrypt the secrets
+  - Finally, the new administrator must choose a password to operate over SSH with `./passwd prod rick` where `rick` is the target username
+
+
+## How to operate a node (connect to Nomad and Consul)
+
+Edit your `~/.ssh/config` file with content such as the following:
+
+```
+Host dahlia
+  HostName dahlia.machine.deuxfleurs.fr
+  LocalForward 14646 127.0.0.1:4646
+  LocalForward 8501 127.0.0.1:8501
+  LocalForward 1389 bottin.service.prod.consul:389
+  LocalForward 5432 psql-proxy.service.prod.consul:5432
+```
+
+Then run the TLS proxy and leave it running:
+
+```
+./tlsproxy prod
+```
+
+SSH to a production machine (e.g. dahlia) and leave it running:
+
+```
+ssh dahlia
+```
+
+
+Finally, you should be able to access the production Nomad and Consul by browsing:
+
+ - Consul: http://localhost:8500
+ - Nomad: http://localhost:4646
+
diff --git a/doc/quick-start.md b/doc/quick-start.md
deleted file mode 100644
index 1307fde..0000000
--- a/doc/quick-start.md
+++ /dev/null
@@ -1,73 +0,0 @@
-# Quick start
-
-## How to welcome a new administrator
-
-See: https://guide.deuxfleurs.fr/operations/acces/pass/
-
-Basically:
-  - The new administrator generates a GPG key and publishes it on Gitea
-  - All existing administrators pull their key and sign it
-  - An existing administrator reencrypt the keystore with this new key and push it
-  - The new administrator clone the repo and check that they can decrypt the secrets
-  - Finally, the new administrator must choose a password to operate over SSH with `./passwd prod rick` where `rick` is the target username
-
-
-## How to create files for a new zone
-
-*The documentation is written for the production cluster, the same apply for other clusters.*
-
-Basically:
-  - Create your `site` file in `cluster/prod/site/` folder
-  - Create your `node` files in `cluster/prod/node/` folder
-  - Add your wireguard configuration to `cluster/prod/cluster.nix`
-    - You will have to edit your NAT config manually to bind one public IPv4 port to each node
-    - Nodes' public wireguard keys are generated during the first run of `deploy_nixos`, see below
-  - Add your nodes to `cluster/prod/ssh_config`, it will be used by the various SSH scripts.
-    - If you use `ssh` directly, use `ssh -F ./cluster/prod/ssh_config`
-    - Add `User root` for the first time as your user will not be declared yet on the system
-
-## How to deploy a Nix configuration on a fresh node
-
-We suppose that the node name is `datura`. 
-Start by doing the deployment one node at a time, you will have plenty of time
-in your operator's life to break everything through automation.
-
-Run:
-  - `./deploy_nixos prod datura` - to deploy the nix configuration file;
-   - a new wireguard key is printed if it hadn't been generated before, it has to be
-     added to `cluster.nix`, and then redeployed on all nodes as the new wireguard conf is needed everywhere
-  - `./deploy_passwords prod datura` - to deploy user's passwords
-   - if a user changes their password (using `./passwd`), needs to be redeployed on all nodes to setup the password on all nodes
-  - `./deploy_pki prod datura` - to deploy Nomad's and Consul's PKI
-
-## How to operate a node
-
-Edit your `~/.ssh/config` file:
-
-```
-Host dahlia
-  HostName dahlia.machine.deuxfleurs.fr
-  LocalForward 14646 127.0.0.1:4646
-  LocalForward 8501 127.0.0.1:8501
-  LocalForward 1389 bottin.service.prod.consul:389
-  LocalForward 5432 psql-proxy.service.prod.consul:5432
-```
-
-Then run the TLS proxy and leave it running:
-
-```
-./tlsproxy prod
-```
-
-SSH to a production machine (e.g. dahlia) and leave it running:
-
-```
-ssh dahlia
-```
-
-
-Finally you should see be able to access the production Nomad and Consul by browsing: 
-
- - Consul: http://localhost:8500
- - Nomad: http://localhost:4646
-
diff --git a/doc/why-not-ansible.md b/doc/why-not-ansible.md
new file mode 100644
index 0000000..6c8be55
--- /dev/null
+++ b/doc/why-not-ansible.md
@@ -0,0 +1,37 @@
+# Why not Ansible?
+
+I often get asked why we don't use Ansible to deploy to remote machines, as
+this looks like a typical use case.  There are many reasons, which basically
+boil down to "I really don't like Ansible":
+
+- Ansible tries to do declarative system configuration, but doesn't do it
+  correctly at all, unlike Nix.  Example: in NixOS, to undo something you've
+  done, just comment out the corresponding lines and redeploy.
+
+- Ansible is massive overkill for what we're trying to do here: we're just
+  copying a few small files and running some basic commands, leaving the rest
+  to NixOS.
+
+- YAML is a pain to manipulate as soon as you have more than two or three
+  indentation levels.  Also, why in hell would you want to write loops and
+  conditions in YAML when you could use a proper expression language?
+
+- Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of
+  directories and files which I don't want.
+
+- Ansible is probably not flexible enough to do what we want, at least not
+  without getting a migraine when trying. For example, its inventory
+  management is too simple to account for the heterogeneity of our cluster
+  nodes while still retaining a level of organization (some configuration
+  options are defined cluster-wide, some are defined for each site - physical
+  location - we deploy on, and some are specific to each node).
+
+- I never remember Ansible's command line flags.
+
+- My distribution's package for Ansible takes almost 400MB once installed,
+  WTF???  By not depending on it, we're reducing the set of tools we need to
+  deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat,
+  [pass](https://www.passwordstore.org/) (and the Consul and Nomad binaries
+  which are, I'll admit, not small).
+
+
-- 
cgit v1.2.3