From 0e1574a82b7067910d5403cfd46e94bcf929327a Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Thu, 22 Dec 2022 23:44:00 +0100
Subject: More doc reorganization

---
 README.md              | 47 +++-----------------------------
 doc/adding-nodes.md    | 30 ++++++++++++++++++++
 doc/architecture.md    | 19 ++++++-------
 doc/onboarding.md      | 45 +++++++++++++++++++++++++++++++
 doc/quick-start.md     | 73 --------------------------------------------------
 doc/why-not-ansible.md | 37 +++++++++++++++++++++++++
 6 files changed, 126 insertions(+), 125 deletions(-)
 create mode 100644 doc/adding-nodes.md
 create mode 100644 doc/onboarding.md
 delete mode 100644 doc/quick-start.md
 create mode 100644 doc/why-not-ansible.md

diff --git a/README.md b/README.md
index 9514084..c86a067 100644
--- a/README.md
+++ b/README.md
@@ -12,54 +12,15 @@ It sets up the following:
 
 See the following documentation topics:
 
-- [Quick start for adding new nodes after NixOS install](doc/quick-start.md)
+- [Quick start and onboarding for new administrators](doc/onboarding.md)
+- [How to add new nodes to a cluster (rapid overview)](doc/adding-nodes.md)
 - [Architecture of this repo, how the scripts work](doc/architecture.md)
 - [List of TCP and UDP ports used by services](doc/ports)
 
 Additionnal documentation topics:
 
-- [Succint guide for NixOS installation with LUKX full disk encryption](doc/nixos-install.md) (we don't do that in practice on our servers)
+- [Succinct guide for NixOS installation with LUKS full disk encryption](doc/nixos-install-luks.md) (we don't do that in practice on our servers)
 - [Example `hardware-config.nix` for a full disk encryption scenario](doc/example-hardware-configuration.nix)
+- [Why not Ansible?](doc/why-not-ansible.md)
 
-## Why not Ansible?
-
-I often get asked why not use Ansible to deploy to remote machines, as this
-would look like a typical use case. There are many reasons, which basically
-boil down to "I really don't like Ansible":
-
-- Ansible tries to do declarative system configuration, but doesn't do it
-  correctly at all, like Nix does. Example: in NixOS, to undo something you've
-  done, just comment the corresponding lines and redeploy.
-
-- Ansible is massive overkill for what we're trying to do here, we're just
-  copying a few small files and running some basic commands, leaving the rest
-  to NixOS.
-
-- YAML is a pain to manipulate as soon as you have more than two or three
-  indentation levels. Also, why in hell would you want to write loops and
-  conditions in YAML when you could use a proper expression language?
-
-- Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of
-  directories and files which I don't want.
-
-- Ansible is probably not flexible enough to do what we want, at least not
-  without getting a migraine when trying. For example, it's inventory
-  management is too simple to account for the heterogeneity of our cluster
-  nodes while still retaining a level of organization (some configuration
-  options are defined cluster-wide, some are defined for each site - physical
-  location - we deploy on, and some are specific to each node).
-
-- I never remember Ansible's command line flags.
-
-- My distribution's package for Ansible takes almost 400MB once installed,
-  WTF??? By not depending on it, we're reducing the set of tools we need to
-  deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat,
-  [pass](https://www.passwordstore.org/) (and the Consul and Nomad binaries
-  which are, I'll admit, not small).
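+
+As an illustration, a minimal `cluster/prod/ssh_config` entry for a
+hypothetical node `datura` could look like this (the host name and the
+NAT-forwarded SSH port are made-up values, use your node's actual public
+address and port):
+
+```
+Host datura
+    HostName datura.example.com
+    Port 2222
+    User root
+```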
-
-
-## More
-
-Please read README.more.md for more detailed information
-
diff --git a/doc/adding-nodes.md b/doc/adding-nodes.md
new file mode 100644
index 0000000..24b409c
--- /dev/null
+++ b/doc/adding-nodes.md
@@ -0,0 +1,30 @@
+# Quick start
+
+## How to create files for a new zone
+
+*The documentation is written for the production cluster; the same applies to other clusters.*
+
+Basically:
+ - Create your `site` file in the `cluster/prod/site/` folder
+ - Create your `node` files in the `cluster/prod/node/` folder
+ - Add your wireguard configuration to `cluster/prod/cluster.nix`
+   - You will have to edit your NAT config manually to bind one public IPv4 port to each node
+   - Nodes' public wireguard keys are generated during the first run of `deploy_nixos`, see below
+ - Add your nodes to `cluster/prod/ssh_config`; it will be used by the various SSH scripts (see the example entry below)
+   - If you use `ssh` directly, use `ssh -F ./cluster/prod/ssh_config`
+   - Add `User root` for the first time, as your user will not be declared on the system yet
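+
+## How to deploy a Nix configuration on a fresh node
+
+We suppose that the node name is `datura`.
+Start by doing the deployment one node at a time; you will have plenty of time
+in your operator's life to break everything through automation.
+
+Run:
+ - `./deploy_nixos prod datura` - to deploy the nix configuration file;
+   - a new wireguard key is printed if one had not been generated before; it has to be
+     added to `cluster.nix` and then redeployed on all nodes, as the new wireguard conf is needed everywhere
+ - `./deploy_passwords prod datura` - to deploy users' passwords
+   - if a user changes their password (using `./passwd`), this has to be redeployed on all nodes to set up the password everywhere
+ - `./deploy_pki prod datura` - to deploy Nomad's and Consul's PKI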
diff --git a/doc/architecture.md b/doc/architecture.md
index 8a9579f..ee83dca 100644
--- a/doc/architecture.md
+++ b/doc/architecture.md
@@ -1,4 +1,4 @@
-# Additional README
+# Overall architecture
 
 ## Configuring the OS
 
@@ -15,6 +15,7 @@ All deployment scripts can use the following parameters passed as environment va
 - `SUDO_PASS`: optionnally, the password for `sudo` on cluster nodes. If not set, it will be asked at the begninning.
 - `SSH_USER`: optionnally, the user to try to login using SSH. If not set, the username from your local machine will be used.
 
+
 ### Assumptions (how to setup your environment)
 
 - you have an SSH access to all of your cluster nodes (listed in `cluster/<cluster_name>/ssh_config`)
@@ -25,6 +26,7 @@ All deployment scripts can use the following parameters passed as environment va
 - you have a clone of the secrets repository in your `pass` password store, for instance at `~/.password-store/deuxfleurs`
   (scripts in this repo will read and write all secrets in `pass` under `deuxfleurs/cluster/<cluster_name>/`)
 
+
 ### Deploying the NixOS configuration
 
 The NixOS configuration makes use of a certain number of files:
@@ -48,12 +50,9 @@ or to deploy only on a single node:
 
 To upgrade NixOS, use the `./upgrade_nixos` script instead (it has the same syntax).
 
-**When adding a node to the cluster:** just do `./deploy_nixos <cluster_name> <node_name>`
 
 ### Generating and deploying a PKI for Consul and Nomad
 
-This is very similar to how we do for Wesher.
-
 First, if the PKI has not yet been created, create it with:
 
 ```
 ./gen_pki <cluster_name>
 ```
 
 Then, deploy the PKI on all nodes with:
 
 ```
 ./deploy_pki <cluster_name>
 ```
 
-**When adding a node to the cluster:** just do `./deploy_pki <cluster_name> <node_name>`
+Note that certificates are valid for not much more than one year: every year in January, `gen_pki` and `deploy_pki` have to be re-run to generate certificates for the new year.
+
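+As a sketch, the yearly renewal therefore amounts to re-running both scripts
+(using `prod` as the cluster name, and assuming both scripts take the cluster
+name as their argument, as shown above):
+
+```
+./gen_pki prod
+./deploy_pki prod
+```
+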
 
 ### Adding administrators and password management
 
@@ -89,6 +89,7 @@ Then, an administrator that already has root access must run the following (afte
 ./deploy_passwords <cluster_name>
 ```
 
+
 ## Deploying stuff on Nomad
 
 ### Connecting to Nomad
 
@@ -118,12 +119,12 @@ Stuff should be started in this order:
 
 1. `app/core`
 2. `app/frontend`
 3. `app/telemetry`
-4. `app/garage-staging`
+4. `app/garage`
 5. `app/directory`
 
-Then, other stuff can be started in any order:
+Then, other stuff can be started in any order, e.g.:
 
-- `app/im` (cluster `staging` only)
-- `app/cryptpad` (cluster `prod` only)
+- `app/im`
+- `app/cryptpad`
 - `app/drone-ci`
diff --git a/doc/onboarding.md b/doc/onboarding.md
new file mode 100644
index 0000000..b3bd264
--- /dev/null
+++ b/doc/onboarding.md
@@ -0,0 +1,45 @@
+# Onboarding / quick start for new administrators
+
+## How to welcome a new administrator
+
+See: https://guide.deuxfleurs.fr/operations/acces/pass/
+
+Basically:
+ - The new administrator generates a GPG key and publishes it on Gitea
+ - All existing administrators pull their key and sign it
+ - An existing administrator re-encrypts the keystore with this new key and pushes it
+ - The new administrator clones the repo and checks that they can decrypt the secrets
+ - Finally, the new administrator must choose a password to operate over SSH, with `./passwd prod rick` where `rick` is the target username
+
+
+## How to operate a node (connect to Nomad and Consul)
+
+Edit your `~/.ssh/config` file with content such as the following:
+
+```
+Host dahlia
+    HostName dahlia.machine.deuxfleurs.fr
+    LocalForward 14646 127.0.0.1:4646
+    LocalForward 8501 127.0.0.1:8501
+    LocalForward 1389 bottin.service.prod.consul:389
+    LocalForward 5432 psql-proxy.service.prod.consul:5432
+```
+
+Then run the TLS proxy and leave it running:
+
+```
+./tlsproxy prod
+```
+
+SSH to a production machine (e.g. `dahlia`) and leave it running:
+
+```
+ssh dahlia
+```
+
+
+Finally, you should be able to access the production Nomad and Consul by browsing:
+
+ - Consul: http://localhost:8500
+ - Nomad: http://localhost:4646
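+
+If you prefer the command-line clients over the web UIs, the same forwarded
+ports can be used (a sketch, assuming the `nomad` and `consul` binaries are
+installed on your local machine):
+
+```
+export CONSUL_HTTP_ADDR=http://localhost:8500
+export NOMAD_ADDR=http://localhost:4646
+consul members
+nomad status
+```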
diff --git a/doc/quick-start.md b/doc/quick-start.md
deleted file mode 100644
index 1307fde..0000000
--- a/doc/quick-start.md
+++ /dev/null
@@ -1,73 +0,0 @@
-# Quick start
-
-## How to welcome a new administrator
-
-See: https://guide.deuxfleurs.fr/operations/acces/pass/
-
-Basically:
- - The new administrator generates a GPG key and publishes it on Gitea
- - All existing administrators pull their key and sign it
- - An existing administrator reencrypt the keystore with this new key and push it
- - The new administrator clone the repo and check that they can decrypt the secrets
- - Finally, the new administrator must choose a password to operate over SSH with `./passwd prod rick` where `rick` is the target username
-
-
-## How to create files for a new zone
-
-*The documentation is written for the production cluster, the same apply for other clusters.*
-
-Basically:
- - Create your `site` file in `cluster/prod/site/` folder
- - Create your `node` files in `cluster/prod/node/` folder
- - Add your wireguard configuration to `cluster/prod/cluster.nix`
- - You will have to edit your NAT config manually to bind one public IPv4 port to each node
- - Nodes' public wireguard keys are generated during the first run of `deploy_nixos`, see below
- - Add your nodes to `cluster/prod/ssh_config`, it will be used by the various SSH scripts.
- - If you use `ssh` directly, use `ssh -F ./cluster/prod/ssh_config`
- - Add `User root` for the first time as your user will not be declared yet on the system
-
-## How to deploy a Nix configuration on a fresh node
-
-We suppose that the node name is `datura`.
-Start by doing the deployment one node at a time, you will have plenty of time
-in your operator's life to break everything through automation.
-
-Run:
- - `./deploy_nixos prod datura` - to deploy the nix configuration file;
- - a new wireguard key is printed if it hadn't been generated before, it has to be
-   added to `cluster.nix`, and then redeployed on all nodes as the new wireguard conf is needed everywhere
- - `./deploy_passwords prod datura` - to deploy user's passwords
- - if a user changes their password (using `./passwd`), needs to be redeployed on all nodes to setup the password on all nodes
- - `./deploy_pki prod datura` - to deploy Nomad's and Consul's PKI
-
-## How to operate a node
-
-Edit your `~/.ssh/config` file:
-
-```
-Host dahlia
-    HostName dahlia.machine.deuxfleurs.fr
-    LocalForward 14646 127.0.0.1:4646
-    LocalForward 8501 127.0.0.1:8501
-    LocalForward 1389 bottin.service.prod.consul:389
-    LocalForward 5432 psql-proxy.service.prod.consul:5432
-```
-
-Then run the TLS proxy and leave it running:
-
-```
-./tlsproxy prod
-```
-
-SSH to a production machine (e.g. dahlia) and leave it running:
-
-```
-ssh dahlia
-```
-
-
-Finally you should see be able to access the production Nomad and Consul by browsing:
-
- - Consul: http://localhost:8500
- - Nomad: http://localhost:4646
-
diff --git a/doc/why-not-ansible.md b/doc/why-not-ansible.md
new file mode 100644
index 0000000..6c8be55
--- /dev/null
+++ b/doc/why-not-ansible.md
@@ -0,0 +1,37 @@
+# Why not Ansible?
+
+I often get asked why not use Ansible to deploy to remote machines, as this
+would look like a typical use case. There are many reasons, which basically
+boil down to "I really don't like Ansible":
+
+- Ansible tries to do declarative system configuration, but doesn't do it
+  correctly at all, like Nix does. Example: in NixOS, to undo something you've
+  done, just comment the corresponding lines and redeploy.
+
+- Ansible is massive overkill for what we're trying to do here, we're just
+  copying a few small files and running some basic commands, leaving the rest
+  to NixOS.
+
+- YAML is a pain to manipulate as soon as you have more than two or three
+  indentation levels. Also, why in hell would you want to write loops and
+  conditions in YAML when you could use a proper expression language?
+
+- Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of
+  directories and files which I don't want.
+
+- Ansible is probably not flexible enough to do what we want, at least not
+  without getting a migraine when trying. For example, its inventory
+  management is too simple to account for the heterogeneity of our cluster
+  nodes while still retaining a level of organization (some configuration
+  options are defined cluster-wide, some are defined for each site - physical
+  location - we deploy on, and some are specific to each node).
+
+- I never remember Ansible's command line flags.
+
+- My distribution's package for Ansible takes almost 400MB once installed,
+  WTF??? By not depending on it, we're reducing the set of tools we need to
+  deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat,
+  [pass](https://www.passwordstore.org/) (and the Consul and Nomad binaries
+  which are, I'll admit, not small).
+
+
-- 
cgit v1.2.3