Commit 09dc5e6

Merge pull request #1086 from input-output-hk/jpraynaud/add-production-runbook
Add network production runbooks for Aggregator
2 parents f02a536 + d166b86 commit 09dc5e6

9 files changed

+350
-1
lines changed

docs/runbook/README.md

+19
@@ -0,0 +1,19 @@
# Mithril network runbook :shield:

This page gathers the available guides to operate a Mithril network.

:fire: These guides are intended for expert users: the operations they describe can cause irreversible damage or loss for a network.

# Guides

| Operation | Location | Description
|------------|------------|------------
| **Genesis manually** | [manual-genesis](./genesis-manually/README.md) | Perform a manual (re)genesis of the aggregator certificate chain.
| **Era markers** | [era-markers](./era-markers/README.md) | Create and update era markers on the Cardano chain.
| **Signer registrations monitoring** | [registrations-monitoring](./registrations-monitoring/README.md) | Gather aggregated data about signer registrations (versions, stake, ...).
| **Update protocol parameters** | [protocol-parameters](./protocol-parameters/README.md) | Update the protocol parameters of a Mithril network.
| **Recompute certificates hash** | [recompute-certificates-hash](./recompute-certificates-hash/README.md) | Recompute the certificates hash of an aggregator.
| **Fix terraform lock** | [terraform-lock](./terraform-lock/README.md) | Fix a terraform lock in CD workflows.
| **Manage SSH access to infrastructure** | [ssh-access](./ssh-access/README.md) | Manage SSH access on the VM of the infrastructure for a user.

+91
@@ -0,0 +1,91 @@
# Manual genesis of production Mithril network

## Configure environment variables
Export the environment variables:
```bash
export MITHRIL_VM=**MITHRIL_VM**
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Here is an example for the `release-mainnet` network:
```bash
export MITHRIL_VM=aggregator.release-mainnet.api.mithril.network
export CARDANO_NETWORK=mainnet
```
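
Every later command depends on these two variables, so it can help to fail early when one of them is missing. The following is a minimal sketch, assuming a POSIX-compatible shell; `require_env` is a hypothetical helper and is not part of the Mithril tooling:
```bash
# Hypothetical helper (not part of the Mithril tooling): return non-zero
# if any of the named environment variables is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Check the two variables used throughout this runbook.
if require_env MITHRIL_VM CARDANO_NETWORK; then
  echo "environment is ready"
fi
```

The same check can be repeated after each `ssh` connection, since the variables exported locally are not available in the remote session.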

## Export the genesis payload to sign

Connect to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Once connected to the aggregator VM, export the environment variable:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Then create the genesis directory:
```bash
mkdir -p /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis
```
Then connect to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, export the genesis payload to sign:
```bash
/app/bin/mithril-aggregator -vvv genesis export --target-path /mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt
```

Then disconnect from the aggregator container:
```bash
exit
```

Then disconnect from the aggregator VM:
```bash
exit
```

## Sign the genesis payload

Once on your local machine, copy the genesis payload to sign from the aggregator VM:
```bash
scp curry@$MITHRIL_VM:/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt .
```

Download or build the aggregator on your local machine as explained in this [documentation](https://mithril.network/doc/manual/developer-docs/nodes/mithril-aggregator#download-source).

Then, sign the payload with the genesis secret key:
```bash
./mithril-aggregator -vvv genesis sign --to-sign-payload-path genesis-payload-to-sign.txt --target-signed-payload-path genesis-payload-signed.txt --genesis-secret-key-path genesis.sk
```
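
Before copying anything back to the VM, it may be worth confirming that both payload files exist locally and are not empty. This is only a sketch: `check_payload_files` is a hypothetical helper, and it checks file presence and size only, not the validity of the signature:
```bash
# Hypothetical helper: succeed only if every given file exists and is non-empty.
check_payload_files() {
  for file in "$@"; do
    if [ ! -s "$file" ]; then
      echo "missing or empty file: $file" >&2
      return 1
    fi
  done
}

# File names are the ones used in the commands of this section.
if check_payload_files genesis-payload-to-sign.txt genesis-payload-signed.txt; then
  echo "payload files are present"
fi
```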

## Import the signed genesis payload

Copy the signed genesis payload back to the aggregator VM:
```bash
scp ./genesis-payload-signed.txt curry@$MITHRIL_VM:/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-signed.txt
```

Then, connect back to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Export the environment variable:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Then connect back to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, import the signed genesis payload:
```bash
/app/bin/mithril-aggregator -vvv genesis import --signed-payload-path /mithril-aggregator/mithril/genesis/genesis-payload-signed.txt
```
+71
@@ -0,0 +1,71 @@
# Update the protocol parameters of a Mithril network

## Introduction

The protocol parameters of a network are currently defined when starting the aggregator of the network.
During startup, the aggregator saves the parameters in its stores and starts using them **3** epochs later. The aggregator broadcasts the protocol parameters to the signers of the network through the `/epoch-settings` route.

## Update parameters of a Mithril network
The aggregator sets the protocol parameters through the `protocol_parameters` configuration parameter, which is a JSON representation of the `ProtocolParameters` type:
```rust
pub struct ProtocolParameters {
    /// Quorum parameter
    pub k: u64,

    /// Security parameter (number of lotteries)
    pub m: u64,

    /// f in phi(w) = 1 - (1 - f)^w, where w is the stake of a participant
    pub phi_f: f64,
}
```
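
To get a feel for the formula quoted in the `phi_f` comment, it can be evaluated numerically. The sketch below uses `awk` for the floating-point arithmetic; `phi` is a hypothetical helper, and treating `w` as a fraction of the total stake (between 0 and 1) is an assumption, since the comment only says "the stake of a participant":
```bash
# Hypothetical helper: evaluate phi(w) = 1 - (1 - f)^w with awk.
# Arguments: f (the phi_f parameter) and w (the stake of a participant,
# assumed here to be a fraction of the total stake).
phi() {
  LC_ALL=C awk -v f="$1" -v w="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - f) ^ w }'
}

# phi_f = 0.2 with half of the stake, then with all of the stake.
phi 0.2 0.5
phi 0.2 1
```

With `w = 1` the formula reduces to `f` itself, which gives a quick sanity check on the helper.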

Each parameter can also be set via an environment variable:
- `PROTOCOL_PARAMETERS__K` for `k`
- `PROTOCOL_PARAMETERS__M` for `m`
- `PROTOCOL_PARAMETERS__PHI_F` for `phi_f`
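
As an illustration, the three variables can be exported as follows, and the matching JSON representation printed for comparison. The values `5`, `100` and `0.6` are the ones used in the deployment example of this runbook; `protocol_parameters_json` is a hypothetical helper added only for this sketch:
```bash
# Example values, taken from the deployment example of this runbook.
export PROTOCOL_PARAMETERS__K=5
export PROTOCOL_PARAMETERS__M=100
export PROTOCOL_PARAMETERS__PHI_F=0.6

# Hypothetical helper: print the JSON representation of the same parameters.
protocol_parameters_json() {
  printf '{"k": %s, "m": %s, "phi_f": %s}\n' \
    "$PROTOCOL_PARAMETERS__K" "$PROTOCOL_PARAMETERS__M" "$PROTOCOL_PARAMETERS__PHI_F"
}

protocol_parameters_json
```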

When setting up a Mithril network with a `terraform` deployment, the protocol parameters are set with a JSON definition.

## Find the workflow used to deploy a Mithril network

Currently, the following [Mithril networks](https://mithril.network/doc/manual/developer-docs/references#mithril-networks) are generally available, and deployed with `terraform`:
- `testing-preview`: with the workflow [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml)
- `pre-release-preview`: with the workflow [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml)
- `release-preprod`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
- `release-mainnet`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
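
The mapping above can be captured in a small lookup, which is handy when scripting around several networks. This is a sketch only: `workflow_for_network` is a hypothetical helper, and it simply restates the list above:
```bash
# Hypothetical helper: print the workflow file that deploys a given Mithril network.
workflow_for_network() {
  case "$1" in
    testing-preview)
      echo ".github/workflows/ci.yml" ;;
    pre-release-preview)
      echo ".github/workflows/pre-release.yml" ;;
    release-preprod|release-mainnet)
      echo ".github/workflows/release.yml" ;;
    *)
      echo "unknown Mithril network: $1" >&2
      return 1 ;;
  esac
}

workflow_for_network release-mainnet
```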

## Update the protocol parameters

In the deployment matrix of the workflow, update the following value for the targeted network with the new parameters:
```yaml
mithril_protocol_parameters: |
  {
    k     = 5
    m     = 100
    phi_f = 0.6
  }
```

For example, it could be replaced with:
```yaml
mithril_protocol_parameters: |
  {
    k     = 2422
    m     = 20973
    phi_f = 0.2
  }
```

The modifications should be made in a dedicated PR, and the output of the **Plan** job of the terraform deployment should be reviewed carefully to make sure that the change has been taken into account.

## Deployment of the new protocol parameters

The update of the new protocol parameters will take place as detailed in the following table:
| Workflow | Deployed at | Effective at
|------------|------------|------------
| [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml) | Merge on `main` branch | **3** epochs later
| [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml) | Pre-release of a distribution | **3** epochs later
| [`.github/workflows/release.yml`](../../../.github/workflows/release.yml) | Release of a distribution | **3** epochs later
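
The "Effective at" column is a fixed offset from the epoch in which the deployment happens. A minimal sketch of that arithmetic, where `effective_epoch` is a hypothetical helper and epoch `450` is an arbitrary example value:
```bash
# Number of epochs between the deployment and the new parameters taking effect.
EPOCH_OFFSET=3

# Hypothetical helper: print the epoch at which parameters deployed
# during the given epoch become effective.
effective_epoch() {
  echo $(( $1 + EPOCH_OFFSET ))
}

# Example with an arbitrary deployment epoch.
effective_epoch 450
```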

For more information about the CD, please refer to [Release process and versioning](https://mithril.network/doc/adr/3).
@@ -0,0 +1,92 @@
# Recompute the certificates hashes of Mithril aggregator

## Configure environment variables
Export the environment variables:
```bash
export MITHRIL_VM=**MITHRIL_VM**
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Here is an example for the `release-mainnet` network:
```bash
export MITHRIL_VM=aggregator.release-mainnet.api.mithril.network
export CARDANO_NETWORK=mainnet
```

## Make a backup of the aggregator database

Connect to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Once connected to the aggregator VM, export the environment variable:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Then copy the SQLite database file `aggregator.sqlite3` to a dated backup:
```bash
cp /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3 /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3.bak.$(date +%Y-%m-%d)
```
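
Because the rollback depends entirely on this backup, it can be worth verifying the copy right away. The sketch below wraps the copy and a byte-for-byte comparison in one function; `backup_file` is a hypothetical helper, not part of the Mithril tooling:
```bash
# Hypothetical helper: copy a file to <file>.bak.<YYYY-MM-DD>, verify that the
# copy is identical to the source, and print the backup path on success.
backup_file() {
  src=$1
  dst="$src.bak.$(date +%Y-%m-%d)"
  cp "$src" "$dst" && cmp -s "$src" "$dst" && echo "$dst"
}

# Usage on the aggregator VM (path taken from the step above):
# backup_file /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3
```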

Then connect to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, recompute the certificates hashes:
```bash
/app/bin/mithril-aggregator -vvv tools recompute-certificates-hash
```

Then disconnect from the aggregator container:
```bash
exit
```

## Restart the aggregator

Restart the aggregator so that it validates the certificate chain again:
```bash
docker restart mithril-aggregator
```

Make sure that the certificate chain is valid (wait for the state machine to reach the `READY` state):
```bash
docker logs -f --tail 1000 mithril-aggregator
```

Then disconnect from the aggregator VM:
```bash
exit
```

## Rollback procedure

If the recomputation fails, you can roll back the database.

First, stop the aggregator:
```bash
docker stop mithril-aggregator
```

Then, restore the backed up database:
```bash
# The date suffix must match the day the backup was made; adjust it if the backup was not taken today.
cp /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3.bak.$(date +%Y-%m-%d) /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3
```
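
If the rollback happens on a later day than the backup, `$(date +%Y-%m-%d)` no longer matches the backup file name. One way around this is to pick the most recent backup by name. This is a sketch under that assumption; `latest_backup` is a hypothetical helper:
```bash
# Hypothetical helper: print the most recent <file>.bak.<YYYY-MM-DD> backup.
# Glob results are sorted, and the ISO date suffix sorts chronologically,
# so the last match is the latest backup.
latest_backup() {
  latest=
  for candidate in "$1".bak.*; do
    [ -e "$candidate" ] || continue
    latest=$candidate
  done
  [ -n "$latest" ] && echo "$latest"
}

# Usage on the aggregator VM (path taken from the step above):
# db=/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3
# cp "$(latest_backup "$db")" "$db"
```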

Then, start the aggregator:
```bash
docker start mithril-aggregator
```

Make sure that the certificate chain is valid (wait for the state machine to reach the `READY` state):
```bash
docker logs -f --tail 1000 mithril-aggregator
```

Then disconnect from the aggregator VM:
```bash
exit
```

mithril-aggregator/utils/monitoring/README.md renamed to docs/runbook/registrations-monitoring/README.md

+1-1
@@ -9,7 +9,7 @@ query for that.
 ```sh
 $> sqlite3 -table -batch \
 $DATA_STORES_DIRECTORY/monitoring.sqlite3 \
-< mithril-aggregator/utils/monitoring/stake_signer_version.sql
+< stake_signer_version.sql
 ```

 The variable `$DATA_STORES_DIRECTORY` should point to the directory where the

docs/runbook/ssh-access/README.md

+51
@@ -0,0 +1,51 @@
# Manage SSH access to infrastructure

## Add access to a user

### Create an SSH keypair for a user (if needed)

Create a new SSH keypair using the `ed25519` algorithm:
```bash
ssh-keygen -t ed25519 -C "your_email@example.com"
```

Then, add your keypair to the ssh-agent:
```bash
ssh-add ~/.ssh/id_ed25519
```

### Retrieve the public key of your SSH keypair

Run the following command to retrieve your public key:
```bash
cat ~/.ssh/id_ed25519.pub
```

### Declare the public key

Add one line per public key, with the format `**REMOTE_USER**:**PUBLIC_KEY**`, to the `mithril-infra/assets/ssh_keys` file:
```bash
echo "curry:ssh-ed25519 AAAE53AC3NzQ2vlZDI1aC1O4CpX+S2y1X9NTB4rv4k3pAAAAIF3b7L9sPV5ZiGgogmko your_email@example.com" >> **REPOSITORY_PATH**/mithril-infra/assets/ssh_keys
```
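
A malformed line is easy to miss in review, so a quick format check before opening the PR may help. The sketch below is an assumption-laden example: `is_valid_ssh_keys_line` is a hypothetical helper, and the pattern only accepts `ssh-ed25519` keys with a lowercase user name, which is stricter than the `**REMOTE_USER**:**PUBLIC_KEY**` format requires:
```bash
# Hypothetical check: accept lines shaped like
# "user:ssh-ed25519 <base64 key> [optional comment]".
is_valid_ssh_keys_line() {
  printf '%s\n' "$1" |
    grep -Eq '^[a-z_][a-z0-9_-]*:ssh-ed25519 [A-Za-z0-9+/]+=*( .*)?$'
}

# Example with the line used in the step above.
line="curry:ssh-ed25519 AAAE53AC3NzQ2vlZDI1aC1O4CpX+S2y1X9NTB4rv4k3pAAAAIF3b7L9sPV5ZiGgogmko your_email@example.com"
if is_valid_ssh_keys_line "$line"; then
  echo "line format looks valid"
fi
```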

Then, create a PR with the updated `ssh_keys` file.

## Remove access to a user

To remove a user's access, remove the line(s) related to this user from the `ssh_keys` file.

Then, create a PR with the updated `ssh_keys` file.

## When are the modifications applied?

The modifications are applied the next time the terraform deployment runs:
- next **merge** in `main` branch for `testing-preview`
- next **pre-release** created for `pre-release-preview`
- next **release** created for `release-preprod`
- next **release** created for `release-mainnet`

When the modifications are applied, the VM is updated in place by terraform.

:warning: In case of emergency, the SSH keys can be modified by an administrator:
- In GCP [**Compute Engine**](https://console.cloud.google.com/compute/instances)
- The SSH keys can be edited in the targeted VM(s)

docs/runbook/terraform-lock/README.md

+25
@@ -0,0 +1,25 @@
# Fix terraform deployment lock

## Introduction

When the CI cancels a job in the middle of a terraform deployment, the lock file that terraform uses under the hood to prevent concurrent deployments may not be removed. In that case, the next CI job that tries to deploy fails with an error stating that a lock prevents the deployment from running.

## Find the workflow used to deploy a Mithril network

Currently, the following [Mithril networks](https://mithril.network/doc/manual/developer-docs/references#mithril-networks) are generally available, and deployed with `terraform`:
- `testing-preview`: with the workflow [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml)
- `pre-release-preview`: with the workflow [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml)
- `release-preprod`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
- `release-mainnet`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)


## Identify the terraform backend bucket
In the workflow file, the `terraform_backend_bucket` value gives the GCP bucket that terraform uses to store the state of the deployment.

## Reset the terraform lock

A user with administrator rights can remove the lock file:
- In GCP [**Cloud Storage**](https://console.cloud.google.com/storage/browser)
- In the terraform backend bucket identified earlier, the file that needs to be removed is at path `**TERRAFORM_BACKEND_BUCKET**/terraform/mithril-**MITHRIL_NETWORK_IDENTIFIER**/.terraform.lock.hcl` (e.g. `mithril-terraform-prod/terraform/mithril-release-mainnet/.terraform.lock.hcl`)

:warning: Never delete or modify the `**TERRAFORM_BACKEND_BUCKET**/terraform/mithril-**MITHRIL_NETWORK_IDENTIFIER**/default.tfstate` file.
