Add network production runbooks for Aggregator #1086

Merged
merged 12 commits into from
Aug 10, 2023
19 changes: 19 additions & 0 deletions docs/runbook/README.md
@@ -0,0 +1,19 @@
# Mithril network runbook :shield:

This page gathers the available guides to operate a Mithril network.

:fire: These guides are intended for expert users. The operations they describe could lead to irreversible damage or loss for a network.

# Guides

| Operation | Location | Description
|------------|------------|------------
| **Genesis manually** | [manual-genesis](./genesis-manually/README.md) | Perform a manual (re)genesis of the aggregator certificate chain.
| **Era markers** | [era-markers](./era-markers/README.md) | Create and update era markers on the Cardano chain.
| **Signer registrations monitoring** | [registrations-monitoring](./registrations-monitoring/README.md) | Gather aggregated data about signer registrations (versions, stake, ...).
| **Update protocol parameters** | [protocol-parameters](./protocol-parameters/README.md) | Update the protocol parameters of a Mithril network.
| **Recompute certificates hash** | [recompute-certificates-hash](./recompute-certificates-hash/README.md) | Recompute the certificates hashes of an aggregator.
| **Fix terraform lock** | [terraform-lock](./terraform-lock/README.md) | Fix a terraform lock in CD workflows.
| **Manage SSH access to infrastructure** | [ssh-access](./ssh-access/README.md) | Manage SSH access on the VM of the infrastructure for a user.


91 changes: 91 additions & 0 deletions docs/runbook/genesis-manually/README.md
@@ -0,0 +1,91 @@
# Manual genesis of production Mithril network

## Configure environment variables
Export the environment variables:
```bash
export MITHRIL_VM=**MITHRIL_VM**
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Here is an example for the `release-mainnet` network:
```bash
export MITHRIL_VM=aggregator.release-mainnet.api.mithril.network
export CARDANO_NETWORK=mainnet
```
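
Since every later command interpolates these variables, a small guard can catch an unset value before it ends up in a path or an SSH target. The following is a minimal sketch; the `require_env` helper is hypothetical and not part of the Mithril tooling.

```bash
# Hypothetical helper: fail if any of the named variables is unset or empty.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Environment variable $name is not set" >&2
      return 1
    fi
  done
}

# Example values taken from the release-mainnet example above.
export MITHRIL_VM=aggregator.release-mainnet.api.mithril.network
export CARDANO_NETWORK=mainnet
require_env MITHRIL_VM CARDANO_NETWORK && echo "environment OK"
```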

## Export the genesis payload to sign

Connect to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Once connected to the aggregator VM, export the environment variables:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Create the genesis directory:
```bash
mkdir -p /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis
```

And connect to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, export the genesis payload to sign:
```bash
/app/bin/mithril-aggregator -vvv genesis export --target-path /mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt
```

Then disconnect from the aggregator container:
```bash
exit
```

Then disconnect from the aggregator VM:
```bash
exit
```

## Sign the genesis payload

Once on your local machine, copy the genesis payload to sign from the aggregator VM:
```bash
scp curry@$MITHRIL_VM:/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt .
```
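
Before signing, it may be worth checking that the copied payload is identical to the one exported on the VM. The sketch below prints a SHA-256 digest that can be compared with the digest computed on the VM; `file_digest` is a hypothetical helper, and it assumes `sha256sum` (GNU coreutils) is available on both machines.

```bash
# Hypothetical helper: print only the SHA-256 digest of a file.
file_digest() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Digest of the local copy, if present. Compare it with the one obtained on the VM, e.g.
#   ssh curry@$MITHRIL_VM sha256sum /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt
if [ -f genesis-payload-to-sign.txt ]; then
  file_digest genesis-payload-to-sign.txt
fi
```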

Download or build the aggregator on your local machine as explained in this [documentation](https://mithril.network/doc/manual/developer-docs/nodes/mithril-aggregator#download-source).

Then, sign the payload with the genesis secret key:
```bash
./mithril-aggregator -vvv genesis sign --to-sign-payload-path genesis-payload-to-sign.txt --target-signed-payload-path genesis-payload-signed.txt --genesis-secret-key-path genesis.sk
```

## Import the signed genesis payload

Then, copy the signed genesis payload back to the aggregator VM:
```bash
scp ./genesis-payload-signed.txt curry@$MITHRIL_VM:/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-signed.txt
```

Then, connect back to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Export the environment variable:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

And connect back to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, import the signed genesis payload:
```bash
/app/bin/mithril-aggregator -vvv genesis import --signed-payload-path /mithril-aggregator/mithril/genesis/genesis-payload-signed.txt
```
71 changes: 71 additions & 0 deletions docs/runbook/protocol-parameters/README.md
@@ -0,0 +1,71 @@
# Update the protocol parameters of a Mithril network

## Introduction

The protocol parameters of a network are currently defined when starting the aggregator of the network.
During startup, the aggregator saves the parameters in its stores and starts using them **3** epochs later. The aggregator broadcasts the protocol parameters to the signers of the network through the `/epoch-settings` route.

## Update parameters of a Mithril network
The aggregator uses the `protocol_parameters` configuration parameter to set the protocol parameters. Its value is a JSON representation of the `ProtocolParameters` type:
```rust
pub struct ProtocolParameters {
    /// Quorum parameter
    pub k: u64,

    /// Security parameter (number of lotteries)
    pub m: u64,

    /// f in phi(w) = 1 - (1 - f)^w, where w is the stake of a participant
    pub phi_f: f64,
}
```
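
To get a feel for the formula in the `phi_f` comment, it can be evaluated for a few stake values. The sketch below assumes that `w` is a participant's stake expressed as a fraction of the total stake (an assumption; the comment above only says "the stake of a participant") and uses `awk` purely as a calculator. The `phi` function is a hypothetical helper.

```bash
# Evaluate phi(w) = 1 - (1 - f)^w for a given f and stake fraction w (assumed relative).
phi() {
  awk -v f="$1" -v w="$2" 'BEGIN { printf "%.4f\n", 1 - exp(w * log(1 - f)) }'
}

phi 0.2 1     # whole stake: prints 0.2000, i.e. phi equals f
phi 0.2 0.5   # half of the stake: prints 0.1056
```

With `w = 1` the expression collapses to `f`, which is a quick sanity check on the parameter.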

Each parameter can also be set via an environment variable:
- `PROTOCOL_PARAMETERS__K` for `k`
- `PROTOCOL_PARAMETERS__M` for `m`
- `PROTOCOL_PARAMETERS__PHI_F` for `phi_f`
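
As an illustration, a set of values could be expressed with these environment variables, and the equivalent JSON representation printed for comparison. This is only a sketch: the values are examples, and the `protocol_parameters_json` helper is hypothetical. It builds the JSON by hand and is not how the aggregator itself reads its configuration.

```bash
# Example values only.
export PROTOCOL_PARAMETERS__K=5
export PROTOCOL_PARAMETERS__M=100
export PROTOCOL_PARAMETERS__PHI_F=0.6

# Hypothetical helper: print the same values as a JSON object using the
# field names of ProtocolParameters.
protocol_parameters_json() {
  printf '{"k": %s, "m": %s, "phi_f": %s}\n' \
    "$PROTOCOL_PARAMETERS__K" "$PROTOCOL_PARAMETERS__M" "$PROTOCOL_PARAMETERS__PHI_F"
}

protocol_parameters_json
```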

When setting up a Mithril network with a `terraform` deployment, the protocol parameters are set with a JSON definition.

## Find the workflow used to deploy a Mithril network

Currently, the following [Mithril networks](https://mithril.network/doc/manual/developer-docs/references#mithril-networks) are generally available, and deployed with `terraform`:
- `testing-preview`: with the workflow [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml)
- `pre-release-preview`: with the workflow [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml)
- `release-preprod`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
- `release-mainnet`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)

## Update the protocol parameters

In the deployment matrix of the workflow, update the following value for the targeted network:
```yaml
mithril_protocol_parameters: |
  {
    k = 5
    m = 100
    phi_f = 0.6
  }
```

For example, it could be replaced with:
```yaml
mithril_protocol_parameters: |
  {
    k = 2422
    m = 20973
    phi_f = 0.2
  }
```

The modifications should be submitted in a dedicated PR, and the output of the **Plan** job of the terraform deployment should be reviewed carefully to make sure that the change has been taken into account.

## Deployment of the new protocol parameters

The new protocol parameters are deployed and take effect as detailed in the following table:

| Workflow | Deployed at | Effective at
|------------|------------|------------
| [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml) | Merge on `main` branch | **3** epochs later
| [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml) | Pre-release of a distribution | **3** epochs later
| [`.github/workflows/release.yml`](../../../.github/workflows/release.yml) | Release of a distribution | **3** epochs later

For more information about the CD, please refer to [Release process and versioning](https://mithril.network/doc/adr/3).
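
The "Effective at" column is simple arithmetic on the epoch in which the aggregator was started with the new parameters. A minimal sketch, where `deployed_epoch` is a made-up value to be replaced with the actual Cardano epoch at deployment time:

```bash
# Hypothetical epoch number at which the new parameters were deployed.
deployed_epoch=420

# The parameters are used 3 epochs after they are stored.
effective_epoch=$((deployed_epoch + 3))
echo "New protocol parameters effective at epoch $effective_epoch"
```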
92 changes: 92 additions & 0 deletions docs/runbook/recompute-certificates-hash/README.md
@@ -0,0 +1,92 @@
# Recompute the certificates hashes of Mithril aggregator

## Configure environment variables
Export the environment variables:
```bash
export MITHRIL_VM=**MITHRIL_VM**
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Here is an example for the `release-mainnet` network:
```bash
export MITHRIL_VM=aggregator.release-mainnet.api.mithril.network
export CARDANO_NETWORK=mainnet
```

## Make a backup of the aggregator database

Connect to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Once connected to the aggregator VM, export the environment variables:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

And copy the SQLite database file `aggregator.sqlite3`:
```bash
cp /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3 /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3.bak.$(date +%Y-%m-%d)
```
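
Before touching the database, it can be worth confirming that the backup is a byte-for-byte copy. The following sketch wraps the copy and a `cmp` check in a hypothetical `backup_file` helper; the date-suffixed name mirrors the naming used for the backup above.

```bash
# Hypothetical helper: copy a file to <file>.bak.<date>, verify the copy,
# and print the backup path on success.
backup_file() {
  src="$1"
  dst="$src.bak.$(date +%Y-%m-%d)"
  cp "$src" "$dst" && cmp -s "$src" "$dst" && echo "$dst"
}

# Usage on the aggregator VM (not executed here):
#   backup_file /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3
```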

And connect to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, recompute the certificates hashes:
```bash
/app/bin/mithril-aggregator -vvv tools recompute-certificates-hash
```

Then disconnect from the aggregator container:
```bash
exit
```

## Restart the aggregator

Restart the aggregator to make sure that the certificate chain is valid:
```bash
docker restart mithril-aggregator
```

Make sure that the certificate chain is valid (wait for the state machine to go into the state `READY`):
```bash
docker logs -f --tail 1000 mithril-aggregator
```
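
Rather than watching the log stream by eye, the output can be piped through `grep` so that the command returns once the state appears. This is a sketch under the assumption that the relevant log line contains the literal string `READY`; the exact log format is not shown here, so adjust the pattern to what the aggregator actually prints. `first_match` is a hypothetical helper.

```bash
# Hypothetical helper: read log lines from stdin and stop at the first one
# matching the given pattern.
first_match() {
  grep -m 1 -- "$1"
}

# Usage on the aggregator VM (not executed here):
#   docker logs -f --tail 1000 mithril-aggregator 2>&1 | first_match READY
```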

Then disconnect from the aggregator VM:
```bash
exit
```

## Rollback procedure

If the recomputation fails, you can roll back the database.

First, stop the aggregator:
```bash
docker stop mithril-aggregator
```

Then, restore the backed up database:
```bash
cp /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3.bak.$(date +%Y-%m-%d) /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3
```
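
Note that the date in the backup file name is the day the backup was taken, which is not necessarily today. A small sketch for picking the most recent backup, relying on the fact that the `YYYY-MM-DD` suffix sorts chronologically; `latest_backup` is a hypothetical helper:

```bash
# Hypothetical helper: print the most recent date-suffixed backup in a directory.
latest_backup() {
  ls "$1"/aggregator.sqlite3.bak.* 2>/dev/null | sort | tail -n 1
}

# Usage on the aggregator VM (not executed here):
#   stores=/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores
#   cp "$(latest_backup "$stores")" "$stores/aggregator.sqlite3"
```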

Then, start the aggregator:
```bash
docker start mithril-aggregator
```

Make sure that the certificate chain is valid (wait for the state machine to go into the state `READY`):
```bash
docker logs -f --tail 1000 mithril-aggregator
```

Then disconnect from the aggregator VM:
```bash
exit
```
@@ -9,7 +9,7 @@ query for that.
```sh
$> sqlite3 -table -batch \
$DATA_STORES_DIRECTORY/monitoring.sqlite3 \
< mithril-aggregator/utils/monitoring/stake_signer_version.sql
< stake_signer_version.sql
```

The variable `$DATA_STORES_DIRECTORY` should point to the directory where the
51 changes: 51 additions & 0 deletions docs/runbook/ssh-access/README.md
@@ -0,0 +1,51 @@
# Manage SSH access to infrastructure

## Add access to a user

### Create an SSH keypair for a user (if needed)

Create a new SSH keypair using the `ed25519` algorithm:
```bash
ssh-keygen -t ed25519 -C "your_email@example.com"
```

Then, add your keypair to the ssh-agent:
```bash
ssh-add ~/.ssh/id_ed25519
```

### Retrieve the public key of your SSH keypair

Run the following command to retrieve your public key:
```bash
cat ~/.ssh/id_ed25519.pub
```

### Declare the public key

Add a line with the format `**REMOTE_USER**:**PUBLIC_KEY**` in the `mithril-infra/assets/ssh_keys` file for each user:
```bash
echo "curry:ssh-ed25519 AAAE53AC3NzQ2vlZDI1aC1O4CpX+S2y1X9NTB4rv4k3pAAAAIF3b7L9sPV5ZiGgogmko your_email@example.com" >> **REPOSITORY_PATH**/mithril-infra/assets/ssh_keys
```
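
To avoid copy-paste mistakes, the line can also be generated directly from the public key file. A minimal sketch, where `ssh_key_line` is a hypothetical helper that only formats the line and leaves the append to the caller:

```bash
# Hypothetical helper: format a public key file as a REMOTE_USER:PUBLIC_KEY line.
ssh_key_line() {
  printf '%s:%s\n' "$1" "$(cat "$2")"
}

# Usage (not executed here):
#   ssh_key_line curry ~/.ssh/id_ed25519.pub >> **REPOSITORY_PATH**/mithril-infra/assets/ssh_keys
```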

Then, create a PR with the updated `ssh_keys` file.

## Remove access from a user

To remove a user's access, remove the line(s) related to this user from the `mithril-infra/assets/ssh_keys` file.
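
The removal can be sketched as a filter on the `REMOTE_USER:` prefix of each line. `remove_ssh_user` is a hypothetical helper; it assumes the user name contains no regular expression metacharacters.

```bash
# Hypothetical helper: drop every line belonging to a user from an ssh_keys file.
remove_ssh_user() {
  user="$1"
  file="$2"
  [ -f "$file" ] || return 1
  # grep exits with 1 when no line is left, which is not an error here.
  grep -v "^$user:" "$file" > "$file.tmp" || true
  mv "$file.tmp" "$file"
}

# Usage (not executed here):
#   remove_ssh_user curry **REPOSITORY_PATH**/mithril-infra/assets/ssh_keys
```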

Then, create a PR with the updated `ssh_keys` file.

## When are the modifications applied?

The modifications will be applied the next time the terraform deployment is done:
- next **merge** in `main` branch for `testing-preview`
- next **pre-release** created for `pre-release-preview`
- next **release** created for `release-preprod`
- next **release** created for `release-mainnet`

When the modifications are applied, the VM is updated in place by terraform.

:warning: In case of emergency, the SSH keys can be modified by an administrator:
- In GCP [**Compute Engine**](https://console.cloud.google.com/compute/instances)
- The SSH keys can be edited in the targeted VM(s)
25 changes: 25 additions & 0 deletions docs/runbook/terraform-lock/README.md
@@ -0,0 +1,25 @@
# Fix terraform deployment lock

## Introduction

When the CI cancels a job in the middle of a terraform deployment, the lock file that terraform uses under the hood to prevent concurrent deployments may not be removed. In that case, the next CI job that tries to deploy fails with an error stating that a lock prevents the deployment from proceeding.

## Find the workflow used to deploy a Mithril network

Currently, the following [Mithril networks](https://mithril.network/doc/manual/developer-docs/references#mithril-networks) are generally available, and deployed with `terraform`:
- `testing-preview`: with the workflow [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml)
- `pre-release-preview`: with the workflow [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml)
- `release-preprod`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
- `release-mainnet`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)


## Identify the terraform backend bucket
In the workflow file, the `terraform_backend_bucket` value names the GCP bucket that terraform uses to store the state of the deployment.

## Reset the terraform lock

A user with administrator rights can simply remove the lock file:
- In GCP [**Cloud Storage**](https://console.cloud.google.com/storage/browser)
- In the terraform administration bucket that you have identified earlier, the file that needs to be removed is at path `**TERRAFORM_BACKEND_BUCKET**/terraform/mithril-**MITHRIL_NETWORK_IDENTIFIER**/.terraform.lock.hcl` (e.g. `mithril-terraform-prod/terraform/mithril-release-mainnet/.terraform.lock.hcl`)
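
The path pattern above can be assembled from the bucket and network names, which helps avoid typos before acting on the file. This is a sketch only: `terraform_lock_path` is a hypothetical helper, and the `gs://` form assumes that the Google Cloud CLI is used instead of the web console.

```bash
# Hypothetical helper: build the lock file location from the bucket and network names.
terraform_lock_path() {
  printf 'gs://%s/terraform/mithril-%s/.terraform.lock.hcl\n' "$1" "$2"
}

terraform_lock_path mithril-terraform-prod release-mainnet

# With the Google Cloud CLI installed and authenticated, the file could be
# listed before removal (not executed here):
#   gsutil ls "$(terraform_lock_path mithril-terraform-prod release-mainnet)"
```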

:warning: Never delete or modify the `**TERRAFORM_BACKEND_BUCKET**/terraform/mithril-**MITHRIL_NETWORK_IDENTIFIER**/default.tfstate` file.