[docs] Update docs and add narwhal dashboard for orchestrator (#13327)
## Description 

Updating the orchestrator documentation and adding a Narwhal dashboard
to be used when running the benchmarks.

## Test Plan 

How did you test the new or updated feature?

---
If your changes are not user-facing and not a breaking change, you can
skip the following section. Otherwise, please indicate what changed, and
then add to the Release Notes section as highlighted during the release
process.

### Type of Change (Check all that apply)

- [ ] protocol change
- [ ] user-visible impact
- [ ] breaking change for client SDKs
- [ ] breaking change for FNs (FN binary must upgrade)
- [ ] breaking change for validators or node operators (must upgrade
binaries)
- [ ] breaking change for on-chain data layout
- [ ] necessitate either a data wipe or data migration

### Release notes
akichidis authored Sep 12, 2023
1 parent 8001f2e commit fc0f37e
Showing 3 changed files with 6,078 additions and 2 deletions.
44 changes: 42 additions & 2 deletions crates/sui-aws-orchestrator/README.md
@@ -46,7 +46,7 @@ Create a file called `settings.json` that contains all the configuration paramet
"specs": "m5d.8xlarge",
"repository": {
"url": "https://github.com/MystenLabs/sui.git",
"commit": "main"
},
"results_directory": "./results",
"logs_directory": "./logs"
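For reference, the excerpt above can be assembled into a complete `settings.json`. The sketch below is hedged: `testbed_id` is mentioned in the FAQ as a settings field, but the `regions` field name and both values are illustrative assumptions, not confirmed parts of the schema.

```shell
# Hedged sketch of a full settings.json for the orchestrator.
# "testbed_id" is referenced in the FAQ; "regions" and all values
# here are assumptions for illustration only.
cat > settings.json <<'EOF'
{
  "testbed_id": "my-testbed",
  "regions": ["us-east-1"],
  "specs": "m5d.8xlarge",
  "repository": {
    "url": "https://github.com/MystenLabs/sui.git",
    "commit": "main"
  },
  "results_directory": "./results",
  "logs_directory": "./logs"
}
EOF
```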
@@ -78,6 +78,9 @@ cargo run --bin sui-aws-orchestrator testbed status

Instances listed with a green number are available and ready for use, while instances listed with a red number are stopped.

Also keep in mind that nothing stops you from running the `deploy` command multiple times if you find yourself
needing more instances down the line.

## Step 4. Running benchmarks

Running benchmarks involves installing the specified version of the codebase on the remote machines and running one validator and one load generator per instance. For example, the following command benchmarks a committee of 10 validators under a constant load of 200 tx/s for 3 minutes:
@@ -90,4 +93,41 @@ In a network of 10 validators, each with a corresponding load generator, each lo
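The even load split described above can be sketched as simple arithmetic: with one load generator per validator, each generator submits its share of the total fixed load. The numbers below mirror the 10-validator, 200 tx/s example.

```shell
# Hedged sketch: the orchestrator spreads a fixed total load evenly
# across the load generators (one per validator instance).
total_load=200     # tx/s, from the benchmark's fixed load
committee=10       # number of validators / load generators
per_client=$((total_load / committee))
echo "each load generator submits ${per_client} tx/s"
```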

## Step 5. Monitoring

The orchestrator provides facilities to monitor metrics on clients and nodes. The orchestrator deploys a [Prometheus](https://prometheus.io) instance and a [Grafana](https://grafana.com) instance on a dedicated remote machine. Grafana is then available on the address printed on stdout (e.g., `http://3.83.97.12:3000`) with the default username and password both set to `admin`. You can either create a [new dashboard](https://grafana.com/docs/grafana/latest/getting-started/build-first-dashboard/) or [import](https://grafana.com/docs/grafana/latest/dashboards/manage-dashboards/#import-a-dashboard) the example dashboards located in the `./assets` folder.

## Destroy a testbed
Once you no longer need the deployed testbed, you can simply run

```
cargo run --bin sui-aws-orchestrator -- testbed destroy
```

to terminate all the deployed EC2 instances. Keep in mind that AWS does not delete terminated instances immediately (this can take a few hours), so if you want to deploy a new testbed right away it is advisable
to use a different `testbed_id` in `settings.json` to avoid conflicts later (see the FAQ section for more information).

## FAQ

### I am getting an error "Failed to read settings file '"crates/sui-aws-orchestrator/assets/settings.json"': No such file or directory"
To run the tool, a `settings.json` file with the deployment configuration must exist under the directory `crates/sui-aws-orchestrator/assets`. Also, make sure
you run the orchestrator from the top-level repo folder, e.g. `/sui $ cargo run --bin sui-aws-orchestrator`.

### I am getting an error "IncorrectInstanceState" with message "The instance 'i-xxxxxxx' is not in a state from which it can be started." when I try to run a benchmark
When a testbed is deployed, the EC2 instances are tagged with the `testbed_id` dictated by the `settings.json` file. When trying to run a benchmark, the tool lists
all the EC2 instances in the regions dictated by the configuration. To successfully run the benchmark, all the listed instances should be in the
`Running` state. If any instance is in a different state, e.g. `Terminated`, the above error will arise. Note that if you `destroy` a deployment
and then immediately `deploy` a new one under the same `testbed_id`, it is possible to end up with a mix of instances in the `Running` and `Terminated` states, as AWS does not immediately
delete `Terminated` instances. That can cause the above error as a false positive as well. In this case it is advised to use a different `testbed_id` to ensure that
there is no overlap between instances.

### I am getting an error "Not enough instances: missing X instances" when running a benchmark
In the common case, to successfully run a benchmark we need enough instances available to run:
* the required validators
* the Grafana dashboard
* the benchmarking clients

For example, when running the command `cargo run --bin sui-aws-orchestrator -- benchmark --committee 4 fixed-load --loads 500 --duration 500`, we need the following instances available:
* `4 instances` to run the validators (since we set `--committee 4`)
* `1 instance` to run the Grafana dashboard (by default only 1 is needed)
* no additional instances for the benchmarking clients, as those are co-deployed on the validator nodes

So in total we must have deployed a testbed of at least `5 instances`. If we attempt to run with fewer, the above error will be thrown.
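The instance accounting above can be sketched as follows; the variable names are illustrative, not part of the orchestrator's code.

```shell
# Hedged sketch: compute how many EC2 instances a benchmark run needs.
# Clients add no extra instances because they are co-deployed on the
# validator nodes; one dedicated instance hosts Grafana by default.
committee=4        # value passed via --committee
grafana=1          # dedicated monitoring instance
clients=0          # co-deployed with the validators
required=$((committee + grafana + clients))
echo "testbed must have at least ${required} instances"
```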
