From 428d6508b705762745a5620d14b87ce13ceeb1e3 Mon Sep 17 00:00:00 2001
From: Anton Evangelatov
Date: Tue, 21 Apr 2020 18:47:00 +0200
Subject: [PATCH] one README.md (#15)

* one README.md

* fixup
---
 README.md     | 219 +++++++++++++++++++++++++++++++++++++-----
 k8s/README.md | 259 --------------------------------------------------
 2 files changed, 194 insertions(+), 284 deletions(-)
 delete mode 100644 k8s/README.md

diff --git a/README.md b/README.md
index ed821905..b9a2f87b 100644
--- a/README.md
+++ b/README.md
@@ -1,45 +1,214 @@
 # Testground infrastructure
-This repo contains scripts for setting up a Kubernetes cluster for [Testground](https://testground.ipfs.team).
+## Background
-Using the `cluster:k8s` runner of Testground enables you to test distributed/p2p systems at large scales. Testing at large scale is an essential component of developing rock-solid distributed software.
+This repo contains scripts for setting up a Kubernetes cluster for [Testground](https://testground.ipfs.team).
+
+Using the `cluster:k8s` runner of Testground enables you to test distributed/p2p systems at scale. The `cluster:k8s` runner is capable of launching test workloads comprising 10k+ instances, and we aim to reach 100k at some point.
-The [IPFS](https://ipfs.io/) and [libp2p](https://libp2p.io/) projects have used these scripts and playbooks to deploy large-scale test infrastructure. By crafting test scenarios that exercise components at such scale, we have been able to run simulations, carry out attacks, perform benchmarks, and execute all kinds of test to validate correctness at scale.
+The [IPFS](https://ipfs.io/) and [libp2p](https://libp2p.io/) projects have used these scripts and playbooks to deploy large-scale test infrastructure. By crafting test scenarios that exercise components at such scale, we have been able to run simulations, carry out attacks, perform benchmarks, and execute all kinds of tests to validate correctness and performance.
+
+## Introduction
+
+Kubernetes Operations (`kops`) is a tool that helps you create, destroy, upgrade, and maintain production-grade Kubernetes clusters from the command line. We use it to create a Kubernetes cluster on AWS.
+
+We use CoreOS Flannel for the default Kubernetes network, which in Testground terms is called the `control` network.
+
+We use Weave for the Testground `data` plane: a secondary overlay network that we attach containers to on demand.
+
+`kops` uses 100.96.0.0/11 as the pod CIDR range, so this is what we use for the `control` network.
+
+We configure Weave to use 16.0.0.0/4 as its CIDR (we want to test `libp2p` nodes with IPs in public ranges), so this is the CIDR for the Testground `data` network. The `sidecar` is responsible for setting up the `data` network for every test plan instance.
+
+In order to attach two different networks to pods in Kubernetes, we run the [CNI-Genie CNI](https://github.com/cni-genie/CNI-Genie).
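+
+For illustration, here is a minimal sketch (not part of this repo) of how CNI-Genie selects networks for a pod: a `cni` annotation names the plugins to attach. Note that you normally don't write this yourself in Testground; the `sidecar` attaches test plan pods to the `data` network on demand.
+
+```sh
+# Hypothetical pod that joins both the flannel (control) and weave (data) networks.
+$ cat <<'EOF' | kubectl apply -f -
+apiVersion: v1
+kind: Pod
+metadata:
+  name: multi-network-example
+  annotations:
+    cni: "flannel,weave"   # CNI-Genie: comma-separated list of CNI plugins
+spec:
+  containers:
+  - name: app
+    image: busybox
+    command: ["sleep", "3600"]
+EOF
+```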
+
+More information on the Testground networking requirements can be found in [NETWORKING.md](https://github.com/testground/testground/blob/master/docs/NETWORKING.md).
+
+## Requirements
+
+1. An AWS account with API access
+2. [kops](https://github.com/kubernetes/kops/releases) >= 1.17.0
+3. [terraform](https://terraform.io) >= 0.12.21
+4. [AWS CLI](https://aws.amazon.com/cli)
+5. [helm](https://github.com/helm/helm) >= 3.0
+
+## Set up cloud credentials, cluster specification and repositories for dependencies
+
+1. [Generate your AWS IAM credentials](https://console.aws.amazon.com/iam/home#/security_credentials).
+
+   * [Configure the aws-cli tool with your credentials](https://docs.aws.amazon.com/cli/).
+   * Create a `.env.toml` file (copying over the [`env-example.toml`](https://github.com/ipfs/testground/blob/master/env-example.toml) at the root of this repo as a template), and add your region to the `[aws]` section.
+
+2. For the Testground team: download the shared key for `kops`. The Testground team uses a shared key, so that everyone on the team can log into any ephemeral cluster and have full access.
+
+```sh
+$ aws s3 cp s3://kops-shared-key-bucket/testground_rsa ~/.ssh/
+$ aws s3 cp s3://kops-shared-key-bucket/testground_rsa.pub ~/.ssh/
+$ chmod 600 ~/.ssh/testground_rsa
+```
+
+Alternatively, generate your own key, for example:
+
+```sh
+$ ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
+```
+
+3. Create a bucket for the `kops` state. This is similar to a Terraform state bucket.
+
+```sh
+$ aws s3api create-bucket \
+    --bucket <bucket-name> \
+    --region <region> --create-bucket-configuration LocationConstraint=<region>
+```
+
+Where:
+
+* `<bucket-name>` is an AWS account-wide unique bucket name to store this cluster's kops state, e.g. `kops-backend-bucket-<your-name>`.
+* `<region>` is an AWS region like `eu-central-1` or `us-west-2`.
+
+4. Set up the following environment variables:
+
+- the cluster name
+- the AWS region
+- the AWS availability zone (not region; this is something like `us-west-2a` [availability zone], not `us-west-2` [region])
+- the `kops` state store bucket
+- the number of worker nodes
+- the location for the generated cluster spec
+- the location of your cluster SSH public key
+- the credentials and locations for the `outputs` S3 bucket
+
+You might want to add them to your `rc` file (`.zshrc`, `.bashrc`, etc.), or to an `.env.sh` file that you source.
+
+These variables are needed beyond the initial cluster setup: they must also be accessible to the Testground daemon. If you set or source them manually, make sure to do so before starting the daemon.
+
+```sh
+# `NAME` needs to be a subdomain of an existing Route53 domain name.
+# The Testground team uses `.testground.ipfs.team`, which is already set up for our Testground AWS account.
+# Alternatively you could use `name.k8s.local` and use Gossip DNS.
+export NAME=<cluster-name>
+export KOPS_STATE_STORE=s3://<bucket-name>
+export AWS_REGION=<region>
+export ZONE=<availability-zone>
+export WORKER_NODES=4
+export PUBKEY=$HOME/.ssh/testground_rsa.pub
+```
+
+5. Set up Helm and add the `stable` Helm Charts repository.
+
+If you haven't already, [install helm now](https://helm.sh/docs/intro/install/).
+
+```sh
+$ helm repo add stable https://kubernetes-charts.storage.googleapis.com/
+$ helm repo add bitnami https://charts.bitnami.com/bitnami
+$ helm repo update
+```
+
+## Create the Kubernetes cluster
+
+This will take about 10-15 minutes to complete.
+
+Once you run this command, take some time to walk the dog, clean up around the office, or go get yourself some coffee! When you return, your shiny new Kubernetes cluster will be ready to run Testground plans.
+
+To create a monitored cluster in the availability zone specified in `$ZONE` with
+`$WORKER_NODES` workers:
+
+```sh
+$ ./k8s/install.sh ./k8s/cluster.yaml
+```
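+
+Before moving on, it's worth confirming that the cluster came up healthy. A quick check with standard `kops` and `kubectl` commands (assuming the environment variables above are still set):
+
+```sh
+# Ask kops to verify the cluster and its instance groups.
+$ kops validate cluster --state $KOPS_STATE_STORE --name $NAME
+
+# All worker nodes should become Ready, and the Testground DaemonSets Running.
+$ kubectl get nodes
+$ kubectl get pods --all-namespaces
+```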
+
+## Destroy the cluster when you're done working on it
+
+Do not forget to delete the cluster once you are done running test plans.
+
+```sh
+$ ./k8s/delete.sh
+```
+
+## Resizing the cluster
+
+1. Edit the cluster state and change the number of nodes.
+
+```sh
+$ kops edit ig nodes
+```
-## Quick start
+
+2. Apply the new configuration
-We are using kops to create a cluster rather than a hosted kubernetes service. Doing it this way enables us to tune kernel parameters and make customizations that have proven to be important.
+
+```sh
+$ kops update cluster $NAME --yes
+```
-There are a couple of dependencies required to make the `cluster:k8s` runner work.
+
+3. Wait for the new nodes to come up and for the DaemonSets to be Running on all of them
-### required software
- * an AWS account with API access
- * helm v3+ [link](https://helm.sh/)
- * kops v1.17.0+ [link](https://github.com/kubernetes/kops/releases)
- * terraform v0.12+ [link](https://www.terraform.io/)
+
+```sh
+$ watch 'kubectl get pods'
+```
-### environment variables
-Set up environment variables before starting the cluster
- * AWS_PROFILE (if you have multiple AWS accounts)
- * NAME (cluster name)
- * PUBKEY (SSH key for testground workers)
- * ZONE (availability zone i.e. us-west-2a)
- * AWS_REGION (where is your cluster. i.e. us-west-2)
- * KOPS_STATE_STORE (s3 bucket for kops)
- * WORKER_NODES (size of your kubernetes cluster)
+
+## Testground observability
-### Create the cluster
-This will take about 15 minutes to complete.
-Once you run this, take some time to walk the dog, clean up around the office, or go get yourself some coffee! When you return, your shiny new kubernetes cluster will be ready to run testground plans.
+
+1. Access Grafana (initial credentials are `username: admin`; `password: testground`):
+
+```sh
+$ kubectl port-forward service/testground-infra-grafana 3000:80
+```
-```
-k8s/install.sh k8s/cluster.yaml
-```
+
+## Cleanup after Testground and other useful commands
+
+Testground is still at a very early stage of development. It is possible that it crashes, or doesn't properly clean up after a test plan run. Here are a few commands that could be helpful for you to inspect the state of your Kubernetes cluster and clean up after Testground.
+
+1. Delete all pods that have the `testground.plan=dht` label (in case you used the `--run-cfg keep_service=true` setting on Testground):
+
+```sh
+$ kubectl delete pods -l testground.plan=dht --grace-period=0 --force
+```
+
+2. Restart the `sidecar` daemon, which manages networks for all test plans:
+
+```sh
+$ kubectl delete pods -l name=testground-sidecar --grace-period=0 --force
+```
+
+3. Review all running pods:
+
+```sh
+$ kubectl get pods -o wide
+```
+
+4. Get logs from a given pod:
+
+```sh
+$ kubectl logs <pod-name>
+```
+
+5. Check on the monitoring infrastructure (it runs in the `monitoring` namespace):
+
+```sh
+$ kubectl get pods --namespace monitoring
+```
+
+6. Get access to the Redis shell:
+
+```sh
+$ kubectl port-forward svc/testground-infra-redis-master 6379:6379 &
+$ redis-cli -h localhost -p 6379
+```
+
+## Use a Kubernetes context for another cluster
+
+`kops` lets you export the entire Kubernetes context config.
+
+If you want to let other people on your team connect to your Kubernetes cluster, share this config with them.
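+
+Once they have merged it into their own kubeconfig, they can hop between clusters with the standard `kubectl` context commands, for example (assuming the context is named after `$NAME`, which is how `kops` names it):
+
+```sh
+$ kubectl config get-contexts        # list all configured clusters
+$ kubectl config use-context $NAME   # point kubectl at this cluster
+```
+
+To export the config from the `kops` state store: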
+ +```sh +$ kops export kubecfg --state $KOPS_STATE_STORE --name=$NAME ``` ## Documentation -Additional information about this runner and more can be found on [testground gitbook](https://app.gitbook.com/@protocol-labs/s/testground/) +Additional information about this runner and more can be found on the [Testground gitbook](https://app.gitbook.com/@protocol-labs/s/testground/) ## Contribute diff --git a/k8s/README.md b/k8s/README.md deleted file mode 100644 index f2367ee1..00000000 --- a/k8s/README.md +++ /dev/null @@ -1,259 +0,0 @@ -# Setting up a self-managed Kubernetes cluster with kops on AWS for Testground - -In this directory, you will find: - -``` -» tree -. -├── README.md -└── kops-weave # Kubernetes resources for setting up networking with Weave and Flannel -``` - -## Introduction - -Kubernetes Operations (kops) is a tool which helps to create, destroy, upgrade and maintain production-grade Kubernetes clusters from the command line. We use it to create a k8s cluster on AWS. - -We use CoreOS Flannel for networking on Kubernetes - for the default Kubernetes network, which in Testground terms is called the `control` network. - -We use Weave for the `data` plane on Testground - a secondary overlay network that we attach containers to on-demand. - -`kops` uses 100.96.0.0/11 for pod CIDR range, so this is what we use for the `control` network. - -We configure Weave to use 16.0.0.0/4 as CIDR (we want to test `libp2p` nodes with IPs in public ranges), so this is the CIDR for the Testground `data` network. The `sidecar` is responsible for setting up the `data` network for every testplan instance. - -In order to have two different networks attached to pods in Kubernetes, we run the [CNI-Genie CNI](https://github.com/cni-genie/CNI-Genie). - - -## Requirements - -1. [kops](https://github.com/kubernetes/kops/releases) >= 1.17.0-alpha.1 -2. [terraform](https://terraform.io) >= 0.12.21 -3. [AWS CLI](https://aws.amazon.com/cli) -4. [helm](https://github.com/helm/helm) >= 3.0 - -## Set up cloud credentials, cluster specification and repositories for dependencies - -1. [Generate your AWS IAM credentials](https://console.aws.amazon.com/iam/home#/security_credentials). - - * [Configure the aws-cli tool with your credentials](https://docs.aws.amazon.com/cli/). - * Create a `.env.toml` file (copying over the [`env-example.toml`](https://github.com/ipfs/testground/blob/master/env-example.toml) at the root of this repo as a template), and add your region to the `[aws]` section. - -2. Download shared key for `kops`. We use a shared key, so that everyone on the team can log into any cluster and have full access. - -```sh -$ aws s3 cp s3://kops-shared-key-bucket/testground_rsa ~/.ssh/ -$ aws s3 cp s3://kops-shared-key-bucket/testground_rsa.pub ~/.ssh/ -$ chmod 700 ~/.ssh/testground_rsa -``` - -Or generate your own key, for example - -```sh -$ ssh-keygen -t rsa -b 4096 -C "your_email@example.com" -``` - -3. Create a bucket for `kops` state. This is similar to Terraform state bucket. - -```sh -$ aws s3api create-bucket \ - --bucket \ - --region --create-bucket-configuration LocationConstraint= -``` - -Where: - -* `` is a unique AWS account-wide unique bucket name to store this cluster's kops state, e.g. `kops-backend-bucket-`. -* `` is an AWS region like `eu-central-1` or `us-west-2`. - -4. 
Pick: - -- a cluster name, -- set AWS region -- set AWS availability zone (not region; this is something like `us-west-2a` [availability zone], not `us-west-2` \[region]) -- set `kops` state store bucket -- set number of worker nodes -- set location for cluster spec to be generated -- set location of your cluster SSH public key -- set credentials and locations for `outputs` S3 bucket - -You might want to add them to your `rc` file (`.zshrc`, `.bashrc`, etc.), or to an `.env.sh` file that you source. - -In addition to the initial cluster setup, these variables should be accessible to the daemon. If these variables are -manually set or you source them manually, you should make sure to do so before starting the testground daemon. - -```sh -# `NAME` needs to be a subdomain of an existing Route53 domain name. -# The Testground team uses `.testground.ipfs.team`, which is already set up for our Testground AWS account. -# Alternatively you could use `name.k8s.local` and use Gossip DNS. -export NAME= -export KOPS_STATE_STORE=s3:// -export AWS_REGION= -export ZONE= -export WORKER_NODES=4 -export PUBKEY=$HOME/.ssh/testground_rsa.pub -``` - -5. Set up Helm and add the `stable` Helm Charts repository - -If you haven't, [install helm now](https://helm.sh/docs/intro/install/). - -```sh -$ helm repo add stable https://kubernetes-charts.storage.googleapis.com/ -$ helm repo add bitnami https://charts.bitnami.com/bitnami -$ helm repo update -``` - -## Install the Kubernetes cluster - -To create a monitored cluster in the region specified in `$ZONE` with -`$WORKER_NODES` number of workers: - -```sh -$ cd /infra/k8s -$ ./install.sh ./cluster.yaml -``` - -If you're using the fish shell, you will want to summon bash: - -```sh -$ cd /infra/k8s -$ bash -c './install.sh ./cluster.yaml' -``` - -## Destroy the cluster when you're done working on it - -```sh -$ ./delete.sh -``` - -## Configure and run your Testground daemon - -```sh -$ cd -$ go build . -$ ./testground --vv daemon -``` - -## Run a Testground testplan - -Use compositions: [/docs/COMPOSITIONS.md](../../docs/COMPOSITIONS.md). - -or - -```sh -$ ./testground --vv run single network/ping-pong \ - --builder=docker:go \ - --runner=cluster:k8s \ - --build-cfg bypass_cache=true \ - --build-cfg push_registry=true \ - --build-cfg registry_type=aws \ - --run-cfg keep_service=true \ - --instances=2 -``` - -or - -```sh -$ ./testground --vv run single dht/find-peers \ - --builder=docker:go \ - --runner=cluster:k8s \ - --build-cfg push_registry=true \ - --build-cfg registry_type=aws \ - --run-cfg keep_service=true \ - --instances=16 -``` - -## Resizing the cluster - -1. Edit the cluster state and change number of nodes. - -```sh -$ kops edit ig nodes -``` - -2. Apply the new configuration - -```sh -$ kops update cluster $NAME --yes -``` - -3. Wait for nodes to come up and for DaemonSets to be Running on all new nodes - -```sh -$ watch 'kubectl get pods' -``` - -## Destroying the cluster - -Do not forget to delete the cluster once you are done running test plans. - -## Testground observability - -1. Access to Grafana (initial credentials are `username: admin` ; `password: testground`): - -```sh -$ kubectl port-forward service/testground-infra-grafana 3000:80 -``` - -2. Access the Prometheus Web UI - -```sh -$ kubectl port-forward service/testground-infra-prometheu-prometheus 9090:9090 -``` - -Direct your web browser to [http://localhost:9090](http://localhost:9090). 
- -## Cleanup after Testground and other useful commands - -Testground is still in very early stage of development. It is possible that it crashes, or doesn't properly clean-up after a testplan run. Here are a few commands that could be helpful for you to inspect the state of your Kubernetes cluster and clean up after Testground. - -1. Delete all pods that have the `testground.plan=dht` label (in case you used the `--run-cfg keep_service=true` setting on Testground. - -```sh -$ kubectl delete pods -l testground.plan=dht --grace-period=0 --force -``` - -2. Restart the `sidecar` daemon which manages networks for all testplans - -```sh -$ kubectl delete pods -l name=testground-sidecar --grace-period=0 --force -``` - -3. Review all running pods - -```sh -$ kubectl get pods -o wide -``` - -4. Get logs from a given pod - -```sh -$ kubectl logs -``` - -5. Check on the monitoring infrastructure (it runs in the monitoring namespace) - -```sh -$ kubectl get pods --namespace monitoring -``` - -6. Get access to the Redis shell - -```sh -$ kubectl port-forward svc/testground-infra-redis-master 6379:6379 & -$ redis-cli -h localhost -p 6379 -``` - -## Use a Kubernetes context for another cluster - -`kops` lets you download the entire Kubernetes context config. - -If you want to let other people on your team connect to your Kubernetes cluster, you need to give them the information. - -```sh -$ kops export kubecfg --state $KOPS_STATE_STORE --name=$NAME -``` - -## Known issues and future improvements - -- [ ] Alerts (and maybe auto-scaling down) for idle clusters, so that we don't incur costs.