Add prod recommendations and migrating guide #2334

Merged · 15 commits · Jul 14, 2021
22 changes: 11 additions & 11 deletions docs/clients/install.md
@@ -1,6 +1,16 @@
# Install

## Install with pip
## Install the CLI

<!-- CORTEX_VERSION_README x2 -->
```bash
# download CLI version 0.38.0 (Note the "v"):
bash -c "$(curl -sS https://raw.githubusercontent.com/cortexlabs/cortex/v0.38.0/get-cli.sh)"
```

By default, the Cortex CLI is installed at `/usr/local/bin/cortex`. To install the executable elsewhere, export the `CORTEX_INSTALL_PATH` environment variable to your desired location before running the command above.
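
For example, a minimal sketch that installs the CLI into the current working directory (mirroring the `CORTEX_INSTALL_PATH` usage shown later in these docs):

```bash
# install the CLI into the current working directory instead of /usr/local/bin
CORTEX_INSTALL_PATH=$(pwd)/cortex bash -c "$(curl -sS https://raw.githubusercontent.com/cortexlabs/cortex/v0.38.0/get-cli.sh)"

# confirm the installation
./cortex version
```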

## Install the CLI and Python client via pip

To install the latest version:

@@ -21,16 +31,6 @@ To upgrade to the latest version:
pip install --upgrade cortex
```

## Install without the Python client

<!-- CORTEX_VERSION_README x2 -->
```bash
# For example to download CLI version 0.38.0 (Note the "v"):
bash -c "$(curl -sS https://raw.githubusercontent.com/cortexlabs/cortex/v0.38.0/get-cli.sh)"
```

By default, the Cortex CLI is installed at `/usr/local/bin/cortex`. To install the executable elsewhere, export the `CORTEX_INSTALL_PATH` environment variable to your desired location before running the command above.

## Changing the CLI/client configuration directory

By default, the CLI/client creates a directory at `~/.cortex/` and uses it to store environment configuration. To use a different directory, export the `CORTEX_CLI_CONFIG_DIR` environment variable before running any `cortex` commands.
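
For example, a sketch that keeps the configuration in a project-local directory (the path is illustrative):

```bash
# store environment configuration alongside your project instead of ~/.cortex/
export CORTEX_CLI_CONFIG_DIR="$(pwd)/.cortex"

# subsequent cortex commands in this shell will use the directory above
cortex version
```
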
2 changes: 1 addition & 1 deletion docs/clusters/instances/spot.md
@@ -17,7 +17,7 @@ node_groups:
on_demand_base_capacity: 0

# percentage of on demand instances to use after the on demand base capacity has been met [0, 100] (default: 50)
# note: setting this to 0 may hinder cluster scale up when spot instances are not available
# note: setting this to 0 may hinder cluster scale-up when spot instances are not available
on_demand_percentage_above_base_capacity: 0

# max price for spot instances (default: the on-demand price of the primary instance type)
5 changes: 3 additions & 2 deletions docs/clusters/management/create.md
@@ -9,9 +9,10 @@

## Create a cluster on your AWS account

<!-- CORTEX_VERSION_README -->
```bash
# install the CLI
pip install cortex
# install the cortex CLI
bash -c "$(curl -sS https://raw.githubusercontent.com/cortexlabs/cortex/v0.38.0/get-cli.sh)"

# create a cluster
cortex cluster up cluster.yaml
9 changes: 6 additions & 3 deletions docs/clusters/management/delete.md
@@ -8,10 +8,13 @@ cortex cluster down

When a Cortex cluster is created, an S3 bucket is created for its internal use. When running `cortex cluster down`, a lifecycle rule is applied to the bucket such that its entire contents are removed within the next 24 hours. You can safely delete the bucket at any time after `cortex cluster down` has finished running.
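
If you prefer not to wait for the lifecycle rule, a sketch for deleting the bucket immediately with the AWS CLI (replace the placeholder with your cluster's bucket name):

```bash
# permanently delete the cluster's S3 bucket and all of its contents
aws s3 rb s3://<cortex-cluster-bucket> --force
```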

## Delete Certificates
## Delete SSL Certificate

If you've configured a custom domain for your APIs, you can remove the SSL Certificate and Hosted Zone for the domain by
following these [instructions](../networking/custom-domain.md#cleanup).
If you've set up HTTPS, you can remove the SSL Certificate by following these [instructions](../networking/https.md#cleanup).

## Delete Hosted Zone

If you've configured a custom domain for your APIs, follow these [instructions](../networking/custom-domain.md#cleanup) to delete the Hosted Zone.

## Keep Cortex Resources

89 changes: 89 additions & 0 deletions docs/clusters/management/production.md
@@ -0,0 +1,89 @@
# Production guide

As you take Cortex from development to production, here are a few pointers that might be useful.

## Use images from a colocated ECR

Configure your cluster and APIs to use images from ECR in the same region as your cluster to accelerate scale-ups, reduce ingress costs, and remove the dependency on Cortex's public quay.io registry.

You can find instructions for mirroring Cortex images [here](../advanced/self-hosted-images.md).

## Handling Cortex updates/upgrades

Use a Route 53 hosted zone as a proxy in front of your Cortex cluster. Every new Cortex cluster provisions a new API load balancer with a unique endpoint. Using a Route 53 hosted zone configured with a subdomain will expose your Cortex cluster's API endpoint as a static endpoint (e.g. `cortex.your-company.com`). You will be able to upgrade Cortex versions without downtime, and you will avoid the need to update your client code every time you migrate to a new cluster. You can find instructions for setting up a custom domain with a Route 53 hosted zone [here](../networking/custom-domain.md), and instructions for updating/upgrading your cluster [here](update.md).

## Production cluster configuration

### Securing your cluster

The following configuration will improve security by preventing your cluster's nodes from being publicly accessible.

```yaml
subnet_visibility: private

nat_gateway: single # use "highly_available" for large clusters making requests to services outside of the cluster
```

You can make your load balancer private to prevent your APIs from being publicly accessed. In order to access your APIs, you will need to set up VPC peering between the Cortex cluster's VPC and the VPC containing the consumers of the Cortex APIs. See the [VPC peering guide](../networking/vpc-peering.md) for more details.

```yaml
api_load_balancer_scheme: internal
```

You can also restrict access to your load balancers by IP address:

```yaml
api_load_balancer_cidr_white_list: [0.0.0.0/0]
```

These two fields are also available for the operator load balancer. Keep in mind that if you make the operator load balancer private, you'll need to configure VPC peering to use the `cortex` CLI or Python client.

```yaml
operator_load_balancer_scheme: internal
operator_load_balancer_cidr_white_list: [0.0.0.0/0]
```

See [here](../networking/load-balancers.md) for more information about the load balancers.

### Ensure node provisioning

You can take advantage of the cost savings of spot instances and the reliability of on-demand instances by utilizing the `priority` field in node groups. You can deploy two node groups, one that is spot and another that is on-demand. Set the priority of the spot node group to be higher than the priority of the on-demand node group. This encourages the cluster-autoscaler to try to spin up instances from the spot node group first. If there are no more spot instances available, the on-demand node group will be used instead.

```yaml
node_groups:
- name: gpu-spot
instance_type: g4dn.xlarge
min_instances: 0
max_instances: 5
spot: true
priority: 100
- name: gpu-on-demand
instance_type: g4dn.xlarge
min_instances: 0
max_instances: 5
priority: 1
```

### Considerations for large clusters

If you plan on scaling your Cortex cluster past 400 nodes or 800 pods, it is recommended to set `prometheus_instance_type` to a larger instance type. A good guideline is that a t3.medium instance can reliably handle 400 nodes and 800 pods.
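
For example, a sketch of the relevant cluster configuration field (the instance type shown is illustrative):

```yaml
# use a larger instance for Prometheus on clusters beyond ~400 nodes / 800 pods
prometheus_instance_type: t3.xlarge
```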

## API Spec

### Container design

Configure your health checks to be as accurate as possible to prevent requests from being routed to pods that aren't ready to handle traffic.
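
As a sketch only, assuming Kubernetes-style probe fields in the container spec (verify the exact field names against the API configuration reference):

```yaml
containers:
  - name: api
    # mark the pod ready only once the server can actually handle requests
    readiness_probe:
      http_get:
        path: /healthz
        port: 8080
      initial_delay_seconds: 5
      period_seconds: 5
```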

### Pods section

Make sure that `max_concurrency` is set to match the concurrency supported by your container.

Tune `max_queue_length` to a lower value if you would like to redistribute requests to newer pods more aggressively as your API scales up, rather than allowing requests to linger in queues. In that case, the clients consuming your APIs should implement retry logic with a delay (such as exponential backoff).
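
A minimal sketch of these fields, assuming the `pod` section layout of the Realtime API spec (values are illustrative):

```yaml
pod:
  # match the number of requests your container can process concurrently
  max_concurrency: 8
  # keep the queue short so excess requests are retried against newer pods
  max_queue_length: 16
```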

### Compute section

Make sure to specify all of the relevant compute resources (especially cpu and memory) to ensure that your pods aren't starved for resources.
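
A sketch of a container's compute section (values are illustrative; adjust to your workload):

```yaml
compute:
  cpu: 1
  gpu: 1
  mem: 4Gi
```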

### Autoscaling

Revisit the autoscaling docs for [Realtime APIs](../../workloads/realtime/autoscaling.md) and/or [Async APIs](../../workloads/async/autoscaling.md) to effectively handle production traffic by tuning the scaling rate, sensitivity, and over-provisioning.
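
A sketch of commonly tuned autoscaling fields (treat the exact field names as assumptions and confirm them against the autoscaling docs linked above):

```yaml
autoscaling:
  min_replicas: 1   # raise this to over-provision ahead of expected traffic
  max_replicas: 10
  target_in_flight: 8   # typically aligned with max_concurrency
  upscale_stabilization_period: 1m
  downscale_stabilization_period: 10m
```
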
118 changes: 98 additions & 20 deletions docs/clusters/management/update.md
@@ -1,36 +1,114 @@
# Update

## Update node group size
## Modify existing cluster

You can add or remove node groups, resize existing node groups, and update some configuration fields of a running cluster.

Fetch the current cluster configuration:

```bash
cortex cluster scale --node-group <node-group-name> --min-instances <min-instances> --max-instances <max-instances>
cortex cluster info --print-config --name CLUSTER_NAME --region REGION > cluster.yaml
```

## Upgrade to a newer version
Make your desired changes, and then apply them:

```bash
# spin down your cluster
cortex cluster down --name <name> --region <region>
cortex cluster configure cluster.yaml
```

Cortex will calculate the difference and you will be prompted with the update plan.
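
For example, resizing a node group in the exported cluster.yaml is a typical modification that can be applied this way (values are illustrative; the field names match the node group examples elsewhere in these docs):

```yaml
node_groups:
  - name: gpu-spot
    instance_type: g4dn.xlarge
    min_instances: 0
    max_instances: 10  # increased from 5
    spot: true
```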

If you would like to update fields that cannot be modified on a running cluster, you must create a new cluster with your desired configuration.

## Upgrade to a new version

Upgrading the Cortex version of an existing cluster is not supported at the moment. Please spin down the previous version of the cluster, install the latest version of the Cortex CLI, and use it to spin up a new Cortex cluster. See the next section for how to do this without downtime.

## Update or upgrade without downtime

It is possible to update to a new version of Cortex or to migrate from one cluster to another without downtime.

Note: do not spin down your previous cluster until your new cluster is receiving traffic.

### Set up a subdomain using a Route 53 hosted zone

If you've already set up a subdomain with a Route 53 hosted zone pointing to your cluster, skip this step.

Setting up a Route 53 hosted zone allows you to transfer traffic seamlessly from an existing cluster to a new cluster, thereby avoiding downtime. You can find the instructions for setting up a subdomain [here](../networking/custom-domain.md). You will need to update any clients interacting with your Cortex APIs to point to the new subdomain.

# update your CLI to the latest version
pip install --upgrade cortex
### Export all APIs from your previous cluster

# confirm version
The `cluster export` command can be used to get the YAML specifications of all APIs deployed in your cluster:

```bash
cortex cluster export --name <previous_cluster_name> --region <region>
```

### Spin up a new cortex cluster

If you are creating a new cluster with the same Cortex version:

```bash
cortex cluster up new-cluster.yaml --configure-env cortex2
```

This will create a CLI environment named `cortex2` for accessing the new cluster.

If you are spinning up a new cluster with a different Cortex version, first install the Cortex CLI matching the desired cluster version:

```bash
# download the desired CLI version, replace 0.38.0 with the desired version (Note the "v"):
bash -c "$(curl -sS https://raw.githubusercontent.com/cortexlabs/cortex/v0.38.0/get-cli.sh)"

# confirm Cortex CLI version
cortex version

# spin up your cluster
cortex cluster up cluster.yaml
# spin up your cluster using the new CLI version
cortex cluster up cluster.yaml --configure-env cortex2
```

You can use different Cortex CLIs to interact with the different versioned clusters; here is an example:

```bash
# download the desired CLI version, replace 0.38.0 with the desired version (Note the "v"):
CORTEX_INSTALL_PATH=$(pwd)/cortex0.38.0 bash -c "$(curl -sS https://raw.githubusercontent.com/cortexlabs/cortex/v0.38.0/get-cli.sh)"

# confirm cortex CLI version
./cortex0.38.0 version
```

### Deploy the APIs to your new cluster

Please read the [changelogs](https://github.com/cortexlabs/cortex/releases) and the latest documentation to identify any features and breaking changes in the new version. You may need to make modifications to your cluster and/or API configuration files.

After you've updated the API specifications and images if necessary, you can deploy them onto your new cluster:

```bash
cortex deploy -e cortex2 <api_spec_file>
```
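
If you exported many APIs, a sketch for deploying each exported spec in a loop (the exported directory layout is an assumption; adjust the path and glob to match your `cortex cluster export` output):

```bash
cd <exported_directory>
for api_spec in */*.yaml; do
  # deploy each exported API spec into the new cluster's CLI environment
  cortex deploy "$api_spec" --env cortex2
done
```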

### Point your custom domain to your new cluster

Verify that all of the APIs in your new cluster are working as expected by accessing them via the cluster's API load balancer URL.

Get the cluster's API load balancer URL:

```bash
cortex cluster info --name <new_cluster_name> --region <region>
```
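
For example, a sketch of a test request against one of your APIs (the endpoint path and payload are illustrative):

```bash
# send a test request to an API via the new cluster's API load balancer
curl "http://<api_load_balancer_url>/<api_name>" \
  -X POST -H "Content-Type: application/json" \
  -d '{"key": "value"}'
```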

## Upgrade without downtime
Once the APIs on the new cluster have been verified as working properly, it is recommended to update `min_replicas` of your APIs on the new cluster to match the current values in your previous cluster. This will avoid large sudden scale-up events as traffic is shifted to the new cluster.

In production environments, you can upgrade your cluster without downtime if you have a backend service or DNS in front of your Cortex cluster:
Then, navigate to the A record in your custom domain's Route 53 hosted zone and update the Alias to point to the new cluster's API load balancer URL. Rather than suddenly routing all of your traffic from the previous cluster to the new cluster, you can use [weighted records](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html#routing-policy-weighted) to incrementally route more traffic to your new cluster.

1. Spin up a new cluster. For example: `cortex cluster up new-cluster.yaml --configure-env cortex2` (this will create a CLI environment named `cortex2` for accessing the new cluster).
1. Re-deploy your APIs in your new cluster. For example, if the name of your CLI environment for your existing cluster is `cortex`, you can use `cortex get --env cortex` to list all running APIs in your cluster, and re-deploy them in the new cluster by running `cortex deploy --env cortex2` for each API. Alternatively, you can run `cortex cluster export --name <previous_cluster_name> --region <region>` to export the API specifications for all of your running APIs, change directories to the folder that was exported, and run `cortex deploy --env cortex2 <file_name>` for each API that you want to deploy in the new cluster.
1. Route requests to your new cluster.
* If you are using a custom domain: update the A record in your Route 53 hosted zone to point to your new cluster's API load balancer.
* If you have a backend service which makes requests to Cortex: update your backend service to make requests to the new cluster's endpoints.
* If you have a self-managed API Gateway in front of your Cortex cluster: update the routes to use new cluster's endpoints.
1. Spin down your previous cluster. If you updated DNS settings, wait 24-48 hours before spinning down your previous cluster to allow the DNS cache to be flushed.
1. You may now rename your new CLI environment name if you'd like (e.g. to rename it back to "cortex": `cortex env rename cortex2 cortex`)
If you increased `min_replicas` for your APIs in the new cluster during the transition, you may reduce `min_replicas` back to your desired level once all traffic has been shifted.

### Spin down the previous cluster

After confirming that your previous cluster has completed servicing all existing traffic and is not receiving any new traffic, spin down your previous cluster:

```bash
# Note: it is recommended to install the Cortex CLI matching the previous cluster's version to ensure proper deletion.

cortex cluster down --name <previous_cluster_name> --region <region>
```