
Commit 37d0000

Update docs (#861)
(cherry picked from commit da6324b)
1 parent 0b897af commit 37d0000

File tree

22 files changed: +762 −656 lines changed


README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -111,8 +111,8 @@ positive
 ```bash
 $ cortex get sentiment-classifier --watch
 
-status   up-to-date   requested   last update   avg inference   2XX
-live     1            1           8s            24ms            12
+status   up-to-date   requested   last update   avg request   2XX
+live     1            1           8s            24ms          12
 
 class count
 positive 8
````

docs/dependency-management/python-packages.md

Lines changed: 0 additions & 54 deletions
This file was deleted.

docs/deployments/api-configuration.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@

# API configuration

Once your model is [exported](exporting.md) and you've implemented a [Predictor](predictors.md), you can configure your API via a yaml file (typically named `cortex.yaml`).

Reference the section below which corresponds to your Predictor type: [Python](#python-predictor), [TensorFlow](#tensorflow-predictor), or [ONNX](#onnx-predictor).

## Python Predictor

```yaml
- name: <string>  # API name (required)
  endpoint: <string>  # the endpoint for the API (default: <api_name>)
  predictor:
    type: python
    path: <string>  # path to a python file with a PythonPredictor class definition, relative to the Cortex root (required)
    config: <string: value>  # arbitrary dictionary passed to the constructor of the Predictor (optional)
    python_path: <string>  # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
    env: <string: string>  # dictionary of environment variables
  tracker:
    key: <string>  # the JSON key in the response to track (required if the response payload is a JSON object)
    model_type: <string>  # model type, must be "classification" or "regression" (required)
  compute:
    cpu: <string | int | float>  # CPU request per replica (default: 200m)
    gpu: <int>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
  autoscaling:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    workers_per_replica: <int>  # the number of parallel serving workers to run on each replica (default: 1)
    threads_per_worker: <int>  # the number of threads per worker (default: 1)
    target_replica_concurrency: <float>  # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
    max_replica_concurrency: <int>  # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
    window: <duration>  # the time over which to average the API's concurrency (default: 60s)
    downscale_stabilization_period: <duration>  # the API will not scale below the highest recommendation made during this period (default: 5m)
    upscale_stabilization_period: <duration>  # the API will not scale above the lowest recommendation made during this period (default: 0m)
    max_downscale_factor: <float>  # the maximum factor by which to scale down the API on a single scaling event (default: 0.5)
    max_upscale_factor: <float>  # the maximum factor by which to scale up the API on a single scaling event (default: 10)
    downscale_tolerance: <float>  # any recommendation falling within this factor below the current number of replicas will not trigger a scale down event (default: 0.1)
    upscale_tolerance: <float>  # any recommendation falling within this factor above the current number of replicas will not trigger a scale up event (default: 0.1)
  update_strategy:
    max_surge: <string | int>  # maximum number of replicas that can be scheduled above the desired number of replicas during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
    max_unavailable: <string | int>  # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
```

See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), and [prediction monitoring](prediction-monitoring.md).
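For illustration, a minimal `cortex.yaml` entry for a Python Predictor might look like the sketch below; the API name, file path, and `config` keys are placeholders, and omitted fields fall back to the defaults listed above.

```yaml
- name: sentiment-classifier
  predictor:
    type: python
    path: predictor.py
    config:
      model_path: s3://my-bucket/model  # illustrative key passed to the Predictor's constructor
  compute:
    cpu: 200m
    mem: 1G
  autoscaling:
    min_replicas: 1
    max_replicas: 4
```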
## TensorFlow Predictor

```yaml
- name: <string>  # API name (required)
  endpoint: <string>  # the endpoint for the API (default: <api_name>)
  predictor:
    type: tensorflow
    path: <string>  # path to a python file with a TensorFlowPredictor class definition, relative to the Cortex root (required)
    model: <string>  # S3 path to an exported model (e.g. s3://my-bucket/exported_model) (required)
    signature_key: <string>  # name of the signature def to use for prediction (required if your model has more than one signature def)
    config: <string: value>  # arbitrary dictionary passed to the constructor of the Predictor (optional)
    python_path: <string>  # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
    env: <string: string>  # dictionary of environment variables
  tracker:
    key: <string>  # the JSON key in the response to track (required if the response payload is a JSON object)
    model_type: <string>  # model type, must be "classification" or "regression" (required)
  compute:
    cpu: <string | int | float>  # CPU request per replica (default: 200m)
    gpu: <int>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
  autoscaling:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    workers_per_replica: <int>  # the number of parallel serving workers to run on each replica (default: 1)
    threads_per_worker: <int>  # the number of threads per worker (default: 1)
    target_replica_concurrency: <float>  # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
    max_replica_concurrency: <int>  # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
    window: <duration>  # the time over which to average the API's concurrency (default: 60s)
    downscale_stabilization_period: <duration>  # the API will not scale below the highest recommendation made during this period (default: 5m)
    upscale_stabilization_period: <duration>  # the API will not scale above the lowest recommendation made during this period (default: 0m)
    max_downscale_factor: <float>  # the maximum factor by which to scale down the API on a single scaling event (default: 0.5)
    max_upscale_factor: <float>  # the maximum factor by which to scale up the API on a single scaling event (default: 10)
    downscale_tolerance: <float>  # any recommendation falling within this factor below the current number of replicas will not trigger a scale down event (default: 0.1)
    upscale_tolerance: <float>  # any recommendation falling within this factor above the current number of replicas will not trigger a scale up event (default: 0.1)
  update_strategy:
    max_surge: <string | int>  # maximum number of replicas that can be scheduled above the desired number of replicas during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
    max_unavailable: <string | int>  # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
```

See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), and [prediction monitoring](prediction-monitoring.md).
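As a sketch, a minimal TensorFlow Predictor entry could look like this; the API name, file path, bucket, and signature def name are illustrative:

```yaml
- name: my-tf-api
  predictor:
    type: tensorflow
    path: predictor.py
    model: s3://my-bucket/exported_model
    signature_key: serving_default  # only needed if the model has more than one signature def
```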
## ONNX Predictor

```yaml
- name: <string>  # API name (required)
  endpoint: <string>  # the endpoint for the API (default: <api_name>)
  predictor:
    type: onnx
    path: <string>  # path to a python file with an ONNXPredictor class definition, relative to the Cortex root (required)
    model: <string>  # S3 path to an exported model (e.g. s3://my-bucket/exported_model.onnx) (required)
    config: <string: value>  # arbitrary dictionary passed to the constructor of the Predictor (optional)
    python_path: <string>  # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
    env: <string: string>  # dictionary of environment variables
  tracker:
    key: <string>  # the JSON key in the response to track (required if the response payload is a JSON object)
    model_type: <string>  # model type, must be "classification" or "regression" (required)
  compute:
    cpu: <string | int | float>  # CPU request per replica (default: 200m)
    gpu: <int>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
  autoscaling:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    workers_per_replica: <int>  # the number of parallel serving workers to run on each replica (default: 1)
    threads_per_worker: <int>  # the number of threads per worker (default: 1)
    target_replica_concurrency: <float>  # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
    max_replica_concurrency: <int>  # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
    window: <duration>  # the time over which to average the API's concurrency (default: 60s)
    downscale_stabilization_period: <duration>  # the API will not scale below the highest recommendation made during this period (default: 5m)
    upscale_stabilization_period: <duration>  # the API will not scale above the lowest recommendation made during this period (default: 0m)
    max_downscale_factor: <float>  # the maximum factor by which to scale down the API on a single scaling event (default: 0.5)
    max_upscale_factor: <float>  # the maximum factor by which to scale up the API on a single scaling event (default: 10)
    downscale_tolerance: <float>  # any recommendation falling within this factor below the current number of replicas will not trigger a scale down event (default: 0.1)
    upscale_tolerance: <float>  # any recommendation falling within this factor above the current number of replicas will not trigger a scale up event (default: 0.1)
  update_strategy:
    max_surge: <string | int>  # maximum number of replicas that can be scheduled above the desired number of replicas during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
    max_unavailable: <string | int>  # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
```

See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), and [prediction monitoring](prediction-monitoring.md).

docs/deployments/compute.md

Lines changed: 3 additions & 4 deletions
```diff
@@ -25,7 +25,6 @@ One unit of memory is one byte. Memory can be expressed as an integer or by usin
 
 ## GPU
 
-1. Make sure your AWS account is subscribed to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM).
-2. You may need to [file an AWS support ticket](https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances) to increase the limit for your desired instance type.
-3. Set instance type to an AWS GPU instance (e.g. p2.xlarge) when installing Cortex.
-4. Note that one unit of GPU corresponds to one virtual GPU on AWS. Fractional requests are not allowed.
+One unit of GPU corresponds to one virtual GPU. Fractional requests are not allowed.
+
+See [GPU documentation](gpus.md) for more information.
```
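For illustration, requesting one GPU per replica in an API's `compute` block (see the API configuration reference above) might look like the sketch below; the `cpu` and `mem` values are placeholders:

```yaml
compute:
  cpu: 1
  gpu: 1
  mem: 4Gi
```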

docs/deployments/deployment.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

# API deployment

Once your model is [exported](exporting.md), you've implemented a [Predictor](predictors.md), and you've [configured your API](api-configuration.md), you're ready to deploy!

## `cortex deploy`

The `cortex deploy` command collects your configuration and source code and deploys your API on your cluster:

```bash
$ cortex deploy

creating my-api
```

APIs are declarative, so to update your API, simply modify your source code and/or configuration and run `cortex deploy` again.

## `cortex get`

The `cortex get` command displays the status of your APIs, and `cortex get <api_name>` shows additional information about a specific API.

```bash
$ cortex get my-api

status   up-to-date   requested   last update   avg request   2XX
live     1            1           1m            -             -

endpoint: http://***.amazonaws.com/iris-classifier
...
```

Appending the `--watch` flag will re-run the `cortex get` command every second.
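For example, to keep the status output above refreshing for the same hypothetical API:

```bash
$ cortex get my-api --watch
```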
## `cortex logs`

You can stream logs from your API using the `cortex logs` command:

```bash
$ cortex logs my-api
```

## Making a prediction

You can use `curl` to test your prediction service, for example:

```bash
$ curl http://***.amazonaws.com/my-api \
    -X POST -H "Content-Type: application/json" \
    -d '{"key": "value"}'
```

## Debugging

You can log information about each request by adding the `?debug=true` parameter to your requests. This will print the payload and the value after running your `predict()` function in the API logs.
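For example, reusing the hypothetical request from above with the debug parameter appended:

```bash
$ curl "http://***.amazonaws.com/my-api?debug=true" \
    -X POST -H "Content-Type: application/json" \
    -d '{"key": "value"}'
```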
## `cortex delete`

You can delete your API with the `cortex delete` command:

```bash
$ cortex delete my-api

deleting my-api
```

## Additional resources

<!-- CORTEX_VERSION_MINOR -->
* [Tutorial](../../examples/sklearn/iris-classifier/README.md) provides a step-by-step walkthrough of deploying an iris classifier API
* [CLI documentation](../cluster-management/cli.md) lists all CLI commands
* [Examples](https://github.com/cortexlabs/cortex/tree/0.14/examples) demonstrate how to deploy models from common ML libraries
