
Commit 37d0000

Update docs (#861)
(cherry picked from commit da6324b)
1 parent 0b897af commit 37d0000

File tree

22 files changed: +762 −656 lines changed


README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -111,8 +111,8 @@ positive
 ```bash
 $ cortex get sentiment-classifier --watch
 
-status   up-to-date   requested   last update   avg inference   2XX
-live     1            1           8s            24ms            12
+status   up-to-date   requested   last update   avg request   2XX
+live     1            1           8s            24ms          12
 
 class count
 positive 8
````

docs/dependency-management/python-packages.md

Lines changed: 0 additions & 54 deletions
This file was deleted.

docs/deployments/api-configuration.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@

# API configuration

Once your model is [exported](exporting.md) and you've implemented a [Predictor](predictors.md), you can configure your API via a yaml file (typically named `cortex.yaml`).

Reference the section below which corresponds to your Predictor type: [Python](#python-predictor), [TensorFlow](#tensorflow-predictor), or [ONNX](#onnx-predictor).

## Python Predictor

```yaml
- name: <string>  # API name (required)
  endpoint: <string>  # the endpoint for the API (default: <api_name>)
  predictor:
    type: python
    path: <string>  # path to a python file with a PythonPredictor class definition, relative to the Cortex root (required)
    config: <string: value>  # arbitrary dictionary passed to the constructor of the Predictor (optional)
    python_path: <string>  # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
    env: <string: string>  # dictionary of environment variables
  tracker:
    key: <string>  # the JSON key in the response to track (required if the response payload is a JSON object)
    model_type: <string>  # model type, must be "classification" or "regression" (required)
  compute:
    cpu: <string | int | float>  # CPU request per replica (default: 200m)
    gpu: <int>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
  autoscaling:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    workers_per_replica: <int>  # the number of parallel serving workers to run on each replica (default: 1)
    threads_per_worker: <int>  # the number of threads per worker (default: 1)
    target_replica_concurrency: <float>  # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
    max_replica_concurrency: <int>  # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
    window: <duration>  # the time over which to average the API's concurrency (default: 60s)
    downscale_stabilization_period: <duration>  # the API will not scale below the highest recommendation made during this period (default: 5m)
    upscale_stabilization_period: <duration>  # the API will not scale above the lowest recommendation made during this period (default: 0m)
    max_downscale_factor: <float>  # the maximum factor by which to scale down the API on a single scaling event (default: 0.5)
    max_upscale_factor: <float>  # the maximum factor by which to scale up the API on a single scaling event (default: 10)
    downscale_tolerance: <float>  # any recommendation falling within this factor below the current number of replicas will not trigger a scale down event (default: 0.1)
    upscale_tolerance: <float>  # any recommendation falling within this factor above the current number of replicas will not trigger a scale up event (default: 0.1)
  update_strategy:
    max_surge: <string | int>  # maximum number of replicas that can be scheduled above the desired number of replicas during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
    max_unavailable: <string | int>  # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
```

See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), and [prediction monitoring](prediction-monitoring.md).
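For illustration, a minimal `cortex.yaml` entry for a Python Predictor might look like the sketch below; the API name, file path, and `config` keys are placeholders, and omitted fields fall back to the defaults listed above.

```yaml
- name: sentiment-classifier
  predictor:
    type: python
    path: predictor.py
    config:
      model_path: s3://my-bucket/model  # illustrative key passed to the Predictor's constructor
  compute:
    cpu: 200m
    mem: 1G
  autoscaling:
    min_replicas: 1
    max_replicas: 4
```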
## TensorFlow Predictor

```yaml
- name: <string>  # API name (required)
  endpoint: <string>  # the endpoint for the API (default: <api_name>)
  predictor:
    type: tensorflow
    path: <string>  # path to a python file with a TensorFlowPredictor class definition, relative to the Cortex root (required)
    model: <string>  # S3 path to an exported model (e.g. s3://my-bucket/exported_model) (required)
    signature_key: <string>  # name of the signature def to use for prediction (required if your model has more than one signature def)
    config: <string: value>  # arbitrary dictionary passed to the constructor of the Predictor (optional)
    python_path: <string>  # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
    env: <string: string>  # dictionary of environment variables
  tracker:
    key: <string>  # the JSON key in the response to track (required if the response payload is a JSON object)
    model_type: <string>  # model type, must be "classification" or "regression" (required)
  compute:
    cpu: <string | int | float>  # CPU request per replica (default: 200m)
    gpu: <int>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
  autoscaling:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    workers_per_replica: <int>  # the number of parallel serving workers to run on each replica (default: 1)
    threads_per_worker: <int>  # the number of threads per worker (default: 1)
    target_replica_concurrency: <float>  # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
    max_replica_concurrency: <int>  # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
    window: <duration>  # the time over which to average the API's concurrency (default: 60s)
    downscale_stabilization_period: <duration>  # the API will not scale below the highest recommendation made during this period (default: 5m)
    upscale_stabilization_period: <duration>  # the API will not scale above the lowest recommendation made during this period (default: 0m)
    max_downscale_factor: <float>  # the maximum factor by which to scale down the API on a single scaling event (default: 0.5)
    max_upscale_factor: <float>  # the maximum factor by which to scale up the API on a single scaling event (default: 10)
    downscale_tolerance: <float>  # any recommendation falling within this factor below the current number of replicas will not trigger a scale down event (default: 0.1)
    upscale_tolerance: <float>  # any recommendation falling within this factor above the current number of replicas will not trigger a scale up event (default: 0.1)
  update_strategy:
    max_surge: <string | int>  # maximum number of replicas that can be scheduled above the desired number of replicas during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
    max_unavailable: <string | int>  # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
```

See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), and [prediction monitoring](prediction-monitoring.md).
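As a sketch, a minimal TensorFlow Predictor entry could look like this; the API name, file path, bucket, and signature def name are illustrative:

```yaml
- name: my-tf-api
  predictor:
    type: tensorflow
    path: predictor.py
    model: s3://my-bucket/exported_model
    signature_key: serving_default  # only needed if the model has more than one signature def
```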
## ONNX Predictor

```yaml
- name: <string>  # API name (required)
  endpoint: <string>  # the endpoint for the API (default: <api_name>)
  predictor:
    type: onnx
    path: <string>  # path to a python file with an ONNXPredictor class definition, relative to the Cortex root (required)
    model: <string>  # S3 path to an exported model (e.g. s3://my-bucket/exported_model.onnx) (required)
    config: <string: value>  # arbitrary dictionary passed to the constructor of the Predictor (optional)
    python_path: <string>  # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
    env: <string: string>  # dictionary of environment variables
  tracker:
    key: <string>  # the JSON key in the response to track (required if the response payload is a JSON object)
    model_type: <string>  # model type, must be "classification" or "regression" (required)
  compute:
    cpu: <string | int | float>  # CPU request per replica (default: 200m)
    gpu: <int>  # GPU request per replica (default: 0)
    mem: <string>  # memory request per replica (default: Null)
  autoscaling:
    min_replicas: <int>  # minimum number of replicas (default: 1)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
    workers_per_replica: <int>  # the number of parallel serving workers to run on each replica (default: 1)
    threads_per_worker: <int>  # the number of threads per worker (default: 1)
    target_replica_concurrency: <float>  # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
    max_replica_concurrency: <int>  # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
    window: <duration>  # the time over which to average the API's concurrency (default: 60s)
    downscale_stabilization_period: <duration>  # the API will not scale below the highest recommendation made during this period (default: 5m)
    upscale_stabilization_period: <duration>  # the API will not scale above the lowest recommendation made during this period (default: 0m)
    max_downscale_factor: <float>  # the maximum factor by which to scale down the API on a single scaling event (default: 0.5)
    max_upscale_factor: <float>  # the maximum factor by which to scale up the API on a single scaling event (default: 10)
    downscale_tolerance: <float>  # any recommendation falling within this factor below the current number of replicas will not trigger a scale down event (default: 0.1)
    upscale_tolerance: <float>  # any recommendation falling within this factor above the current number of replicas will not trigger a scale up event (default: 0.1)
  update_strategy:
    max_surge: <string | int>  # maximum number of replicas that can be scheduled above the desired number of replicas during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
    max_unavailable: <string | int>  # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
```

See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), and [prediction monitoring](prediction-monitoring.md).

docs/deployments/compute.md

Lines changed: 3 additions & 4 deletions
```diff
@@ -25,7 +25,6 @@ One unit of memory is one byte. Memory can be expressed as an integer or by usin
 
 ## GPU
 
-1. Make sure your AWS account is subscribed to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM).
-2. You may need to [file an AWS support ticket](https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances) to increase the limit for your desired instance type.
-3. Set instance type to an AWS GPU instance (e.g. p2.xlarge) when installing Cortex.
-4. Note that one unit of GPU corresponds to one virtual GPU on AWS. Fractional requests are not allowed.
+One unit of GPU corresponds to one virtual GPU. Fractional requests are not allowed.
+
+See [GPU documentation](gpus.md) for more information.
```
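For illustration, requesting one GPU per replica in an API's `compute` block (see the API configuration reference above) might look like the sketch below; the `cpu` and `mem` values are placeholders:

```yaml
compute:
  cpu: 1
  gpu: 1
  mem: 4Gi
```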

docs/deployments/deployment.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

# API deployment

Once your model is [exported](exporting.md), you've implemented a [Predictor](predictors.md), and you've [configured your API](api-configuration.md), you're ready to deploy!

## `cortex deploy`

The `cortex deploy` command collects your configuration and source code and deploys your API on your cluster:

```bash
$ cortex deploy

creating my-api
```

APIs are declarative, so to update your API, simply modify your source code and/or configuration and run `cortex deploy` again.

## `cortex get`

The `cortex get` command displays the status of your APIs, and `cortex get <api_name>` shows additional information about a specific API.

```bash
$ cortex get my-api

status   up-to-date   requested   last update   avg request   2XX
live     1            1           1m            -             -

endpoint: http://***.amazonaws.com/iris-classifier
...
```

Appending the `--watch` flag will re-run the `cortex get` command every second.
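For example, to keep the status output above refreshing for the same hypothetical API:

```bash
$ cortex get my-api --watch
```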
## `cortex logs`

You can stream logs from your API using the `cortex logs` command:

```bash
$ cortex logs my-api
```

## Making a prediction

You can use `curl` to test your prediction service, for example:

```bash
$ curl http://***.amazonaws.com/my-api \
    -X POST -H "Content-Type: application/json" \
    -d '{"key": "value"}'
```

## Debugging

You can log information about each request by adding the `?debug=true` parameter to your requests. This will print the payload and the value after running your `predict()` function in the API logs.
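For example, reusing the hypothetical request from above with the debug parameter appended:

```bash
$ curl "http://***.amazonaws.com/my-api?debug=true" \
    -X POST -H "Content-Type: application/json" \
    -d '{"key": "value"}'
```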
## `cortex delete`

You can delete your API with the `cortex delete` command:

```bash
$ cortex delete my-api

deleting my-api
```

## Additional resources

<!-- CORTEX_VERSION_MINOR -->
* [Tutorial](../../examples/sklearn/iris-classifier/README.md) provides a step-by-step walkthrough of deploying an iris classifier API
* [CLI documentation](../cluster-management/cli.md) lists all CLI commands
* [Examples](https://github.com/cortexlabs/cortex/tree/0.14/examples) demonstrate how to deploy models from common ML libraries
