Description
Things to mention:
- Mention multi-model endpoints, and link to example(s)
- CPUs are cheaper than GPUs
- User should consider spot instances; include a sample config like this (also mention using similar instance types in `instance_distribution`, sketched after the config below, and link to the spot docs):
```yaml
# cluster.yaml
cluster_name: cortex
region: us-west-2
instance_type: g4dn.xlarge
min_instances: 0
max_instances: 20
spot: true
spot_config:
  on_demand_base_capacity: 0
  on_demand_percentage_above_base_capacity: 0
  on_demand_backup: true
```
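For the `instance_distribution` note above, here is a hedged sketch of how similar instance types could be listed in `spot_config` (the alternative instance type below is an illustrative example, not a recommendation; specs and pricing should be verified first):

```yaml
# cluster.yaml (sketch: listing similar instance types for spot)
spot: true
spot_config:
  # illustrative: instance types comparable to g4dn.xlarge that the
  # cluster may fall back to when spot capacity is unavailable
  instance_distribution: [g4dn.xlarge, g4dn.2xlarge]
  on_demand_base_capacity: 0
  on_demand_percentage_above_base_capacity: 0
  on_demand_backup: true
```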
Here is some sample text:
APIs will be able to scale down to 1 replica per API, but not to 0. So if you have 9 APIs running, there will be a minimum of 9 replicas. Terminating instances from the AWS console will not help, since Cortex will consider this an unexpected state and will re-create the instances. You can delete APIs to reduce the number of instances (`cortex delete <api_name>`), or you can serve multiple models from a single API, as is done in the pytorch/multi-model-text-analyzer example (this way you would have one endpoint, and would choose which of the 9 models to run based on a query parameter in the request URL).
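To make the multi-model pattern concrete, here is a minimal sketch of a predictor that picks a model based on a query parameter. It assumes Cortex's Python Predictor interface with a `predict(payload, query_params)` signature (the exact signature varies across Cortex versions), and the model names and files are hypothetical; the pytorch/multi-model-text-analyzer example shows the real implementation.

```python
# predictor.py -- sketch; model files and names are hypothetical
import torch


class PythonPredictor:
    def __init__(self, config):
        # load each model once at startup; the dict keys become the
        # accepted values for the ?model=... query parameter
        self.models = {
            "sentiment": torch.jit.load("sentiment.pt"),
            "summarizer": torch.jit.load("summarizer.pt"),
        }

    def predict(self, payload, query_params):
        # choose the model named in the request URL,
        # e.g. POST <endpoint>?model=sentiment
        model_name = query_params.get("model", "sentiment")
        if model_name not in self.models:
            return {"error": f"unknown model: {model_name}"}
        with torch.no_grad():
            # input handling is model-specific; this assumes the payload
            # carries a JSON list under "input"
            output = self.models[model_name](torch.tensor(payload["input"]))
        return output.tolist()
```

A request would then select the model via the URL, e.g. `curl "<endpoint>?model=sentiment" -d '{"input": [...]}'`.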