Closed
Description
It would be really useful, especially for smaller applications, to be able to scale GPUs down to 0 when there is no traffic.
Possible approach
- To trigger scaling 1 -> 0, check CloudWatch metrics for no requests for a certain amount of time (user-configurable?).
- Scale 1 -> 0 by setting deployment.spec.replicas to 0.
- When scaling 1 -> 0, also update the Istio Virtual Service to route requests to that API to a new deployment running in the Cortex node (or use the existing operator)
- 0 -> 1 scaling is triggered when a request comes in to that service
- Scale 0 -> 1 by setting deployment.spec.replicas to 1
- While the pod starts, the service either holds onto the request until the pod is ready, forwards it, and replies with the response, or responds immediately with a message such as "0 -> 1 scaling has been triggered, please try again in a few minutes"
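
The scaling decision described above could be sketched roughly as follows. This is only an illustration of the idea, not Cortex code: `ApiState`, `desired_replicas`, `idle_timeout_s`, and `pending_requests` are hypothetical names, and the sketch assumes the request timestamps and pending-request count have already been collected (e.g. from CloudWatch metrics and from the service that receives traffic while the API is at 0). Applying the result to deployment.spec.replicas and updating the Virtual Service are left out.

```python
from dataclasses import dataclass


@dataclass
class ApiState:
    # Hypothetical snapshot of one API, for illustration only.
    replicas: int           # current value of deployment.spec.replicas
    last_request_at: float  # time of the most recent request, in seconds


def desired_replicas(state: ApiState, now: float, idle_timeout_s: float,
                     pending_requests: int) -> int:
    """Return the replica count the deployment should be set to."""
    if pending_requests > 0:
        # 0 -> 1: an incoming request wakes the API up.
        # An API that is already running keeps its current replica count.
        return max(state.replicas, 1)

    idle_for = now - state.last_request_at
    if state.replicas == 1 and idle_for >= idle_timeout_s:
        # 1 -> 0: no requests for the (user-configurable) idle period.
        return 0

    # Otherwise leave the deployment as it is.
    return state.replicas


if __name__ == "__main__":
    idle_api = ApiState(replicas=1, last_request_at=0.0)
    print(desired_replicas(idle_api, now=600.0, idle_timeout_s=300.0,
                           pending_requests=0))
```

Keeping the decision as a pure function of the observed state makes the idle timeout easy to test and tune independently of how the metrics are fetched or how the deployment is patched.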