
Enable Replica Auto Scaling down to zero #445

Closed
nickwalton opened this issue Sep 10, 2019 · 5 comments · Fixed by #2298

Labels: enhancement (New feature or request), research (Determine technical constraints)

nickwalton commented Sep 10, 2019

It would be really useful, especially for smaller applications, to be able to scale GPUs down to 0 when there is no traffic.

Possible approach

  • To trigger scaling 1 -> 0, check CloudWatch metrics for no requests over a certain amount of time (user-configurable?); see the rough sketch after this list.
  • Scale 1 -> 0 by setting deployment.spec.replicas to 0.
  • When scaling 1 -> 0, also update the Istio VirtualService to route requests for that API to a new deployment running on the Cortex node (or use the existing operator).
  • Scaling 0 -> 1 is triggered when a request comes in to that service.
  • Scale 0 -> 1 by setting deployment.spec.replicas to 1.
  • Either the service holds onto the request until the pod is ready, forwards it, and replies with the response, or it responds immediately with a message like "0 -> 1 scaling has been triggered, please try again in a few minutes".
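
Very rough sketch of the scaling check and the replica patch (the CloudWatch namespace/metric names and the idle window are placeholders; this assumes per-API request counts are published to CloudWatch):

```python
import datetime

import boto3
from kubernetes import client, config

IDLE_WINDOW_MINUTES = 30  # user-configurable period with no traffic before scaling to 0


def recent_request_count(api_name: str) -> float:
    """Sum of requests to the API over the idle window (assumes a per-API CloudWatch metric)."""
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="cortex",           # placeholder namespace
        MetricName="RequestCount",    # placeholder metric name
        Dimensions=[{"Name": "APIName", "Value": api_name}],
        StartTime=now - datetime.timedelta(minutes=IDLE_WINDOW_MINUTES),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])


def set_replicas(deployment: str, namespace: str, replicas: int) -> None:
    """Scale the API's deployment by patching deployment.spec.replicas."""
    config.load_incluster_config()  # running inside the cluster (e.g. in the operator)
    client.AppsV1Api().patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


# 1 -> 0: no requests during the idle window, so release the GPU replica
if recent_request_count("my-api") == 0:
    set_replicas("my-api", "default", replicas=0)

# 0 -> 1 would be the same call with replicas=1, triggered when a request hits the
# placeholder service, along with swapping the Istio VirtualService route back
```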
@nickwalton nickwalton added the enhancement (New feature or request) label Sep 10, 2019
@deliahu deliahu added the v0.9 label Sep 24, 2019
@deliahu deliahu self-assigned this Sep 24, 2019
@deliahu deliahu changed the title from "Enable Auto Scaling down to 0 for GPUs" to "Enable Cluster Auto Scaling down to 0 for GPUs" Oct 1, 2019
@vishalbollu vishalbollu changed the title from "Enable Cluster Auto Scaling down to 0 for GPUs" to "Enable Replica Auto Scaling down to 0 for GPUs" Oct 7, 2019
@vishalbollu vishalbollu removed the v0.9 label Oct 7, 2019
@deliahu deliahu added the research (Determine technical constraints) label Oct 11, 2019
@deliahu deliahu changed the title from "Enable Replica Auto Scaling down to 0 for GPUs" to "Enable Replica Auto Scaling down to 0 for GPUs [3]" Oct 11, 2019
@deliahu deliahu removed their assignment Oct 11, 2019
@deliahu deliahu added the v0.11 label Nov 6, 2019
@deliahu deliahu removed the v0.11 label Nov 20, 2019
@deliahu deliahu changed the title from "Enable Replica Auto Scaling down to 0 for GPUs [3]" to "Enable Replica Auto Scaling down to 0 [3]" Nov 25, 2019
@deliahu deliahu changed the title from "Enable Replica Auto Scaling down to 0 [3]" to "Enable Replica Auto Scaling down to 0" Dec 20, 2019
@deliahu deliahu added the v0.15 label Jan 22, 2020
@deliahu deliahu added v0.16 and removed v0.15 labels Mar 9, 2020
@deliahu deliahu added the v0.17 label Mar 24, 2020
@deliahu deliahu changed the title from "Enable Replica Auto Scaling down to 0" to "Enable Replica Auto Scaling down to zero" Apr 5, 2020

lefnire commented Dec 8, 2020

Definitely interested in this. It would make Cortex a no-brainer for ML projects that don't yet have enough users to need one always-on GPU. Then we'd benefit from the rest of the scaling/infra when the users do come.


lefnire commented Dec 11, 2020

To be clear, it's the unused running GPU instance we're concerned about. If something has to keep running, that's fine, e.g. a nano/micro instance that serves as the orchestrator and must always be on, if that makes anything easier. (Forgive my Cortex newbishness if that's roughly how it already works.)

deliahu (Member) commented Dec 12, 2020

@lefnire Thanks for reaching out; yes, that makes sense, and is exactly how we'd implement it!

We haven't decided yet on our priority for implementing this feature. One thing that can render it less useful (or at least "awkward") is how long it takes to spin up a GPU instance and install the dependencies on it; we'd have to hold on to the request for 5+ minutes before forwarding it along. A more intuitive approach might be to support an asynchronous API instead, where you make the API request and it responds immediately with an execution ID, and then you can make an additional request to another API to query the status/results for the execution ID (we have #1610 to track this).
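
To illustrate, the client-side flow for that asynchronous pattern would look roughly like this (the endpoint paths and response fields below are made up for the example, not an existing API):

```python
import time

import requests

API_URL = "https://example.com/my-api"  # placeholder endpoint

# Submit the request; the API would respond immediately with an execution ID
execution_id = requests.post(API_URL, json={"text": "hello"}).json()["execution_id"]

# Poll for the result while 0 -> 1 scaling and inference happen in the background
while True:
    status = requests.get(f"{API_URL}/executions/{execution_id}").json()
    if status["status"] == "completed":
        print(status["result"])
        break
    time.sleep(15)
```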

In the meantime, in case it's helpful, it is possible to create/delete APIs programmatically via the Cortex CLI or Python client. So if you know you are expecting traffic, or it happens on a regular schedule, you could create/delete APIs accordingly.
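
For example, if traffic follows a known schedule, something along these lines could be run from cron (the CLI command names may differ slightly depending on your Cortex version, so treat them as placeholders):

```python
import subprocess


def bring_api_up(config_path: str = "cortex.yaml") -> None:
    # Deploy the API when traffic is expected (provisions the GPU-backed replicas)
    subprocess.run(["cortex", "deploy", config_path], check=True)


def take_api_down(api_name: str) -> None:
    # Delete the API afterwards so the GPU instances can spin down
    subprocess.run(["cortex", "delete", api_name], check=True)
```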

Also, we do currently support batch jobs, which is a bit like the asynchronous approach I described, except that autoscaling behaves differently: for batch jobs, you submit a job and indicate how many containers you want to run it on, and then once the job is done, the containers spin down. So it does "scale to 0", but is not designed to handle real-time traffic where each individual request is fairly lightweight, and can come at any time from any source.
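
For reference, submitting a batch job is just a request to the batch API endpoint; the payload below is approximate (please check the batch docs for the exact field names in your version):

```python
import requests

BATCH_API_ENDPOINT = "https://example.com/my-batch-api"  # placeholder endpoint

job_spec = {
    "workers": 2,  # number of containers to run the job on
    "item_list": {
        "items": ["s3://my-bucket/img1.jpg", "s3://my-bucket/img2.jpg"],
        "batch_size": 1,
    },
}

# The response includes a job ID that can be used to query the job's status later
job = requests.post(BATCH_API_ENDPOINT, json=job_spec).json()
print(job)
```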


lefnire commented Dec 12, 2020

> In the meantime, in case it's helpful, it is possible to create/delete APIs programmatically via the Cortex CLI or Python client. So if you know you are expecting traffic, or it happens on a regular schedule, you could create/delete APIs accordingly.

That is indeed helpful. With how valuable Cortex is, any leg-work on our part is worth it for this use-case.

I'll look into the batch jobs. I'm currently using AWS Batch anyway, where the container exit(0)s when it hasn't received traffic for a while, so it's a hybrid of real batch jobs vs. a hosted model, which I might be able to pull off with Cortex batch jobs (roughly the idle-exit pattern sketched below). AWS Batch has given me ulcers, so if Cortex works well, even just replacing Batch could be worth it.
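
For context, the worker side of that hybrid is basically this idle-timeout loop (the queue polling is a stand-in for however the jobs actually arrive):

```python
import sys
import time

IDLE_TIMEOUT = 300  # seconds without work before the container exits


def poll_for_work():
    """Return the next job, or None if there is nothing to do (stand-in)."""
    return None


last_work = time.monotonic()
while True:
    job = poll_for_work()
    if job is not None:
        # process(job) would go here
        last_work = time.monotonic()
    elif time.monotonic() - last_work > IDLE_TIMEOUT:
        sys.exit(0)  # clean exit lets the scheduler scale the fleet down
    else:
        time.sleep(5)
```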

Thanks for the suggestions! I'll keep subscribed to this issue in case scale->0 ever comes.

deliahu (Member) commented Dec 12, 2020

@lefnire Sounds good, feel free to reach out on our gitter if you have any questions!
