Enable Replica Auto Scaling down to zero #445
Comments
Definitely interested in this. It would make Cortex a no-brainer for ML projects that don't yet have enough users to need one consistently running GPU. Then we'd benefit from the rest of the scaling/infra when the users do come.
To be clear, it's the unused running GPU instance we're concerned about. If something has to keep running, that's fine, e.g. a nano/micro that serves as the orchestrator and must always be on, if that makes anything easier. (Forgive my Cortex newbishness if that's kinda how it already works.)
@lefnire Thanks for reaching out; yes, that makes sense, and is exactly how we'd implement it! We haven't decided yet on our priority for implementing this feature. One thing that can render it less useful (or at least "awkward") is how long it takes to spin up a GPU instance and install its dependencies; we'd have to hold on to the request for 5+ minutes before forwarding it along. A more intuitive approach might be to support an asynchronous API instead, where you make the API request and it responds immediately with an execution ID, and you can then make an additional request to another API to query the status/results for that execution ID (we have #1610 to track this).

In the meantime, in case it's helpful, it is possible to create/delete APIs programmatically via the Cortex CLI or Python client. So if you know you are expecting traffic, or it happens on a regular schedule, you could create and delete APIs accordingly.

Also, we do currently support batch jobs, which are a bit like the asynchronous approach I described, except that autoscaling behaves differently: for batch jobs, you submit a job and indicate how many containers you want to run it on, and once the job is done, the containers spin down. So it does "scale to 0", but it is not designed to handle real-time traffic where each individual request is fairly lightweight and can come at any time from any source.
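In case a concrete example helps, here is a minimal sketch of the create/delete-on-a-schedule idea, driving the Cortex CLI from Python. The API name, config path, and traffic window below are assumptions for illustration, and command names/flags may differ slightly between Cortex versions (the Python client could be used equivalently):

```python
"""Hypothetical scheduler: deploy the API shortly before an expected traffic
window and delete it afterwards, so no GPU instance sits idle in between."""
import subprocess
import time

API_NAME = "my-gpu-api"                 # assumed API name from cortex.yaml
CONFIG_FILE = "cortex.yaml"             # assumed deployment config path
TRAFFIC_WINDOW_SECONDS = 2 * 60 * 60    # e.g. a known 2-hour traffic window


def create_api() -> None:
    # Deploy the API defined in cortex.yaml (spins up the GPU instance).
    subprocess.run(["cortex", "deploy", CONFIG_FILE], check=True)


def delete_api() -> None:
    # Tear the API down again so the GPU instance can go away.
    subprocess.run(["cortex", "delete", API_NAME], check=True)


if __name__ == "__main__":
    create_api()
    try:
        # Keep the API up only for the window where traffic is expected.
        time.sleep(TRAFFIC_WINDOW_SECONDS)
    finally:
        delete_api()
```

The same pattern works from a cron job or any scheduler you already run; the point is just that "scale to 0" can be approximated today by creating and deleting the API around known traffic.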
That is indeed helpful. With how valuable Cortex is, any leg-work on our part is worth it for this use case. I'll look into the batch jobs; I'm currently using AWS Batch anyway. Thanks for the suggestions! I'll stay subscribed to this issue in case scale-to-0 ever comes.
It would be really useful, especially for smaller applications, to be able to scale GPUs down to 0 when there is no traffic.
Possible approach: set `deployment.spec.replicas` to 0.
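For reference, a minimal sketch of that approach using the Kubernetes Python client, assuming direct access to the cluster; the deployment and namespace names are hypothetical, since Cortex manages its own resource names. This only covers the scale-down side; the hard part discussed above (holding or queueing requests while the GPU instance spins back up) is not addressed here:

```python
"""Sketch: patch the API's Kubernetes Deployment to 0 replicas while idle,
and back to 1 before forwarding traffic."""
from kubernetes import client, config

NAMESPACE = "default"        # assumed namespace of the API deployment
DEPLOYMENT = "my-gpu-api"    # assumed deployment name


def scale(replicas: int) -> None:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    # Patch only deployment.spec.replicas, leaving the rest of the spec intact.
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    scale(0)  # scale to zero while idle; scale(1) when traffic is expected
```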