Reclaim is not working with GPUSharingEnable if resource used is volcano.sh/gpu-memory #2739
Comments
The issue is in the checkNodeGPUSharingPredicate function (see the linked line), and also in FilterNode (see link).
It should check whether the pod can fit on the node at all, not whether there is currently enough free GPU memory left, which is the check Allocate needs.
Checking in the predicate whether the node's GPU resources are currently sufficient will indeed affect the preempt and reclaim actions. Is there a better solution? @archlitchi
@archlitchi @wangyang0616 I will be happy to work on this, as we need it working in our production environment.
Thank you very much for your contribution; we look forward to your proposal. @igormishsky
/priority important-soon
In my opinion, a better way would be to split the original predicate into separate parts. As to this question, we would then only need to invoke the part of the predicate that is appropriate for each action.
I understand that this is a general problem, not just for GPU: for other algorithm plugins as well, if resource-related filtering logic is added to the predicate, it will affect the preemption function. I made a fix following the approach suggested by @jiangkaihua and separated the resource-related filtering logic from the predicate.
Please help review the PR: #2818
It is indeed sensible to differentiate between preempt and allocate, as they are not equivalent. To confirm my understanding: in the GPU context, the fit predicate should ascertain whether the node has sufficient GPU resources overall, while the resource predicate should evaluate the unused resources currently available on the node. If I have misunderstood, please feel free to correct me. Regarding the naming, the current predicate is really a "predicateFit", since it checks whether the pod can fit on the node, whereas predicateResource is the predicate actually being used at the moment (e.g., assessing "free" GPU resources). We hope this can be merged smoothly, as it is straightforward and should not cause side effects: no existing behavior changes, and further adjustments will only come after the new predicate is implemented, naming concerns aside.
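As a rough illustration of the split discussed above, the sketch below separates a "fit" check (against the node's total sharable GPU memory, suitable for preempt and reclaim) from a "resource" check (against the currently free GPU memory, which Allocate keeps using). The function names and signatures are hypothetical and simplified; they are not the actual code from PR #2818.

```go
package gpushare

import (
	v1 "k8s.io/api/core/v1"
)

// gpuMemoryResource is the extended resource name used by GPU sharing.
const gpuMemoryResource = v1.ResourceName("volcano.sh/gpu-memory")

// podGPUMemoryRequest sums the volcano.sh/gpu-memory requested by all
// containers of the pod.
func podGPUMemoryRequest(pod *v1.Pod) int64 {
	var total int64
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Requests[gpuMemoryResource]; ok {
			total += q.Value()
		}
	}
	return total
}

// predicateGPUFit answers "could this pod ever run on this node?".
// It compares the request only against the node's total sharable GPU memory,
// so preempt and reclaim can still consider nodes whose GPUs are currently
// occupied by reclaimable pods.
func predicateGPUFit(pod *v1.Pod, totalGPUMemory int64) bool {
	return podGPUMemoryRequest(pod) <= totalGPUMemory
}

// predicateGPUResource answers "does this pod fit right now?".
// It compares the request against the GPU memory that is currently free and
// is the check Allocate should keep using.
func predicateGPUResource(pod *v1.Pod, usedGPUMemory, totalGPUMemory int64) bool {
	return podGPUMemoryRequest(pod) <= totalGPUMemory-usedGPUMemory
}
```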
@wangyang0616 @jiangkaihua Any update? Can we continue with the solution proposed by @wangyang0616?
Yes, this is a serious problem. I will adjust the method naming in the PR; please help review it afterwards.
/assign
What happened:
One node with 480 volcano.sh/gpu-memory (factor 10).
Two queues: high with weight 3 and low with weight 1.
After adding 4 jobs with request & limit volcano.sh/gpu-memory=120 to the low queue (weight 1), and then adding a new job with request & limit volcano.sh/gpu-memory=120 to the high queue (weight 3), Volcano does not reclaim any of the jobs running in the low queue (weight 1).
What you expected to happen:
I expect one of the jobs in the low queue (weight 1) to move to the Pending state so that the job in the high queue (weight 3) can start.
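For context, assuming the proportion plugin's usual weight-based share calculation (an assumption about the configuration, not something stated above): with a 3:1 weight ratio the high queue's raw weighted share of the 480 gpu-memory would be 480 × 3/4 = 360 before being capped at what it actually requests, so the low queue holding all 480 is well above its own share and at least one of its 120-memory jobs should be reclaimable.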
How to reproduce it (as minimally and precisely as possible):
Node with allocatable volcano.sh/gpu-memory resources
Two queues: one with a low weight and one with a higher weight
3 jobs running in the low queue
A job for the high queue
(Illustrative manifests for this setup are sketched below.)
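The following manifests are only a sketch of such a setup; the queue names, job name, image, and exact gpu-memory values are illustrative, not the reporter's original manifests.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: high
spec:
  weight: 3
  reclaimable: true
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: low
spec:
  weight: 1
  reclaimable: true
---
# One of the low-queue jobs; the high-queue job is identical except for
# its name and `queue: high`.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-job-low-1
spec:
  schedulerName: volcano
  queue: low
  minAvailable: 1
  tasks:
    - name: worker
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cuda
              image: nvidia/cuda:11.8.0-base-ubuntu22.04
              command: ["sleep", "infinity"]
              resources:
                requests:
                  volcano.sh/gpu-memory: "120"
                limits:
                  volcano.sh/gpu-memory: "120"
```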
Anything else we need to know?:
Environment:
Volcano Version: master
Kubernetes version: v1.22.17
Install tools: helm
Others:
scheduler.conf:
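The original scheduler.conf attached to the issue is not reproduced here. As a purely illustrative sketch of a configuration for this scenario (GPU sharing enabled on the predicates plugin, reclaim action enabled, proportion plugin providing queue weights), it might look like the following; the exact plugin list is an assumption:

```yaml
actions: "enqueue, allocate, preempt, reclaim, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true   # enables volcano.sh/gpu-memory sharing
  - name: proportion
  - name: nodeorder
  - name: binpack
```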