-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix(enqueue): capacity plugin job enqueueable #4825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix(enqueue): capacity plugin job enqueueable #4825
Conversation
Adding a log line to the enqueue action with the maximum verbosity to improve traceability and make the enqueue process easier to follow. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
|
@hajnalmt: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Summary of ChangesHello @hajnalmt, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the job enqueueing process within the capacity plugin by resolving a bug that prevented legitimate jobs from being enqueued under certain resource allocation conditions. It introduces more flexible resource evaluation, allowing jobs to progress when their requested resources fall within a queue's guaranteed or deserved limits, even if the overall capacity is temporarily oversubscribed. This enhancement ensures more efficient utilization of cluster resources and a smoother workflow for jobs in various queue configurations. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable fix and enhancement to the capacity plugin's enqueue logic, allowing jobs to be enqueued based on guarantee or deserved resources, which resolves issues in hierarchical queue scenarios. The changes are well-supported by a comprehensive set of new test cases that cover the described scenarios. My review includes a suggestion to refactor a part of the new logic to improve maintainability by reducing code duplication.
52b26f4 to
7d1fd4e
Compare
Enhancing the capacity plugin enqueue-ing process. When there is a single queue that defines guarantee on the cluster we allow enqueue below guarantee for the incoming tasks. The "below" here means that if there is a single resource dimension the task requests which with we are still equal or less in guarantee then we should enqueue the task. This means in theory that there can be dimensions where the task will be above the queue capability, but we still enqueue. This case is the matter of the allocate preempt and reclaim action I think. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
New unit tests in the capacity plugin which are verifying the below deserved/guarantee enqueue process. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Moving the test functions in the capacity plugins to the common test utils, so they can be reused elsewhere too. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
7d1fd4e to
58e0dce
Compare
zjj2wry
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, the only concern is excessive enqueueing when GPU is insufficient but CPU is sufficient. However, I can configure child queues with GPU guarantee only
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: zjj2wry The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
I'm wondering that in this case: Why leaf-1 can allocate 100CPU? Because leaf-2 has 40CPU guarantee resources, leaf-1's realCapability should be 60CPU, why leaf-1 can allocate 100CPU? BTW, I think the case is very weird, we prefer to comply guarantee < deserved < capability, don't know we have such a real scenario totalGuarantee == totalResource |
The scenario where leaf-2 is created after leaf-1 has already allocated resources. when cluster resources are insufficient and new nodes are added through scaling, existing pending tasks may immediately consume all the newly added resources, preventing new queues from scheduling pods even if they have guaranteed resources. |
|
the job will always enqueue like below. i think we should checks all dimensions of resources when determining if a job can be enqueue and preempt. |
What type of PR is this?
/kind bug
/kind enhancement
What this PR does / why we need it:
The PR contains some enhancements and a bugfix regarding the capacity plugin's enqueue process.
Currently in the capacity plugin we rely purely on realCapability to decide wheter a job shall enqueue or not.
This is a little bit problematic, as there are cases when we want to enqueue below deserved or guarantee directly as
zjj2wry points out here: #4817 (comment)
For example, when we have a queue structure where:
The leaf-2 queue cannot enqueue, although having guaranteed resources on the cluster.
There can be use cases where we don't
guaranteeresources on the cluster to any queues, rather than we are slicing the cluster ondeservedvalues, for example if we omit guarantee in the above example:leaf-2 in this case will not able to enqueue too due to maxed out cluster capacity, although it would deserve 40
For this use case (when no guarantee is specified) we should enqueue too I think.
The comparision uses
LessEqualPartlyWithDimensionZeroFilteredso if there exists any dimension that is below or equalin
deserved/guaranteewe still enqueue. This means in theory that there can be dimensions where the job's minResources will increase theallocated+inqueuedqueue resources above the queue capability value, which is fine I think as this is the realm ofallocaterather thanenqueue, so the principle that we don't let resources to be created in a queue above the capability is not harmed.Additionally, these examples are for hierarchical queues, but the causes and the induction is true for the non-hierarchical cases too.
Minor enhancement in the
enqueueaction logging and thecapacityplugins tests has been added to the PR.Which issue(s) this PR fixes:
Fixes #4817
Special notes for your reviewer:
Does this PR introduce a user-facing change?