Skip to content

Conversation

@guoqinwill
Copy link
Contributor

@guoqinwill guoqinwill commented Dec 10, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

origin pr:#4407

Which issue(s) this PR fixes:

Fixes #3738

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2025
@volcano-sh-bot
Copy link
Contributor

@guoqinwill: The label(s) kind/improvement, kind//kind cannot be applied, because the repository doesn't have them.

Details

In response to this:

What type of PR is this?

/kind improvement /kind bug

What this PR does / why we need it:

origin pr:#4407

Which issue(s) this PR fixes:

Fixes #3738

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign shinytang6 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link

Summary of Changes

Hello @guoqinwill, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the reclaim action within the scheduler to enhance its robustness and correctness. The primary goal is to ensure that tasks explicitly marked with a PreemptNever policy do not inadvertently prevent other eligible tasks from reclaiming necessary resources. By restructuring the reclamation process and introducing a dedicated helper function, the scheduler can now gracefully handle non-preemptable tasks while continuing to evaluate and execute preemption for others, thereby improving overall resource management and addressing a known issue.

Highlights

  • Refactored Reclaim Action Logic: The core logic for reclaiming resources has been extracted into a new helper function, reclaimForTask, to improve modularity and readability within the Execute function.
  • Improved Handling of PreemptNever Policy: The refactoring addresses a bug where tasks with a PreemptNever policy could block subsequent tasks in the same job or queue from initiating reclamation. The updated logic now correctly skips such tasks, allowing other eligible tasks to proceed.
  • Enhanced Iteration for Reclamation: The main Execute function now features nested loops that iterate through queues, jobs, and their candidate tasks more systematically, ensuring that all potential reclamation opportunities are considered, even when some tasks are non-preemptable.
  • New Test Case for Preemption Policy: A new test case, Reclaim succeeds for second task when first task has PreemptionPolicy=Never, has been added to validate the correct behavior of the reclaim action when encountering tasks with PreemptNever policies.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 10, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the reclaim action to correctly handle tasks with a PreemptionPolicy of Never, preventing them from blocking other tasks. The overall change is an improvement and the new test case is well-suited to verify the fix. However, I've found a critical issue in the new loop structure that could lead to an infinite loop, and another bug in how task assignment is tracked within a job. I've also noted a leftover debug log that should be removed. My suggestions aim to fix these new issues while preserving the intended refactoring.

@JesseStutler
Copy link
Member

/cc

@JesseStutler
Copy link
Member

Also /assign @hajnalmt mate is also familiar with the reclaim logic

@guoqinwill guoqinwill force-pushed the fix-reclaim-1 branch 5 times, most recently from e4f55b6 to bcc4023 Compare December 11, 2025 12:28
Copy link
Contributor

@hajnalmt hajnalmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great contribution! Thank you @guoqinwill !
The logic behind the action looks flawless.

Following Jesse's comment, I also think the variable names for the two boolean variable assigned and the assignedInJob can be improved, to reflect the intention. (I don't like it in preempt action too.)
Could we rename them to taskPipelined and taskPipelinedInJob for clarity?
These names better reflect that the variables indicate whether a task was pipelined in the loop and within the job, respectively.

task.UID, n.Name, ssn.UID, rollbackErr)
}
}
assigned = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder that here assigned should be renamed like reclaimed, but I still feel this variable is unuseful in reclaim scenario

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a variable named reclaimed at line 213, that's why I suggested taskPipelined for this, but we can rename the other variable to reclaimedResources and name this to reclaimed maybe?
This variable indicates if there was a task pipelined in the function which is I think important to continue the reclaim in the other tasks in the job.

predicateNodes, _ := predicateHelper.PredicateNodes(task, totalNodes, ssn.PredicateForPreemptAction, false)
assigned := false

for _, n := range predicateNodes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3997 Currently, reclaim is done by traversing nodes, instead of first filtering candidates by node and then sorting them. So, I'm wondering if we can fix this issue together? Imagine this worst-case scenario: each job's pods are scattered across different nodes, and then a job wants to reclaim, it might reclaim one pod from each of the other jobs. If we sort by priority, this would make it easier for pods from lower-priority jobs to be reclaimed first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there wasn't any progress in this Issue.

How do you imagine the implementation for this?
Do we want to create a VictimNodeOrderFn implemented in the nodeOrder plugin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the modifications to this issue can be split up for now, and we can focus on refactoring the reclaim process first.

Copy link
Contributor

@hajnalmt hajnalmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my comments too 👍

task.UID, n.Name, ssn.UID, rollbackErr)
}
}
assigned = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a variable named reclaimed at line 213, that's why I suggested taskPipelined for this, but we can rename the other variable to reclaimedResources and name this to reclaimed maybe?
This variable indicates if there was a task pipelined in the function which is I think important to continue the reclaim in the other tasks in the job.

"volcano.sh/volcano/pkg/scheduler/util"
)

func init() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you leave the function definition here? Is it called somehwere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because predicateHelper.PredicateNodes was introduced, the unit test needs to initialize options.ServerOpts; otherwise, it will fail at runtime.

},
PodGroups: []*schedulingv1beta1.PodGroup{
util.BuildPodGroupWithPrio("pg1", "c1", "q1", 1, nil, schedulingv1beta1.PodGroupInqueue, "mid-priority"),
util.BuildPodGroupWithPrio("pg1", "c1", "q1", 1, nil, schedulingv1beta1.PodGroupInqueue, "low-priority"),
Copy link
Contributor

@hajnalmt hajnalmt Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case was fixed on master please rebase, and don't modify it in your commit directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

predicateNodes, _ := predicateHelper.PredicateNodes(task, totalNodes, ssn.PredicateForPreemptAction, false)
assigned := false

for _, n := range predicateNodes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there wasn't any progress in this Issue.

How do you imagine the implementation for this?
Do we want to create a VictimNodeOrderFn implemented in the nodeOrder plugin?

Signed-off-by: GautamBytes <manchandanigautam@gmail.com>
Signed-off-by: guoqinwill <gq411will@163.com>
Signed-off-by: guoqinwill <gq411will@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Need to refactor the reclaim action

5 participants