@GautamBytes
Contributor

What type of PR is this?

/kind improvement /kind bug

What this PR does / why we need it:

This PR fixes a critical bug in the scheduler's reclaim action where an entire queue could be incorrectly skipped during preemption.

The previous logic would evaluate only the first task of a starving job. If that single task failed a preliminary check (e.g., its preemptionPolicy was set to Never), the scheduler would discard the entire queue for that cycle. This prevented other valid, preemptable tasks in the same job, or other jobs in the same queue, from ever being considered for reclamation.

To fix this, the reclaim action's main loop has been refactored to mirror the robust nested queue -> job -> task structure found in the allocate action. This ensures:

  • All candidate tasks within a job are correctly evaluated.
  • The scheduler's reclaim logic is fairer and more correct: one task that cannot preempt no longer blocks the rest of its job or queue.
  • Performance is improved by removing an inefficient pop/push cycle.
  • Code consistency across core scheduler actions is increased.
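The difference between the two loop shapes can be sketched in a few lines of Go. This is a simplified model, not the actual Volcano code: the `Task`, `Job`, and `Queue` types, the `CanPreempt` field (standing in for a check such as preemptionPolicy being set to Never), and both function names are hypothetical, and the old pop/push cycle is approximated by dropping the rest of a queue at the first task that fails the check.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the scheduler's types.
type Task struct {
	Name       string
	CanPreempt bool // false models a task whose preemptionPolicy is Never
}

type Job struct {
	Name  string
	Tasks []Task
}

type Queue struct {
	Name string
	Jobs []Job
}

// reclaimFirstFailureDropsQueue models the old behavior described above:
// as soon as one task fails the preliminary check, the rest of the queue
// is skipped for this cycle.
func reclaimFirstFailureDropsQueue(queues []Queue) []string {
	var candidates []string
	for _, q := range queues {
	jobs:
		for _, j := range q.Jobs {
			for _, t := range j.Tasks {
				if !t.CanPreempt {
					break jobs // the whole queue is discarded
				}
				candidates = append(candidates, t.Name)
			}
		}
	}
	return candidates
}

// reclaimNested models the refactored queue -> job -> task loop:
// a failing task is skipped on its own, and evaluation continues.
func reclaimNested(queues []Queue) []string {
	var candidates []string
	for _, q := range queues {
		for _, j := range q.Jobs {
			for _, t := range j.Tasks {
				if !t.CanPreempt {
					continue // skip only this task
				}
				candidates = append(candidates, t.Name)
			}
		}
	}
	return candidates
}

func main() {
	queues := []Queue{{
		Name: "q1",
		Jobs: []Job{
			{Name: "job-a", Tasks: []Task{
				{Name: "a-0", CanPreempt: false},
				{Name: "a-1", CanPreempt: true},
			}},
			{Name: "job-b", Tasks: []Task{
				{Name: "b-0", CanPreempt: true},
			}},
		},
	}}
	fmt.Println("old:", reclaimFirstFailureDropsQueue(queues)) // old: []
	fmt.Println("new:", reclaimNested(queues))                 // new: [a-1 b-0]
}
```

In this toy input the first task of job-a cannot preempt, so the old-style loop finds no candidates at all, while the nested loop still considers a-1 and b-0.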

Which issue(s) this PR fixes:

Fixes #3738

Special notes for your reviewer:

The core of this change is refactoring the reclaim action's main loop to align with the existing, proven pattern in the allocate action. This not only fixes the bug but also improves code consistency and performance. A new unit test has been added to specifically cover this failure scenario.

Does this PR introduce a user-facing change?

None

Signed-off-by: GautamBytes <manchandanigautam@gmail.com>
@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 25, 2025
@volcano-sh-bot
Contributor

@GautamBytes: The label(s) kind/improvement, kind//kind cannot be applied. These labels are supported: ``

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 25, 2025
Signed-off-by: GautamBytes <manchandanigautam@gmail.com>
@GautamBytes
Contributor Author

/assign @JesseStutler

@JesseStutler, can you help me figure out why the code_verify workflow is failing? I am having a hard time fixing it. Every time I change something to fix it, it shows a completely new error. This is not happening with my other PRs.

@lowang-bh
Member

Please split this PR into separate PRs or commits, each implementing only one change, such as the refactor or the bug fix.

@JesseStutler
Member

/cc
/priority high

@kev1N916

@JesseStutler I've opened a pull request to solve this issue at #4634

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 25, 2025
@volcano-sh-bot
Contributor

@GautamBytes: PR needs rebase.


@JesseStutler
Member

@GautamBytes Hi, are you still working on this? You need to rebase onto the latest code and fix all the CI failures.

@GautamBytes
Contributor Author

GautamBytes commented Oct 25, 2025

@JesseStutler Sorry, I got an internship and am currently busy with it. It would be great if someone could take over my PR, or ask for commit access, and get it merged!

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • priority/high
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Need to refactor the reclaim action

5 participants