[PartialAdmission] Fix preemption while partially admitting. #2826

Merged
k8s-ci-robot merged 4 commits into kubernetes-sigs:main on Sep 23, 2024

Conversation

trasc
Contributor

@trasc trasc commented Aug 12, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Fix the flow that allows a workload to be partially admitted while preempting lower-priority workloads.
  • Add a unit test for preemption while partially admitting.
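
For readers unfamiliar with the flow, here is a rough Go sketch of what "partially admitting while preempting" means. It is not Kueue's scheduler code: the `running` type, the `largestAdmissibleCount` function, and the single pod-count quota are all hypothetical simplifications. The point it illustrates is that the fit check has to be evaluated for each candidate (reduced) pod count, not for the originally requested count.

```go
package main

import (
	"fmt"
	"sort"
)

// running is a hypothetical, simplified view of an already admitted workload.
type running struct {
	priority int
	pods     int
}

// largestAdmissibleCount returns the largest pod count in [minCount, requested]
// that fits in quota once all strictly lower-priority workloads are preempted.
// The boolean is false when even minCount does not fit.
func largestAdmissibleCount(quota, minCount, requested, priority int, admitted []running) (int, bool) {
	used, preemptible := 0, 0
	for _, r := range admitted {
		used += r.pods
		if r.priority < priority {
			preemptible += r.pods
		}
	}
	// Capacity available to the incoming workload after preemption.
	available := quota - used + preemptible

	// The check depends on the candidate count, not on the requested count.
	fitsAt := func(count int) bool { return count <= available }
	if !fitsAt(minCount) {
		return 0, false
	}

	// sort.Search returns the first offset whose count does NOT fit;
	// the count just before it is the largest one that does.
	n := requested - minCount + 1
	i := sort.Search(n, func(i int) bool { return !fitsAt(minCount + i) })
	return minCount + i - 1, true
}

func main() {
	admitted := []running{{priority: 0, pods: 3}, {priority: 10, pods: 4}}
	count, ok := largestAdmissibleCount(9, 3, 9, 5, admitted)
	fmt.Println(count, ok) // 5 true
}
```

In this example the incoming workload (priority 5, 9 pods requested, minimum 3) can preempt only the 3-pod workload, so 5 of the 9 quota units are available and it is admitted with 5 pods.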

Which issue(s) this PR fixes:

Fixes #2799

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix some scenarios for partial admission which are affected by wrong calculation of resources
used by the incoming workload which is partially admitted and preempting.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. labels Aug 12, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 12, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 12, 2024

netlify bot commented Aug 12, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 6fa2dca
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66e9896ab4da4400089409d6
😎 Deploy Preview https://deploy-preview-2826--kubernetes-sigs-kueue.netlify.app

@trasc
Contributor Author

trasc commented Aug 12, 2024

/cc @gabesaba

@mimowo
Contributor

mimowo commented Aug 13, 2024

Please add a release note describing what's changed from the user's perspective.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 13, 2024
@trasc
Contributor Author

trasc commented Aug 13, 2024

Please add a release note describing what's changed from the user's perspective.

done

@gabesaba
Contributor

/assign

@trasc
Contributor Author

trasc commented Sep 17, 2024

@gabesaba @mimowo is there anything else we should do on this?

@mimowo
Contributor

mimowo commented Sep 17, 2024

I support the ask in comment.

pkg/scheduler/scheduler.go (outdated, resolved)
@mimowo
Contributor

mimowo commented Sep 17, 2024

@trasc Could you describe which scenario is fixed here, and what the consequence is from the user's perspective when it happens?

Please update the release note with a short description of the problematic scenario.

It will also help us to guide the decision about cherry-picking.

@mimowo
Contributor

mimowo commented Sep 17, 2024

@gabesaba is it fair to close the original issue #2799 when this PR is merged?

@trasc
Contributor Author

trasc commented Sep 17, 2024

@trasc Could you describe which scenario is fixed here, and what the consequence is from the user's perspective when it happens?

Please update the release note with a short description of the problematic scenario.

It will also help us to guide the decision about cherry-picking.

The scenario is preemption during partial admission.

@mimowo
Contributor

mimowo commented Sep 17, 2024

The scenario is preemption during partial admission.

Can you provide a few more details? In particular: is the preemption not working at all, erroring, panicking, or working sub-optimally?
Also: were the issues deterministic on every preemption, or occasional?

@mimowo mimowo closed this Sep 17, 2024
@mimowo
Contributor

mimowo commented Sep 17, 2024

/reopen
misclicked by mistake

@k8s-ci-robot k8s-ci-robot reopened this Sep 17, 2024
@k8s-ci-robot
Contributor

@mimowo: Reopened this PR.

In response to this:

/reopen
misclicked by mistake

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@alculquicondor
Contributor

/cc

@mimowo
Contributor

mimowo commented Sep 20, 2024

The scenario is preemption during partial admission.

I'm testing locally the impact of the bug and it seems to go beyond preemption.

For example:

  1. I have a ClusterQueue (CQ) allowing for 9 pods (preemption disabled completely).
  2. I created 2 small Jobs, each consuming 3 pods, so 6 pods total.
  3. A big Job comes in requesting 9 pods, but using partial admission with a minimum of 3 pods.
  4. 3 pods are created (correctly), but the CQ misreports usage as 15 units (6 + 9).
  5. When the small Jobs are deleted (or finished), the big Job continues to use 3 pods, but reports consuming 9 units.
  6. I create new small Jobs requesting 3 units each, which should fit, but they are suspended.
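
The arithmetic in the steps above can be sketched in Go as follows. This is only an illustration under simplifying assumptions, not Kueue code: the `workload` type and the `usageByRequested`, `usageByAdmitted`, and `fits` helpers are hypothetical names, and quota is reduced to a single pod count.

```go
package main

import "fmt"

// workload is a hypothetical, simplified stand-in for an admitted workload.
type workload struct {
	name      string
	requested int // pods requested in the Job spec
	admitted  int // pods actually admitted (lower under partial admission)
}

// usageByRequested reproduces the misreporting described above:
// every workload is charged for its requested pod count.
func usageByRequested(ws []workload) int {
	total := 0
	for _, w := range ws {
		total += w.requested
	}
	return total
}

// usageByAdmitted charges each workload only for the pods it was admitted with.
func usageByAdmitted(ws []workload) int {
	total := 0
	for _, w := range ws {
		total += w.admitted
	}
	return total
}

// fits reports whether a new request fits in the quota given the current usage.
func fits(quota, usage, request int) bool {
	return usage+request <= quota
}

func main() {
	const quota = 9
	ws := []workload{
		{name: "small-1", requested: 3, admitted: 3},
		{name: "small-2", requested: 3, admitted: 3},
		{name: "big", requested: 9, admitted: 3}, // partially admitted
	}
	fmt.Println(usageByRequested(ws)) // 15, the misreported value (6 + 9)
	fmt.Println(usageByAdmitted(ws))  // 9, the pods actually running

	// After the small Jobs finish, only the big Job remains.
	remaining := ws[2:]
	fmt.Println(fits(quota, usageByRequested(remaining), 3)) // false: 9 + 3 > 9
	fmt.Println(fits(quota, usageByAdmitted(remaining), 3))  // true: 3 + 3 <= 9
}
```

With usage computed from the requested count, a new 3-pod Job is rejected even though 6 of the 9 units are actually free, which matches the suspended Jobs in step 6.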

I have yet to determine the scope of the problem and the fix, but it seems serious enough to justify cherry-picking.

@trasc
Contributor Author

trasc commented Sep 20, 2024

The scenario is preemption during partial admission.

I'm testing locally the impact of the bug and it seems to go beyond preemption.

For example:

  1. I have a ClusterQueue (CQ) allowing for 9 pods (preemption disabled completely).
  2. I created 2 small Jobs, each consuming 3 pods, so 6 pods total.
  3. A big Job comes in requesting 9 pods, but using partial admission with a minimum of 3 pods.
  4. 3 pods are created (correctly), but the CQ misreports usage as 15 units (6 + 9).
  5. When the small Jobs are deleted (or finished), the big Job continues to use 3 pods, but reports consuming 9 units.
  6. I create new small Jobs requesting 3 units each, which should fit, but they are suspended.

I have yet to determine the scope of the problem and the fix, but it seems serious enough to justify cherry-picking.

I think this is a different issue. Please open a new issue on GitHub and feel free to assign it to me.

@mimowo
Contributor

mimowo commented Sep 20, 2024

Ok, there you go: #3108

@mimowo
Contributor

mimowo commented Sep 20, 2024

Regarding the fix here, I played with it a bit more and observed that the feature generally worked before (no panics, no errors), but there are differences in some scenarios, and they are really hard for me to describe briefly (the other issue also makes it hard to test properly, because it is present with preemption enabled as well).

Still, IIUC the culprit for the issue is the wrong calculation of resources requested by the incoming workload. I would like to capture this in the release note as below.

/release-note-edit
```release-note
Fix some scenarios for partial admission which are affected by wrong calculation of resources
used by the incoming workload which is partially admitted and preempting.
```

Going forward, it would be super helpful to have integration tests for this feature; otherwise it is hard to judge how the unit tests translate into the e2e behavior, imo.

Regarding the cherry-pick, no strong view here, but I think we don't need to, given that we don't have customers / users affected by the issue or demanding the fix. Additionally, the other issue opened above still makes it hard to fully evaluate the fix. We can re-evaluate cherry-picking them together, or when we have users demanding the fix.

@mimowo
Contributor

mimowo commented Sep 20, 2024

LGTM, except for the point raised here, but I can accept it as is. I'm just waiting a bit for @alculquicondor in case he has something to add or other ideas for improvement.

@mimowo
Contributor

mimowo commented Sep 23, 2024

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 23, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: bfea979511531e090af34379b6573419f2bde216

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 23, 2024
@k8s-ci-robot k8s-ci-robot merged commit f378233 into kubernetes-sigs:main Sep 23, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Sep 23, 2024
@mbobrovskyi mbobrovskyi deleted the partial-admission-preempt branch September 24, 2024 06:34
@trasc
Contributor Author

trasc commented Oct 9, 2024

/cherrypick release-0.8

@k8s-infra-cherrypick-robot
Contributor

@trasc: new pull request created: #3205

In response to this:

/cherrypick release-0.8


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Partial Admission Preemption Panic
6 participants