Issue in LessEqual affecting enqueueability decision of proportion plugin (in v1.4.0), requesting hotfix to v1.4.0 #2014

Closed
kye308 opened this issue Feb 11, 2022 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@kye308

kye308 commented Feb 11, 2022

What happened:

When there are no jobs running in a queue, and the queue capability specifies a ScalarResource (e.g. nvidia.com/gpu) in addition to cpu/memory, the proportion plugin always rejects enqueueing new jobs.

Problematic comparison code in proportion plugin:

inqueue := minReq.Add(attr.allocated).Add(attr.inqueue).LessEqual(api.NewResource(queue.Queue.Spec.Capability), api.Infinity)

Bug (?) in LessEqual:

if leftValue == -1 || !lessEqualFunc(leftValue, rightValue, minResource) {

What you expected to happen:

For a queue with no jobs, the allocated and inqueue values were both 0 CPU, 0 Gi Mem.

Our queue specifies a capability of the form

    cpu: X
    ephemeral-storage: Y Gi
    memory: Z Gi
    nvidia.com/gpu: N

nvidia.com/gpu is identified as a ScalarResource and appears to trigger the if condition here:

if leftValue == -1 || !lessEqualFunc(leftValue, rightValue, minResource) {

Thus any submitted jobs will be stuck in Pending state.
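
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not the Volcano source: the `scalars` type and the helpers `fillMissing`, `lessEqualInfinity`, and `lessEqualLeftZero` are hypothetical names, and the sketch assumes that with `api.Infinity` a dimension missing on either side is filled with -1 before the comparison. Under that assumption, an empty request compared against a capability that sets nvidia.com/gpu hits the `leftValue == -1` branch and is rejected. The second function shows one possible alternative (missing request dimensions count as 0); it is only an illustration and not necessarily what #1769 does.

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical constants: -1 mirrors the sentinel in the quoted condition,
// minResource is an assumed comparison tolerance.
const (
	infinity    = -1.0
	minResource = 0.1
)

// scalars is a simplified stand-in for a resource's scalar dimensions.
type scalars map[string]float64

// fillMissing returns a copy of s in which every dimension that exists in
// other but not in s is set to def.
func fillMissing(s, other scalars, def float64) scalars {
	out := scalars{}
	for name, v := range s {
		out[name] = v
	}
	for name := range other {
		if _, ok := out[name]; !ok {
			out[name] = def
		}
	}
	return out
}

// lessEqualValue reports whether l <= r within the tolerance.
func lessEqualValue(l, r float64) bool {
	return l < r || math.Abs(l-r) < minResource
}

// lessEqualInfinity models the assumed behavior: missing dimensions on both
// sides become infinity, and an infinite left value is never "less or equal".
func lessEqualInfinity(left, right scalars) bool {
	l := fillMissing(left, right, infinity)
	r := fillMissing(right, left, infinity)
	for name, lv := range l {
		rv := r[name]
		if rv == infinity {
			continue // unlimited on the right: anything fits
		}
		if lv == infinity || !lessEqualValue(lv, rv) {
			return false
		}
	}
	return true
}

// lessEqualLeftZero is an illustrative alternative: a dimension missing from
// the request counts as 0, one missing from the capability as unlimited.
func lessEqualLeftZero(left, right scalars) bool {
	l := fillMissing(left, right, 0)
	r := fillMissing(right, left, infinity)
	for name, lv := range l {
		rv := r[name]
		if rv == infinity {
			continue
		}
		if !lessEqualValue(lv, rv) {
			return false
		}
	}
	return true
}

func main() {
	request := scalars{}                       // empty queue, job requests no GPU
	capability := scalars{"nvidia.com/gpu": 4} // queue capability sets a GPU limit

	fmt.Println(lessEqualInfinity(request, capability)) // false: job is not enqueued
	fmt.Println(lessEqualLeftZero(request, capability)) // true: job would be enqueued
}
```

The GPU count of 4 is an arbitrary placeholder for N in the capability above.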

How to reproduce it (as minimally and precisely as possible):

Try to schedule any job with the scheduler config below and a queue configured as specified above on a cluster with no running jobs.

Anything else we need to know?:

scheduler config

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: proportion
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: false
      predicate.CacheEnable: false
      predicate.ProportionalEnable: false

Environment:

  • Volcano Version: v1.4.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
kye308 added the kind/bug label Feb 11, 2022
@kye308
Author

kye308 commented Feb 11, 2022

The observed issue was fixed when we applied this patch: #1769. Can we add this to v1.4.0 as well?

@Thor-wl
Contributor

Thor-wl commented Feb 12, 2022

Can we add this to v1.4.0 as well?

Yes, glad to see that.

@Thor-wl
Contributor

Thor-wl commented Feb 12, 2022

/cc @hwdef Can you help with that?

@hwdef
Member

hwdef commented Feb 12, 2022

@Thor-wl
OK, I'll check this. It may have to wait until Monday.

@hwdef
Member

hwdef commented Feb 18, 2022

@kye308 @Thor-wl I think this issue can be closed.

@kye308
Author

kye308 commented Feb 18, 2022

thank you!


4 participants