Reclaim is not working with GPUSharingEnable if resource used is volcano.sh/gpu-memory #2739

Closed · igormish opened this issue on Mar 15, 2023 · 11 comments · Fixed by #2916
Labels: kind/bug, priority/important-soon

What happened:
One node with 480 volcano.sh/gpu-memory (factor 10).
Two queues: high with weight 3 and low with weight 1.
Four jobs, each with requests and limits of volcano.sh/gpu-memory=120, were added to the low queue (weight 1).
After adding a new job with requests and limits of volcano.sh/gpu-memory=120 to the high queue (weight 3), Volcano does not reclaim any of the jobs running in the low queue (weight 1).

What you expected to happen:
I expect one of the jobs in the low queue (weight 1) to move to the Pending state so that the job in the high queue (weight 3) can start.
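
For reference, here is the rough share arithmetic behind that expectation (a rough calculation, assuming the proportion plugin divides the node's capacity among queues by weight):

$$
\text{deserved}_{\text{high}} = 480 \cdot \tfrac{3}{3+1} = 360, \qquad
\text{deserved}_{\text{low}} = 480 \cdot \tfrac{1}{3+1} = 120, \qquad
\text{used}_{\text{low}} = 4 \cdot 120 = 480
$$

The low queue holds well over its deserved share, so reclaim should be able to evict at least one of its 120-memory jobs to admit the high-queue job.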

How to reproduce it (as minimally and precisely as possible):
Node with allocatable resources:

  volcano.sh/gpu-memory:  480
  volcano.sh/gpu-number:  4

Two queues, one with low weight:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: low
spec:
  reclaimable: true
  weight: 1

And one with higher weight:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: high
spec:
  reclaimable: true
  weight: 3

Four jobs to run in the low queue (the //1-4 below stands for jobs 1 through 4):

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/preemptable: "true"
    volcano.sh/revocable-zone: '*'
  labels:
    volcano.sh/preemptable: "true"
  name: low //1-4
  namespace: gpu-cluster
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    env: []
    ssh: []
    svc:
    - --disable-network-policy=true
  policies:
  - action: RestartJob
    event: PodEvicted
  - action: RestartJob
    event: PodFailed
  queue: low
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: low //1-4
    replicas: 1
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
          volcano.sh/revocable-zone: '*'
        labels:
          volcano.sh/preemptable: "true"
        name: web
      spec:
        containers:
        - image: docker.io/library/nginx:1.14.2
          name: nginx
          resources:
            limits:
              volcano.sh/gpu-memory: 120
            requests:
              volcano.sh/gpu-memory: 120
        tolerations:
        - effect: NoSchedule
          key: gpu_cluster
          operator: Equal
          value: "true"  

Job to run in the high queue:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/preemptable: "true"
    volcano.sh/revocable-zone: '*'
  labels:
    volcano.sh/preemptable: "true"
  name: high
  namespace: gpu-cluster
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    env: []
    ssh: []
    svc:
    - --disable-network-policy=true
  policies:
  - action: RestartJob
    event: PodEvicted
  - action: RestartJob
    event: PodFailed
  queue: high
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: high
    replicas: 1
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
          volcano.sh/revocable-zone: '*'
        labels:
          volcano.sh/preemptable: "true"
        name: web
      spec:
        containers:
        - image: docker.io/library/nginx:1.14.2
          name: nginx
          resources:
            limits:
              volcano.sh/gpu-memory: 120
            requests:
              volcano.sh/gpu-memory: 120
        tolerations:
        - effect: NoSchedule
          key: gpu_cluster
          operator: Equal
          value: "true"  

Anything else we need to know?:

Environment:

  • Volcano Version: master

  • Kubernetes version: v1.22.17

  • Install tools: helm

  • Others:
    scheduler.conf:

actions: "enqueue, allocate, reclaim, preempt "
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: drf
- plugins:
  - name: proportion
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable gpu sharing.
  - name: drf
  - name: overcommit
    arguments:
      overcommit-factor: 2.0
  - name: nodeorder
  - name: binpack
igormish added the kind/bug label on Mar 15, 2023
igormish (Author) commented on Mar 15, 2023

The issue is in the checkNodeGPUSharingPredicate function (see link), at this line:

ids := predicateGPUbyMemory(pod, gs) 

And also in FilterNode (see link):

// FilterNode checks if the 'pod' fits in the current node
FilterNode(pod *v1.Pod) (bool, error)

During reclaim it should check whether the pod can fit on the node at all (total capacity), not whether there is currently enough of the resource left unused, which is what the Allocate function in device_info does; that is, reclaim needs different functionality from FilterNode.
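
To make the failure mode concrete, here is a minimal sketch (not Volcano's actual code; the data layout and helper name are illustrative only) of what a free-memory-based predicate sees during reclaim on the node above:

```go
// A minimal sketch of a free-memory-based predicate check during reclaim.
// The data layout and helper name are made up for illustration.
package main

import "fmt"

// The node from this issue: 4 GPUs with 120 gpu-memory each (480 total),
// all of it currently held by jobs in the low queue, so nothing is free.
var gpuFreeMemory = map[int]int64{0: 0, 1: 0, 2: 0, 3: 0}

// A predicateGPUbyMemory-style check: is there a GPU with enough *free* memory?
func hasGPUWithFreeMemory(request int64) bool {
	for _, free := range gpuFreeMemory {
		if free >= request {
			return true
		}
	}
	return false
}

func main() {
	const request = 120 // the high-queue job's volcano.sh/gpu-memory request

	// During reclaim no victim has been evicted yet, so every GPU reports
	// 0 free memory and the node fails the predicate.
	fmt.Println("node passes predicate during reclaim:", hasGPUWithFreeMemory(request))
}
```

Because the check only looks at what is currently free, the node is rejected before reclaim ever gets to pick victims, even though evicting a single low-queue pod would free the required 120.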

wangyang0616 (Member) commented

Checking in the predicate whether the node's GPU resources are sufficient will indeed affect the preempt and reclaim functions. Is there a better solution? @archlitchi

igormishsky commented on Apr 17, 2023

@archlitchi @wangyang0616 I will be happy to work on it as we need it to work in our production.

wangyang0616 (Member) commented

> @archlitchi @wangyang0616 I will be happy to work on it as we need it to work in our production.

Thank you very much for your contribution; we look forward to your proposal. @igormishsky

wangyang0616 (Member) commented

/priority important-soon

jiangkaihua (Contributor) commented

In my opinion, it would be better to split the original predicateFn into two parts: predicateWithResFn and predicateWithOtherFn. In predicateWithResFn, Volcano would filter nodes by resources such as cpu, memory, gpu-memory, etc.; in predicateWithOtherFn, Volcano would filter nodes for resource-unrelated reasons such as affinity, nodePort, podTopology, etc.

For this issue, we then only need to invoke different parts of predicateFn in different actions. For example, in the allocate action a pod must go through both predicateWithResFn and predicateWithOtherFn to make sure the node is suitable for allocation; but in the preempt and reclaim actions, going through predicateWithOtherFn alone is enough, since Volcano will obtain sufficient resources by evicting pods on the node automatically.
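
A minimal sketch of that split (hypothetical types and function names, not Volcano's actual API), showing how allocate and preempt/reclaim would consume the two halves differently:

```go
// A minimal sketch of the proposed split: resource-related checks vs.
// resource-unrelated checks, consumed differently per scheduler action.
// Types and names are hypothetical stand-ins, not Volcano's real API.
package main

import "fmt"

// NodeInfo and PodRequest are simplified stand-ins for the scheduler's node
// and task representations.
type NodeInfo struct {
	IdleGPUMemory int64 // currently free volcano.sh/gpu-memory
	Tainted       bool  // stand-in for taint/affinity/nodePort conditions
}

type PodRequest struct {
	GPUMemory     int64
	TolerateTaint bool
}

// predicateWithResFn: resource-related filtering (cpu, mem, gpu-memory, ...).
// Only meaningful when the pod has to fit into what is idle right now.
func predicateWithResFn(p PodRequest, n NodeInfo) bool {
	return p.GPUMemory <= n.IdleGPUMemory
}

// predicateWithOtherFn: resource-unrelated filtering (affinity, nodePort,
// podTopology, taints, ...).
func predicateWithOtherFn(p PodRequest, n NodeInfo) bool {
	return !n.Tainted || p.TolerateTaint
}

// nodeIsCandidate shows how the two halves would be combined per action.
func nodeIsCandidate(action string, p PodRequest, n NodeInfo) bool {
	switch action {
	case "allocate":
		// allocate: the pod must fit immediately, so run both halves.
		return predicateWithOtherFn(p, n) && predicateWithResFn(p, n)
	case "preempt", "reclaim":
		// preempt/reclaim: idle resources are irrelevant, because the
		// scheduler frees them by evicting victims on the node.
		return predicateWithOtherFn(p, n)
	default:
		return false
	}
}

func main() {
	// The node from this issue: all 480 gpu-memory held by low-queue jobs.
	node := NodeInfo{IdleGPUMemory: 0}
	pod := PodRequest{GPUMemory: 120}

	fmt.Println("allocate candidate:", nodeIsCandidate("allocate", pod, node)) // false: nothing idle
	fmt.Println("reclaim candidate: ", nodeIsCandidate("reclaim", pod, node))  // true: eviction frees 120
}
```

One nuance: preempt and reclaim would presumably still want to skip nodes whose total capacity can never satisfy the request; that distinction is essentially the predicateFit/predicateResource naming discussion later in this thread.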

wangyang0616 (Member) commented

I understand that this is a general problem, not just for GPUs: for other algorithm plugins as well, if resource-related filtering logic is added to the predicate, it will affect the preemption function.

I made a fix following @jiangkaihua's approach: the resource-related filtering logic is separated out of the predicate function and defined as predicateResource. The allocate phase executes both predicate and predicateResource, while preempt and reclaim execute only predicate. This resolves this class of problem.

Please help to review the PR: #2818

igormishsky commented on May 7, 2023

> I understand that this is a general problem, not just for GPUs: for other algorithm plugins as well, if resource-related filtering logic is added to the predicate, it will affect the preemption function.
>
> I made a fix following @jiangkaihua's approach: the resource-related filtering logic is separated out of the predicate function and defined as predicateResource. The allocate phase executes both predicate and predicateResource, while preempt and reclaim execute only predicate. This resolves this class of problem.
>
> Please help to review the PR: #2818

It is indeed sensible to differentiate between preempt and allocate, as they are not equivalent. To confirm my understanding: in the GPU context, the predicate should check whether the node has sufficient GPU capacity at all, while the resource predicate should check the availability of unused resources on the node. If I have misunderstood, please correct me.

Regarding the naming, the current predicate is more of a "predicateFit" since it checks if the pod can fit into the node, whereas the predicateResource is the actual predicate being used at the moment (e.g., assessing "free" GPU resources).

Naming aside, we hope this can be merged smoothly, as it is straightforward and should not cause side effects: existing behavior is unchanged, and further adjustments would only come after the new predicate is implemented.

igormishsky commented

@wangyang0616 @jiangkaihua Any update? Can we continue with the solution proposed by @wangyang0616?
This is a critical issue that I hope will be sorted out, as it is causing problems in our production.

wangyang0616 (Member) commented

Yes, this is a serious problem. I will make some adjustments to the method naming in the PR, and then please help to review.

wangyang0616 (Member) commented

/assign
