Reclaim is not working with GPUSharingEnable if resource used is volcano.sh/gpu-memory #2739

Closed · igormish opened this issue on Mar 15, 2023 · 11 comments · Fixed by #2916
Labels: kind/bug, priority/important-soon

What happened:
One node with 480 volcano.sh/gpu-memory (factor 10).
Two queues: high with weight 3 and low with weight 1.
Four jobs, each with requests and limits of volcano.sh/gpu-memory=120, were added to the low queue (weight 1).
After adding a new job with requests and limits of volcano.sh/gpu-memory=120 to the high queue (weight 3), Volcano does not reclaim any of the jobs running in the low queue (weight 1).

What you expected to happen:
I expect one of the jobs in the low queue (weight 1) to move to the Pending state so that the job in the high queue (weight 3) can start.
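
For reference, here is the rough share arithmetic behind that expectation (a rough calculation, assuming the proportion plugin divides the node's capacity among queues by weight):

$$
\text{deserved}_{\text{high}} = 480 \cdot \tfrac{3}{3+1} = 360, \qquad
\text{deserved}_{\text{low}} = 480 \cdot \tfrac{1}{3+1} = 120, \qquad
\text{used}_{\text{low}} = 4 \cdot 120 = 480
$$

The low queue holds well over its deserved share, so reclaim should be able to evict at least one of its 120-memory jobs to admit the high-queue job.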

How to reproduce it (as minimally and precisely as possible):
Node with allocatable resources:

  volcano.sh/gpu-memory:  480
  volcano.sh/gpu-number:  4

Two queues, one with low weight:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: low
spec:
  reclaimable: true
  weight: 1

And one with higher weight:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: high
spec:
  reclaimable: true
  weight: 3

Four jobs to run in the low queue (the //1-4 below stands for jobs 1 through 4):

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/preemptable: "true"
    volcano.sh/revocable-zone: '*'
  labels:
    volcano.sh/preemptable: "true"
  name: low //1-4
  namespace: gpu-cluster
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    env: []
    ssh: []
    svc:
    - --disable-network-policy=true
  policies:
  - action: RestartJob
    event: PodEvicted
  - action: RestartJob
    event: PodFailed
  queue: low
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: low //1-4
    replicas: 1
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
          volcano.sh/revocable-zone: '*'
        labels:
          volcano.sh/preemptable: "true"
        name: web
      spec:
        containers:
        - image: docker.io/library/nginx:1.14.2
          name: nginx
          resources:
            limits:
              volcano.sh/gpu-memory: 120
            requests:
              volcano.sh/gpu-memory: 120
        tolerations:
        - effect: NoSchedule
          key: gpu_cluster
          operator: Equal
          value: "true"  

Job to run in the high queue:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/preemptable: "true"
    volcano.sh/revocable-zone: '*'
  labels:
    volcano.sh/preemptable: "true"
  name: high
  namespace: gpu-cluster
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    env: []
    ssh: []
    svc:
    - --disable-network-policy=true
  policies:
  - action: RestartJob
    event: PodEvicted
  - action: RestartJob
    event: PodFailed
  queue: high
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: high
    replicas: 1
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
          volcano.sh/revocable-zone: '*'
        labels:
          volcano.sh/preemptable: "true"
        name: web
      spec:
        containers:
        - image: docker.io/library/nginx:1.14.2
          name: nginx
          resources:
            limits:
              volcano.sh/gpu-memory: 120
            requests:
              volcano.sh/gpu-memory: 120
        tolerations:
        - effect: NoSchedule
          key: gpu_cluster
          operator: Equal
          value: "true"  

Anything else we need to know?:

Environment:

  • Volcano Version: master

  • Kubernetes version: v1.22.17

  • Install tools: helm

  • Others:
    scheduler.conf:

actions: "enqueue, allocate, reclaim, preempt "
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: drf
- plugins:
  - name: proportion
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable gpu sharing.
  - name: drf
  - name: overcommit
    arguments:
      overcommit-factor: 2.0
  - name: nodeorder
  - name: binpack
igormish added the kind/bug label on Mar 15, 2023
igormish (Author) commented on Mar 15, 2023

The issue is in the checkNodeGPUSharingPredicate function (see link), at this line:

ids := predicateGPUbyMemory(pod, gs) 

And also in FilterNode (see link):

// FilterNode checks if the 'pod' fits in the current node
FilterNode(pod *v1.Pod) (bool, error)

During reclaim it should check whether the pod can fit on the node at all (total capacity), not whether there is currently enough of the resource left unused, which is what the Allocate function in device_info does; that is, reclaim needs different functionality from FilterNode.
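
To make the failure mode concrete, here is a minimal sketch (not Volcano's actual code; the data layout and helper name are illustrative only) of what a free-memory-based predicate sees during reclaim on the node above:

```go
// A minimal sketch of a free-memory-based predicate check during reclaim.
// The data layout and helper name are made up for illustration.
package main

import "fmt"

// The node from this issue: 4 GPUs with 120 gpu-memory each (480 total),
// all of it currently held by jobs in the low queue, so nothing is free.
var gpuFreeMemory = map[int]int64{0: 0, 1: 0, 2: 0, 3: 0}

// A predicateGPUbyMemory-style check: is there a GPU with enough *free* memory?
func hasGPUWithFreeMemory(request int64) bool {
	for _, free := range gpuFreeMemory {
		if free >= request {
			return true
		}
	}
	return false
}

func main() {
	const request = 120 // the high-queue job's volcano.sh/gpu-memory request

	// During reclaim no victim has been evicted yet, so every GPU reports
	// 0 free memory and the node fails the predicate.
	fmt.Println("node passes predicate during reclaim:", hasGPUWithFreeMemory(request))
}
```

Because the check only looks at what is currently free, the node is rejected before reclaim ever gets to pick victims, even though evicting a single low-queue pod would free the required 120.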

wangyang0616 (Member) commented

Checking in the predicate whether the node's GPU resources are sufficient will indeed affect the preempt and reclaim functions. Is there a better solution? @archlitchi

igormishsky commented on Apr 17, 2023

@archlitchi @wangyang0616 I will be happy to work on it as we need it to work in our production.

wangyang0616 (Member) commented

> @archlitchi @wangyang0616 I will be happy to work on it as we need it to work in our production.

Thank you very much for your contribution; we look forward to your proposal. @igormishsky

wangyang0616 (Member) commented

/priority important-soon

jiangkaihua (Contributor) commented

In my opinion, it would be better to split the original predicateFn into two parts: predicateWithResFn and predicateWithOtherFn. In predicateWithResFn, Volcano would filter nodes by resources such as cpu, memory, gpu-memory, etc.; in predicateWithOtherFn, Volcano would filter nodes for resource-unrelated reasons such as affinity, nodePort, podTopology, etc.

For this issue, we then only need to invoke different parts of predicateFn in different actions. For example, in the allocate action a pod must go through both predicateWithResFn and predicateWithOtherFn to make sure the node is suitable for allocation; but in the preempt and reclaim actions, going through predicateWithOtherFn alone is enough, since Volcano will obtain sufficient resources by evicting pods on the node automatically.
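
A minimal sketch of that split (hypothetical types and function names, not Volcano's actual API), showing how allocate and preempt/reclaim would consume the two halves differently:

```go
// A minimal sketch of the proposed split: resource-related checks vs.
// resource-unrelated checks, consumed differently per scheduler action.
// Types and names are hypothetical stand-ins, not Volcano's real API.
package main

import "fmt"

// NodeInfo and PodRequest are simplified stand-ins for the scheduler's node
// and task representations.
type NodeInfo struct {
	IdleGPUMemory int64 // currently free volcano.sh/gpu-memory
	Tainted       bool  // stand-in for taint/affinity/nodePort conditions
}

type PodRequest struct {
	GPUMemory     int64
	TolerateTaint bool
}

// predicateWithResFn: resource-related filtering (cpu, mem, gpu-memory, ...).
// Only meaningful when the pod has to fit into what is idle right now.
func predicateWithResFn(p PodRequest, n NodeInfo) bool {
	return p.GPUMemory <= n.IdleGPUMemory
}

// predicateWithOtherFn: resource-unrelated filtering (affinity, nodePort,
// podTopology, taints, ...).
func predicateWithOtherFn(p PodRequest, n NodeInfo) bool {
	return !n.Tainted || p.TolerateTaint
}

// nodeIsCandidate shows how the two halves would be combined per action.
func nodeIsCandidate(action string, p PodRequest, n NodeInfo) bool {
	switch action {
	case "allocate":
		// allocate: the pod must fit immediately, so run both halves.
		return predicateWithOtherFn(p, n) && predicateWithResFn(p, n)
	case "preempt", "reclaim":
		// preempt/reclaim: idle resources are irrelevant, because the
		// scheduler frees them by evicting victims on the node.
		return predicateWithOtherFn(p, n)
	default:
		return false
	}
}

func main() {
	// The node from this issue: all 480 gpu-memory held by low-queue jobs.
	node := NodeInfo{IdleGPUMemory: 0}
	pod := PodRequest{GPUMemory: 120}

	fmt.Println("allocate candidate:", nodeIsCandidate("allocate", pod, node)) // false: nothing idle
	fmt.Println("reclaim candidate: ", nodeIsCandidate("reclaim", pod, node))  // true: eviction frees 120
}
```

One nuance: preempt and reclaim would presumably still want to skip nodes whose total capacity can never satisfy the request; that distinction is essentially the predicateFit/predicateResource naming discussion later in this thread.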

wangyang0616 (Member) commented

I understand that this is a general problem, not just for GPUs: for other algorithm plugins as well, if resource-related filtering logic is added to the predicate, it will affect the preemption function.

I made a fix following @jiangkaihua's approach: the resource-related filtering logic is separated out of the predicate function and defined as predicateResource. The allocate phase executes both predicate and predicateResource, while preempt and reclaim execute only predicate. This resolves this class of problem.

Please help to review the PR: #2818

igormishsky commented on May 7, 2023

> I understand that this is a general problem, not just for GPUs: for other algorithm plugins as well, if resource-related filtering logic is added to the predicate, it will affect the preemption function.
>
> I made a fix following @jiangkaihua's approach: the resource-related filtering logic is separated out of the predicate function and defined as predicateResource. The allocate phase executes both predicate and predicateResource, while preempt and reclaim execute only predicate. This resolves this class of problem.
>
> Please help to review the PR: #2818

It is indeed sensible to differentiate between preempt and allocate, as they are not equivalent. To confirm my understanding: in the GPU context, the predicate should check whether the node has sufficient GPU capacity at all, while the resource predicate should check the availability of unused resources on the node. If I have misunderstood, please correct me.

Regarding the naming, the current predicate is more of a "predicateFit" since it checks if the pod can fit into the node, whereas the predicateResource is the actual predicate being used at the moment (e.g., assessing "free" GPU resources).

Naming aside, we hope this can be merged smoothly, as it is straightforward and should not cause side effects: existing behavior is unchanged, and further adjustments would only come after the new predicate is implemented.

igormishsky commented

@wangyang0616 @jiangkaihua Any update? Can we continue with the solution proposed by @wangyang0616?
This is a critical issue that I hope will be sorted out, as it is causing problems in our production.

wangyang0616 (Member) commented

Yes, this is a serious problem. I will make some adjustments to the method naming in the PR, and then please help to review.

wangyang0616 (Member) commented

/assign
