GPU fragmentation across nodes and Job/Pod rescheduling strategy request #3948
Description
Background
In the Volcano scheduler, the binpack plugin can be configured to maximize the resource usage of individual nodes (i.e., filling a node before allocating to empty nodes). However, because jobs finish at different times, idle GPUs may end up scattered across different nodes, leaving no single node with enough resources for subsequent GPU jobs. For example:
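For context, the binpack plugin prefers nodes that would be more fully utilized after placing a task. Below is a minimal sketch of that idea in Go; it is illustrative only and not the plugin's actual formula (the real plugin weights multiple resource dimensions), and the function name is made up:

```go
package main

import "fmt"

// binpackScore is a simplified illustration of bin-packing node scoring:
// the fuller a node would be after placing the request, the higher it scores,
// so the scheduler fills existing nodes before spreading onto empty ones.
// This is NOT the binpack plugin's exact formula; it only shows the idea.
func binpackScore(requestedGPU, usedGPU, capacityGPU float64) float64 {
	if capacityGPU == 0 || requestedGPU+usedGPU > capacityGPU {
		return 0 // the node cannot hold the request at all
	}
	return (requestedGPU + usedGPU) / capacityGPU * 100
}

func main() {
	// For a 1-GPU request, a node already using 6/8 GPUs outscores an empty node.
	fmt.Println(binpackScore(1, 6, 8)) // 87.5
	fmt.Println(binpackScore(1, 0, 8)) // 12.5
}
```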
- Scenario 1:
  A user has three 8-GPU nodes with the following load distribution:
  - node1: 7 GPUs used (4+2+1)
  - node2: 6 GPUs used (4+2)
  - node3: 7 GPUs used (4+2+1)
  If the user submits a job requiring 4 GPUs, it cannot run: 4 GPUs are free in total (1+2+1), but no single node has more than 2 free (see the sketch after this list).
- Scenario 2:
  A user has eight 8-GPU nodes and schedules seven Deployments, each with nine Pods, where each Pod uses one GPU. The replicas of each Deployment therefore end up spread across different nodes. If the user later deletes some of the Deployments, the freed GPUs are scattered across nodes and the cluster becomes fragmented.
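To make the fragmentation concrete, here is a minimal sketch using Scenario 1's numbers (node names and counts are taken from the scenario; everything else is illustrative). It shows how the cluster can hold enough free GPUs in total while no single node can host the request:

```go
package main

import "fmt"

// Free GPUs per node, following Scenario 1 (each node has 8 GPUs in total).
var freeGPUs = map[string]int{
	"node1": 1, // 8 - 7 used
	"node2": 2, // 8 - 6 used
	"node3": 1, // 8 - 7 used
}

// fits reports whether any single node can host a request of `want` GPUs,
// and also returns the cluster-wide total of free GPUs.
func fits(want int) (onOneNode bool, totalFree int) {
	for _, free := range freeGPUs {
		totalFree += free
		if free >= want {
			onOneNode = true
		}
	}
	return onOneNode, totalFree
}

func main() {
	ok, total := fits(4)
	// Prints: total free GPUs = 4, fits on a single node = false
	fmt.Printf("total free GPUs = %d, fits on a single node = %v\n", total, ok)
}
```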
Expectation
When a Job is Pending, determine whether reallocating running Jobs/Pods could free enough resources to run the pending Job. If it is feasible, restart those Jobs or Pods and migrate them to other nodes.
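A rough sketch of the kind of feasibility check meant here, assuming GPU-only accounting and ignoring gang constraints, affinities, and eviction cost. All type and function names are made up for illustration and are not an existing Volcano API:

```go
package main

import "fmt"

// Pod is a simplified running workload: the node it runs on and its GPU usage.
type Pod struct {
	Node string
	GPUs int
}

// canFreeByMigration checks whether moving some running pods off victimNode
// onto the existing free capacity of other nodes would leave victimNode with
// at least `want` free GPUs. It is a greedy, single-pass sketch of the
// requested behavior ("can reallocating running pods make room for the
// pending job?"); a real implementation would re-run scheduler predicates,
// respect gang constraints, and weigh the cost of each restart.
func canFreeByMigration(pods []Pod, free map[string]int, victimNode string, want int) ([]Pod, bool) {
	moved := []Pod{}
	freed := free[victimNode]
	// Copy the other nodes' free capacity so the simulation is side-effect free.
	remaining := map[string]int{}
	for n, f := range free {
		if n != victimNode {
			remaining[n] = f
		}
	}
	for _, p := range pods {
		if p.Node != victimNode || freed >= want {
			continue
		}
		// Try to place this pod on another node with enough free GPUs.
		for n, f := range remaining {
			if f >= p.GPUs {
				remaining[n] -= p.GPUs
				freed += p.GPUs
				moved = append(moved, p)
				break
			}
		}
	}
	return moved, freed >= want
}

func main() {
	// Scenario 1: moving node1's 2-GPU pod to node2 and its 1-GPU pod to
	// node3 leaves node1 with 4 free GPUs, so a pending 4-GPU job would fit.
	pods := []Pod{{"node1", 4}, {"node1", 2}, {"node1", 1}}
	free := map[string]int{"node1": 1, "node2": 2, "node3": 1}
	moved, ok := canFreeByMigration(pods, free, "node1", 4)
	fmt.Println(moved, ok) // [{node1 2} {node1 1}] true
}
```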
Current Limitations
- Job-level scheduling limitation:
  Simply restarting Jobs does not guarantee that the pending Job gets the resources. Because the Volcano scheduler schedules Jobs one by one, multiple Jobs cannot be rescheduled as a whole.
- Descheduler limitations:
  Neither the Volcano descheduler nor the Kubernetes descheduler has a strategy for this scenario. In the current implementations, the descheduler does not place the replacements for evicted Pods; it relies on the default scheduler for that. However, when deciding which Pods to evict for defragmentation, that decision is better made together with where those Pods will be rescheduled afterwards (see the sketch below).
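To illustrate that last point, the eviction decision would ideally be the output of a placement simulation, so that Pods are only evicted once a concrete target node for each of them (and room for the pending Job) has been found. A hypothetical sketch of such a coupled plan, with made-up names and not an existing descheduler or Volcano API:

```go
package main

import "fmt"

// MigrationPlan couples the eviction decision with the placement decision:
// every Pod chosen for eviction already has a target node, and the plan
// records which node is being freed for the pending job.
type MigrationPlan struct {
	PendingJobNode string            // node being freed for the pending job
	Moves          map[string]string // pod name -> target node
}

// Feasible is a placeholder for re-running the scheduler's predicates against
// the simulated post-eviction state, so the plan is only executed if every
// move and the pending job's placement would actually succeed.
func (p MigrationPlan) Feasible() bool {
	return p.PendingJobNode != "" && len(p.Moves) > 0 // real checks would go here
}

func main() {
	// Plan for Scenario 1: move the 2-GPU and 1-GPU pods off node1.
	plan := MigrationPlan{
		PendingJobNode: "node1",
		Moves: map[string]string{
			"job-a-pod-0": "node2", // hypothetical 2-GPU pod
			"job-b-pod-0": "node3", // hypothetical 1-GPU pod
		},
	}
	fmt.Println("execute plan:", plan.Feasible())
}
```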
My request is similar to the issue described in GPU碎片资源整理 (GPU fragmentation consolidation). I would like to know whether there are any existing solutions or plans to address this problem. I am truly eager to collaborate with you to solve it.