Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support preempting BestEffort pods when the pods number of nodes reaches the upper limit #3335

Merged
merged 1 commit into from
Mar 11, 2024

Conversation

Lily922
Copy link
Contributor

@Lily922 Lily922 commented Mar 4, 2024

fix #3336

Design:
Ignore pending BestEffort pods when checking if the Job is starving.

1. "allocate" action:
   if resources.IsEmpty() :    // task with empty requsted resource:
       skip to schedule it
   else:
       try to schedule it
   if job.IsReady():               // job.ReadyTaskNum()+job.PendingBestEffortTaskNum() >= ji.MinAvailable
       job.Commit()
      

2. "preempt" action:
   if job.Starving():            // job.WaitingTaskNum()+job.ReadyTaskNum() < job.MinAvailable
      try to preempt:
           if !task.IsEmpty() && preemptor.IsEmpty():
               ignore task
           else:
               add task to victims queue
   if job.IsPipeline():         // job.WaitingTaskNum()+job.ReadyTaskNum()+job.PendingBestEffortTaskNum() >= job.MinAvailable
       job.Commit()

4. "backfill" action:
   sortPendingOrPipelinedBestEffortPods()
   if resources.IsEmpty() :    // task with empty requsted resource:
       try to schedule it
   else:
       skip to schedule it
              
  1. Pods which request none resource except "pods" number resource(BestEffort pods) will be scheduled in "backfill" action
  2. Pods request any resource including scalar resources(unBestEffort pods) will be scheduled in "allocate" action
  3. All pending pods can become a preemptor in "preempt" action, even if it is a BestEffort pod. If the preemptor is BestEffort pod, unBestEffort pod can never be the victim
  4. Don't count pending BestEffort pods when checking if the job is starving, otherwise the pending BestEffort pods can never trigger preemption
  5. Keep counting pending BestEffort pods when checking if the job is pipeline/ready, because the bug fix :Considering best-effort pods when calculating ready task number #647
  6. Pending pods will be sorted by priority in "backfill" action, otherwise a scheduled low priority pod will be evicted by a high-priority pod immediately, it is meaningless for low-priority pods to be scheduled first.
  7. Pipelined pods will be allocated in "backfill" action, in this case, "backfill" action can be configured in front of "preempt" action in configmap "volcano-scheduler-configmap".

@volcano-sh-bot volcano-sh-bot requested review from k82cn and Thor-wl March 4, 2024 01:36
@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 4, 2024
@Lily922 Lily922 changed the title Support preempting BestEffort pods upport preempting BestEffort pods when the pods number of nodes reaches the upper limit Mar 4, 2024
@Lily922 Lily922 changed the title upport preempting BestEffort pods when the pods number of nodes reaches the upper limit Support preempting BestEffort pods when the pods number of nodes reaches the upper limit Mar 4, 2024
@lowang-bh
Copy link
Member

Hi, thanks for the contributions. Please add some UTs about your scenario。

for status, tasks := range ji.TaskStatusIndex {
if status == Pending {
for _, task := range tasks {
if task.InitResreq.IsEmpty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already has a BestEffort flag, don't need to calculate it any more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -707,6 +699,20 @@ func (ji *JobInfo) WaitingTaskNum() int32 {
return int32(len(ji.TaskStatusIndex[Pipelined]))
}

func (ji *JobInfo) PendingBestEffortTaskNum() int32 {
count := 0
for status, tasks := range ji.TaskStatusIndex {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my opinion, we can directly check ji.TaskStatusIndex[Pending] and sum up the count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

job.NodesFitErrors[task.UID] = fe
break
}
if !task.InitResreq.IsEmpty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use task.BestEffort property directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Lily922 Lily922 force-pushed the preempt branch 4 times, most recently from d08ed78 to a66a7fa Compare March 5, 2024 02:31
@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2024
@Lily922
Copy link
Contributor Author

Lily922 commented Mar 5, 2024

Hi, thanks for the contributions. Please add some UTs about your scenario。

done

for _, job := range tc.jobs {
jobInfo := api.NewJobInfo(job.uid, job.tasks...)
jobInfo.Queue = "queue-1"
jobInfo.PodGroup = &api.PodGroup{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use BuildXX function in pkg/scheduler/util/test_utils.go to gen those basic node/pg/queue to clean code.
And use NewDefaultMockSchedulerCache in pkg/scheduler/cache/cache_mock.go to gen the SchedulerCache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -117,8 +117,8 @@ func (pmpt *Action) Execute(ssn *framework.Session) {
if !api.PreemptableStatus(task.Status) {
return false
}
// Ignore task with empty resource request.
if task.Resreq.IsEmpty() {
// BestEffort pod is not supported to preempt unBestEffort pod.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a besteffort pod has a higher priority, preemption will not happen, is this expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a besteffort pod has a higher priority, preemption will not happen, is this expected?

yes, besteffort pod will be scheduled in "backfill" action, unbesteffort pods will be scheduled in "allocate" action. In normal use, allocate is executed before backfill, the preempted pod will be scheduled first in the next scheduler cycle

@@ -17,14 +17,16 @@ limitations under the License.
package api

import (
"k8s.io/apimachinery/pkg/api/resource"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sort imports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sort imports.
done

…hes the upper limit

Signed-off-by: lili <lili_9309@163.com>
@wangyang0616
Copy link
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2024
Copy link
Member

@william-wang william-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2024
@volcano-sh-bot volcano-sh-bot merged commit 0559427 into volcano-sh:master Mar 11, 2024
14 checks passed
@Lily922 Lily922 deleted the preempt branch April 7, 2024 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support preempting BestEffort pods when the pods number of nodes reaches the upper limit
6 participants