Skip to content

Commit

Permalink
proposal
Browse files Browse the repository at this point in the history
Signed-off-by: ykcai-daniel <1155141377@link.cuhk.edu.hk>
  • Loading branch information
ykcai-daniel committed Sep 6, 2024
1 parent fada548 commit ba86b30
Show file tree
Hide file tree
Showing 29 changed files with 117 additions and 552 deletions.
5 changes: 0 additions & 5 deletions config/crd/volcano/bases/scheduling.volcano.sh_queues.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -136,11 +136,6 @@ spec:
parent:
description: Parent define the parent of queue
type: string
priority:
description: Priority define the priority of queue. Higher values
are prioritized for scheduling and considered later during reclamation.
format: int32
type: integer
reclaimable:
description: Reclaimable indicate whether the queue can be reclaimed
by other queue
Expand Down
4 changes: 0 additions & 4 deletions docs/design/job-resource-reservation-design.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,6 @@

@[Thor-wl](https://github.com/Thor-wl); Aug 19th, 2020

## Deprecated

The `Reservation` plugin has been deleted. This doc is deprecated. It will be deleted in the future.

## Motivation
As [issue 13](https://github.com/volcano-sh/volcano/issues/13) / [issue 748](https://github.com/volcano-sh/volcano/issues/748)
/ [issue 947](https://github.com/volcano-sh/volcano/issues/947) mentioned, current scheduler strategy may result in
Expand Down
29 changes: 29 additions & 0 deletions docs/design/pod-scheduling-readiness.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# Pod Scheduling Readiness
## Motivation

Pod Scheduling Readiness is a beta feature in Kubernetes v1.27. Users expect Volcano to be aware of it. By specifying/removing a Pod's `.spec.schedulingGates`, which is an array of strings, users can control when a Pod is ready to be considered for scheduling. The scheduling gates are usually removed by external controllers. `kube-scheduler` will only schedule a gated pod once all scheduling gates of it are removed (i.e. the `.spec.schedulingGates` is an empty arry)\. Note that for K8S resources that use `PodTemplate` (Deployment, StatefulSet etc.), if the `.spec.schedulingGates` field of the template is emptied, the gates of all the pods created based on that template will also be removed.

This feature is often used to implement more performant Quota Managers. K8S have a native ResourceQuota resource, which rejects a pod if it exceed the quota. However, if a quota manager is implemented using ResourceQuota, one downside is user need to keep polling the api-server until there are sufficient resources the pod is schedulable. To solve this problem, scheduling gates enable users to first create pods that are not yet schedulable, and by removing the gates, it signals kube-scheduler that the pod is ready to be allocated to a node, thus no longer require polling.


## Function Detail
To support Pod Scheduling Readiness, the modifications mainly happen in `scheduler/actions`. In each actions except Enqueue such as Allocation, Preempte, Backfill, the scheduler will skip the tasks of a job whose Pods are scheduling gated, ensuring that a scheduling gated pod will not be bound to a node.




## Feature Interactions
1. **Actions**: Jobs with scheduling gates will first be enqueued by the Enqueue action and transit to the Inqueue state. In the following allocate, preempt actions, tasks with scheduling gated Pod will be skipped, which might result in gang-unschedulable job.
2. **Plugins**: Since scheduling gated pods are not ready to be allocated, we don't want scheduling gated tasks to consume inqueue resources, making other potentially schedulable jobs uninqueuable. Therefore, proportion, capacity and overcommit plugins have to be changed. Since scheduling gated pods will not be allocated, we only need to deduct it from inqueue resources.
3. **Events and Conditions**: As shown below, when listing Pod with Kubectl, the pods that are shown below will have `SchedulingGated` status, which is the Reason of the condition in pod status. To align with K8S, we need to skip scheduling gated tasks when updating pods conditions after each scheduling cycle and show the original condition created by K8S.
4. **Controllers**: K8S native resources like Deployment support template level removal of scheduling gates. In other words, if a Deployment has scheduling gates in its pod template, by patching the Deployment and remove the scheduling gates, all its pods will be deleted and recreated without gates. However, for Vcjob, currently the job controller cannot detect changes in PodTemplate and cannot support this feature. Despite this, it is uncommon to remove scheduling gates from PodTemplate and the Pod Scheduling Gates feature is usually used at pod level. What's more, scheduling gates are often added by webhooks instead of in the Job template (more details in K8S KEP). Therefore, we choose to not align this behavior with K8S.

## Limitations
1. **Vcjob does not support removing scheduling gates in template**: As mentioned above, currently, if we remove the scheduling gates field in a Vcjob, the gates of its pods are not removed. This behavior of Vcjob is different from native K8S resources.

2. **Inaccurate message for scheduling gated podgroups**: Currently, if a Job is gang-unschedulable due to gates, its condition is Unschedulable with reason `NotEnoughResources`, which comes from gang plugin. A better message should notify user about the scheduling gates. Moreover, if a K8S resources like Deployment is Pending due to scheduling gates, there is no event. However, for vcjob, we get a gang scheduling failed event each scheduling cycle, which might be a burden of performence.

## References
1. K8S Pod Scheduling Readiness Documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/
2. Pod Scheduling Readiness KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness/README.md
3. K8S Pod Scheduling Readiness Blog: https://kubernetes.io/blog/2022/12/26/pod-scheduling-readiness-alpha/
38 changes: 0 additions & 38 deletions docs/design/queue-priority.md

This file was deleted.

4 changes: 2 additions & 2 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ require (
sigs.k8s.io/controller-runtime v0.13.0
sigs.k8s.io/yaml v1.3.0
stathat.com/c/consistent v1.0.0
volcano.sh/apis v1.10.0-alpha.0.0.20240904072435-17826d64a144
volcano.sh/apis v1.10.0-alpha.0.0.20240709084748-78d912ce096c
)

require (
Expand Down Expand Up @@ -118,7 +118,7 @@ require (
google.golang.org/genproto v0.0.0-20230803162519-f966b187b2e5 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20230726155614-23370e0ffb3e // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20230822172742-b8732ec3820d // indirect
google.golang.org/grpc v1.65.0 // indirect
google.golang.org/grpc v1.58.3 // indirect
google.golang.org/protobuf v1.33.0 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/natefinch/lumberjack.v2 v2.2.1 // indirect
Expand Down
4 changes: 2 additions & 2 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -406,5 +406,5 @@ sigs.k8s.io/yaml v1.3.0 h1:a2VclLzOGrwOHDiV8EfBGhvjHvP46CtW5j6POvhYGGo=
sigs.k8s.io/yaml v1.3.0/go.mod h1:GeOyir5tyXNByN85N/dRIT9es5UQNerPYEKK56eTBm8=
stathat.com/c/consistent v1.0.0 h1:ezyc51EGcRPJUxfHGSgJjWzJdj3NiMU9pNfLNGiXV0c=
stathat.com/c/consistent v1.0.0/go.mod h1:QkzMWzcbB+yQBL2AttO6sgsQS/JSTapcDISJalmCDS0=
volcano.sh/apis v1.10.0-alpha.0.0.20240904072435-17826d64a144 h1:X+HAVl9WcV+hXkmXnkfs4ikaBeAEoH6V+2Wxrb3j4Vs=
volcano.sh/apis v1.10.0-alpha.0.0.20240904072435-17826d64a144/go.mod h1:z8hhFZ2qcUMR1JIjVYmBqL98CVaXNzsQAcqKiytQW9s=
volcano.sh/apis v1.10.0-alpha.0.0.20240709084748-78d912ce096c h1:22v0MX/Cd9PQX5GU/vF1DUiVzPEvPmEv2CjZuAt3fmU=
volcano.sh/apis v1.10.0-alpha.0.0.20240709084748-78d912ce096c/go.mod h1:z8hhFZ2qcUMR1JIjVYmBqL98CVaXNzsQAcqKiytQW9s=
Original file line number Diff line number Diff line change
Expand Up @@ -135,11 +135,6 @@ spec:
parent:
description: Parent define the parent of queue
type: string
priority:
description: Priority define the priority of queue. Higher values
are prioritized for scheduling and considered later during reclamation.
format: int32
type: integer
reclaimable:
description: Reclaimable indicate whether the queue can be reclaimed
by other queue
Expand Down
4 changes: 4 additions & 0 deletions installer/helm/chart/volcano/templates/admission-init.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,10 @@ kind: Job
metadata:
name: {{ .Release.Name }}-admission-init
namespace: {{ .Release.Namespace }}
{{- if .Values.custom.common_labels }}
labels:
{{- toYaml .Values.custom.common_labels | nindent 4 }}
{{- end }}
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
Expand Down
5 changes: 0 additions & 5 deletions installer/volcano-development.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4849,11 +4849,6 @@ spec:
parent:
description: Parent define the parent of queue
type: string
priority:
description: Priority define the priority of queue. Higher values
are prioritized for scheduling and considered later during reclamation.
format: int32
type: integer
reclaimable:
description: Reclaimable indicate whether the queue can be reclaimed
by other queue
Expand Down
4 changes: 2 additions & 2 deletions pkg/cli/jobflow/describe.go
Original file line number Diff line number Diff line change
Expand Up @@ -71,9 +71,9 @@ func DescribeJobFlow(ctx context.Context) error {
// Remove managedFields
jobFlow.ManagedFields = nil
PrintJobFlowDetail(&jobFlow, describeJobFlowFlags.Format)
// Print space if it's not the last element
// Print a separator if it's not the last element
if len(jobFlows.Items) != 1 && i < len(jobFlows.Items)-1 {
fmt.Printf("\n\n")
fmt.Println("---------------------------------")
}
}
// Get jobflow detail
Expand Down
5 changes: 2 additions & 3 deletions pkg/cli/jobflow/jobflow_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,6 @@ import (
"reflect"
"strings"
"testing"
"time"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

Expand All @@ -50,7 +49,7 @@ func TestListJobFlow(t *testing.T) {
ObjectMeta: metav1.ObjectMeta{
Name: "test-jobflow",
Namespace: "default",
CreationTimestamp: metav1.Time{Time: time.Now().UTC().AddDate(0, 0, -3)},
CreationTimestamp: metav1.Now(),
},
Status: flowv1alpha1.JobFlowStatus{
State: flowv1alpha1.State{
Expand All @@ -63,7 +62,7 @@ func TestListJobFlow(t *testing.T) {
Namespace: "default",
ExpectedErr: nil,
ExpectedOutput: `Name Namespace Phase Age
test-jobflow default Succeed 3d`,
test-jobflow default Succeed 0s`,
},
}
for _, testCase := range testCases {
Expand Down
23 changes: 5 additions & 18 deletions pkg/cli/jobtemplate/describe.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,6 @@ import (
"context"
"encoding/json"
"fmt"
"os"

"github.com/spf13/cobra"

Expand Down Expand Up @@ -67,28 +66,15 @@ func DescribeJobTemplate(ctx context.Context) error {
if err != nil {
return err
}
for i, jobTemplate := range jobTemplates.Items {
// Remove managedFields
jobTemplate.ManagedFields = nil
for _, jobTemplate := range jobTemplates.Items {
PrintJobTemplateDetail(&jobTemplate, describeJobTemplateFlags.Format)
// Print space if it's not the last element
if len(jobTemplates.Items) != 1 && i < len(jobTemplates.Items)-1 {
fmt.Printf("\n\n")
}
}
// Get job template detail
} else {
jobTemplate, err := jobTemplateClient.FlowV1alpha1().JobTemplates(describeJobTemplateFlags.Namespace).Get(ctx, describeJobTemplateFlags.Name, metav1.GetOptions{})
if err != nil {
return err
}
// Remove managedFields
jobTemplate.ManagedFields = nil
// Set APIVersion and Kind if not set
if jobTemplate.APIVersion == "" || jobTemplate.Kind == "" {
jobTemplate.APIVersion = v1alpha1.SchemeGroupVersion.String()
jobTemplate.Kind = "JobTemplate"
}
PrintJobTemplateDetail(jobTemplate, describeJobTemplateFlags.Format)
}

Expand All @@ -112,14 +98,15 @@ func printJSON(jobTemplate *v1alpha1.JobTemplate) {
if err != nil {
fmt.Printf("Error marshaling JSON: %v\n", err)
}
os.Stdout.Write(b)
fmt.Println("")
fmt.Println(string(b))
fmt.Println("---------------------------------")
}

func printYAML(jobTemplate *v1alpha1.JobTemplate) {
b, err := yaml.Marshal(jobTemplate)
if err != nil {
fmt.Printf("Error marshaling YAML: %v\n", err)
}
os.Stdout.Write(b)
fmt.Println(string(b))
fmt.Println("---------------------------------")
}
13 changes: 6 additions & 7 deletions pkg/cli/jobtemplate/jobtemplate_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -284,14 +284,14 @@ func TestDescribeJobTemplate(t *testing.T) {
Name: "test-jobtemplate",
Format: "yaml",
ExpectedErr: nil,
ExpectedOutput: `apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
ExpectedOutput: `metadata:
creationTimestamp: null
name: test-jobtemplate
namespace: default
spec: {}
status: {}`,
status: {}
---------------------------------`,
},
{
name: "Normal Case, use json format",
Expand All @@ -306,16 +306,15 @@ status: {}`,
Format: "json",
ExpectedErr: nil,
ExpectedOutput: `{
"kind": "JobTemplate",
"apiVersion": "flow.volcano.sh/v1alpha1",
"metadata": {
"name": "test-jobtemplate",
"namespace": "default",
"creationTimestamp": null
},
"spec": {},
"status": {}
}`,
}
---------------------------------`,
},
}
for _, testCase := range testCases {
Expand Down
17 changes: 8 additions & 9 deletions pkg/cli/pod/pod_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,6 @@ import (
"reflect"
"strings"
"testing"
"time"

corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
Expand Down Expand Up @@ -59,7 +58,7 @@ func TestListPod(t *testing.T) {
JobName: "",
ExpectedErr: nil,
ExpectedOutput: `Name Ready Status Restart Age
my-pod 0/1 Running 0 3d`,
my-pod 0/1 Running 0 0s`,
},
{
name: "Normal Case with namespace filter",
Expand All @@ -73,7 +72,7 @@ my-pod 0/1 Running 0 3d`,
JobName: "",
ExpectedErr: nil,
ExpectedOutput: `Name Ready Status Restart Age
my-pod 0/1 Running 0 3d`,
my-pod 0/1 Running 0 0s`,
},
{
name: "Normal Case with jobName filter",
Expand All @@ -87,7 +86,7 @@ my-pod 0/1 Running 0 3d`,
JobName: "my-job1",
ExpectedErr: nil,
ExpectedOutput: `Name Ready Status Restart Age
my-pod 0/1 Running 0 3d`,
my-pod 0/1 Running 0 0s`,
},
{
name: "Normal Case with queueName filter",
Expand All @@ -105,9 +104,9 @@ my-pod 0/1 Running 0 3d`,
QueueName: "my-queue1",
ExpectedErr: nil,
ExpectedOutput: `Name Ready Status Restart Age
my-pod1 0/1 Running 0 3d
my-pod2 0/1 Running 0 3d
my-pod3 0/1 Running 0 3d`,
my-pod1 0/1 Running 0 0s
my-pod2 0/1 Running 0 0s
my-pod3 0/1 Running 0 0s`,
},
{
name: "Normal Case with queueName filter and jobName filter",
Expand All @@ -122,7 +121,7 @@ my-pod3 0/1 Running 0 3d`,
JobName: "my-job1",
ExpectedErr: nil,
ExpectedOutput: `Name Ready Status Restart Age
my-pod1 0/1 Running 0 3d`,
my-pod1 0/1 Running 0 0s`,
},
{
name: "Normal Case with queueName filter and jobName filter, and does not match",
Expand Down Expand Up @@ -203,7 +202,7 @@ func buildPod(namespace, name string, labels map[string]string, annotations map[
Namespace: namespace,
Labels: labels,
Annotations: annotations,
CreationTimestamp: metav1.Time{Time: time.Now().UTC().AddDate(0, 0, -3)},
CreationTimestamp: metav1.Now(),
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
Expand Down
5 changes: 0 additions & 5 deletions pkg/scheduler/actions/preempt/preempt.go
Original file line number Diff line number Diff line change
Expand Up @@ -178,11 +178,6 @@ func (pmpt *Action) Execute(ssn *framework.Session) {
if preemptor.BestEffort && !task.BestEffort {
return false
}
// should skip not preemptable pod
if !task.Preemptable {
return false
}

// Preempt tasks within job.
return preemptor.Job == task.Job
}, ph)
Expand Down
Loading

0 comments on commit ba86b30

Please sign in to comment.