Support more actions for volcano job failure scenario #3813

Merged
merged 1 commit into from
Jan 15, 2025
support more actions for volcano job failure scenario
Signed-off-by: Box Zhang <wszwbsddbk@gmail.com>
bibibox committed Jan 15, 2025
commit 68da9635aa0c0bd8ded8b4db0916d688f0ae4113
86 changes: 76 additions & 10 deletions docs/user-guide/how_to_use_job_policy.md
@@ -6,28 +6,30 @@ by configuring `policy` for the volcano job under `job.spec`.

 ## Key Points
 * Volcano allows users to configure a pair of `Event` (or `Events`) and `Action` for a volcano job or a task. If the specified
-event(events) happens, the target action will be triggered.
+event (or events) occurs, the target action is triggered. If `timeout` is configured, the action is executed after the timeout delay.
 * If the policy is configured under `job.spec` only, it applies to all tasks by default. If the policy is configured
 under `task.spec` only, it applies to that task only. If policies are configured at both the job and task level, the
 task-level policy takes precedence.
 * Users can set multiple policies for a job or a task.
-* Currently, Volcano provides **5 build-in events** for users. The details are as follows.
+* Currently, Volcano provides **6 built-in events** for users. The details are as follows.

-| ID | Event | Description |
-|----|-------|-------------|
-| 1 | `PodFailed` | Check whether there is any pod' status is `Failed`. |
-| 2 | `PodEvicted` | Check whether there is any pod is evicted. |
-| 3 | `Unknown` | Check whether the status of a volcano job is `Unknown`. The most possible factor is task unschedulable. It is triggered when part pods can't be scheduled while some are already running in gang-scheduling case. |
-| 4 | `TaskCompleted` | Check whether there is a task whose all pods are succeed. If `minsuccess` is configured for a task, it will also be regarded as task completes. |
-| 5 | `*` | It means all the events, which is not so common used. |
+| ID | Event | Description |
+|----|-------|-------------|
+| 1 | `PodFailed` | Checks whether any pod's status is `Failed`. |
+| 2 | `PodEvicted` | Checks whether any pod has been evicted. |
+| 3 | `PodPending` | Checks whether any pod is pending. It is usually used with `timeout`. If the pod is no longer pending, the timeout action is canceled. |
+| 4 | `TaskCompleted` | Checks whether there is a task whose pods have all succeeded. If `minsuccess` is configured for a task, a task that meets it is also regarded as completed. |
+| 5 | `Unknown` | Checks whether the status of a volcano job is `Unknown`. The most likely cause is an unschedulable task. It is triggered when some pods cannot be scheduled while others are already running in the gang-scheduling case. |
+| 6 | `*` | Matches all events; it is not commonly used. |

 * Currently, Volcano provides **6 built-in actions** for users. The details are as follows.

 | ID | Action | Description |
 |----|--------|-------------|
 | 1 | `AbortJob` | Abort the whole job, but it can be resumed. All pods will be evicted and no pod will be recreated. |
 | 2 | `RestartJob` | Restart the whole job. |
-| 3 | `RestartTask` | Default action. The task will be restarted. This action **cannot** work with job level events such as `Unknown`. |
+| 3 | `RestartTask` | The task will be restarted. This action **cannot** work with job-level events such as `Unknown`. |
+| 4 | `RestartPod` | The pod will be restarted. This action **cannot** work with job-level events such as `Unknown`. |
 | 5 | `TerminateJob` | Terminate the whole job; it **cannot** be resumed. All pods will be evicted and no pod will be recreated. |
 | 6 | `CompleteJob` | Regard the job as completed. The unfinished pods will be killed. |
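
The precedence rule in the key points (a task-level policy wins over a job-level one, and `*` matches every event) can be illustrated with a small Go sketch. The `Policy` type and the `resolveAction` function below are hypothetical names invented for this illustration and are not Volcano's controller code. The fallback to job-level policies when no task-level policy matches is also an assumption.

```go
package main

import "fmt"

// Policy is a simplified, hypothetical stand-in for a lifecycle policy:
// a set of event names paired with one action.
type Policy struct {
	Events []string // event names; "*" matches every event
	Action string
}

// matches reports whether the policy applies to the given event.
func (p Policy) matches(event string) bool {
	for _, e := range p.Events {
		if e == "*" || e == event {
			return true
		}
	}
	return false
}

// resolveAction consults task-level policies first and falls back to
// job-level policies (assumed behavior). The boolean is false when no
// policy matches the event.
func resolveAction(taskPolicies, jobPolicies []Policy, event string) (string, bool) {
	for _, p := range taskPolicies {
		if p.matches(event) {
			return p.Action, true
		}
	}
	for _, p := range jobPolicies {
		if p.matches(event) {
			return p.Action, true
		}
	}
	return "", false
}

func main() {
	job := []Policy{{Events: []string{"PodFailed"}, Action: "RestartJob"}}
	task := []Policy{{Events: []string{"PodFailed"}, Action: "RestartPod"}}

	action, _ := resolveAction(task, job, "PodFailed")
	fmt.Println(action) // RestartPod: the task-level policy takes precedence

	action, _ = resolveAction(nil, job, "PodFailed")
	fmt.Println(action) // RestartJob: only the job-level policy exists
}
```

Keeping the lookup in two ordered passes makes the precedence explicit and keeps the `*` wildcard a property of the policy rather than of the caller.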

@@ -153,4 +155,68 @@ spec:
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
```
3. Set a pair of `events` and `action`.
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: []
  queue: default
  tasks:
    - replicas: 1
      name: ps
      policies:
        - events: [PodFailed] # Task level policy. If any pod fails in this task, restart the pod.
          action: RestartPod
        - events: [PodEvicted] # Task level policy. If any pod is evicted in this task, restart the job after 10 minutes.
          action: RestartJob
          timeout: 10m
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; ## Get the index from the environment variable and configure it in the TF job.
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
    - replicas: 2
      name: worker
      policies:
        - event: TaskCompleted # Task level policy. If this task completes, complete the job.
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources: {}
          restartPolicy: Never
```
3 changes: 2 additions & 1 deletion pkg/controllers/apis/job_info_test.go
@@ -298,12 +298,13 @@ func TestRequest_String(t *testing.T) {
 				JobName:    "testjobname",
 				QueueName:  "testqueuename",
 				TaskName:   "testtaskname",
+				PodName:    "testpodname",
 				Event:      vcbus.AnyEvent,
 				ExitCode:   0,
 				Action:     vcbus.SyncJobAction,
 				JobVersion: 0,
 			},
-			ExpectedValue: "Queue: testqueuename, Job: testnamespace/testjobname, Task:testtaskname, Event:*, ExitCode:0, Action:SyncJob, JobVersion: 0",
+			ExpectedValue: "Queue: testqueuename, Job: testnamespace/testjobname, Task:testtaskname, Pod:testpodname, Event:*, ExitCode:0, Action:SyncJob, JobVersion: 0",
 		},
 	}

5 changes: 3 additions & 2 deletions pkg/controllers/apis/request.go
@@ -32,6 +32,7 @@ type Request struct {
 	JobUid    types.UID
 	TaskName  string
 	QueueName string
+	PodName   string

 	Event    v1alpha1.Event
 	ExitCode int32
@@ -42,8 +43,8 @@ type Request struct {
 // String function returns the request in string format.
 func (r Request) String() string {
 	return fmt.Sprintf(
-		"Queue: %s, Job: %s/%s, Task:%s, Event:%s, ExitCode:%d, Action:%s, JobVersion: %d",
-		r.QueueName, r.Namespace, r.JobName, r.TaskName, r.Event, r.ExitCode, r.Action, r.JobVersion)
+		"Queue: %s, Job: %s/%s, Task:%s, Pod:%s, Event:%s, ExitCode:%d, Action:%s, JobVersion: %d",
+		r.QueueName, r.Namespace, r.JobName, r.TaskName, r.PodName, r.Event, r.ExitCode, r.Action, r.JobVersion)
 }

 // FlowRequest The object of sync operation, used for JobFlow and JobTemplate
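
To check the new format string in isolation, the following self-contained sketch reproduces the updated `String` method on a trimmed stand-in type. The lowercase `request` type and its plain `string` fields are assumptions made so the example compiles without the Volcano packages; the real `Request` uses `v1alpha1.Event` and other project types.

```go
package main

import "fmt"

// request is a trimmed, hypothetical copy of the fields that String prints.
type request struct {
	Namespace  string
	JobName    string
	TaskName   string
	QueueName  string
	PodName    string
	Event      string
	Action     string
	ExitCode   int32
	JobVersion int32
}

// String mirrors the updated format, which now includes the pod name.
func (r request) String() string {
	return fmt.Sprintf(
		"Queue: %s, Job: %s/%s, Task:%s, Pod:%s, Event:%s, ExitCode:%d, Action:%s, JobVersion: %d",
		r.QueueName, r.Namespace, r.JobName, r.TaskName, r.PodName, r.Event, r.ExitCode, r.Action, r.JobVersion)
}

func main() {
	r := request{
		Namespace: "testnamespace",
		JobName:   "testjobname",
		QueueName: "testqueuename",
		TaskName:  "testtaskname",
		PodName:   "testpodname",
		Event:     "*",
		Action:    "SyncJob",
	}
	fmt.Println(r)
}
```

With these values the printed line matches the updated `ExpectedValue` string in `job_info_test.go`.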