Commit
update job-api.md for plugin
wangyuqing4 committed Apr 15, 2019
1 parent 6f64949 commit 3447004
Showing 3 changed files with 51 additions and 1 deletion.
48 changes: 48 additions & 0 deletions docs/design/job-api.md
@@ -441,6 +441,48 @@ spec:
image: executor-img
```

### Plugins for Job

Many jobs built on AI frameworks, e.g. TensorFlow, MPI, and MXNet, need environment variables set, pods that can communicate with each other, and passwordless SSH login.
We provide Job API plugins so that users can focus on their core business logic.
Three plugins are currently available. Each plugin accepts parameters; if none are provided, defaults are used.

* env: sets `VK_TASK_INDEX` in each container; the index gives the container its identity.
* svc: creates a Service and `*.host` files so that pods can communicate with each other.
* ssh: enables passwordless SSH login, e.g. for commands such as `mpirun` or `mpiexec`.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
spec:
  minAvailable: 2
  schedulerName: kube-batch
  policies:
  - event: PodEvicted
    action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  tasks:
  - replicas: 1
    name: mpimaster
    template:
      spec:
        containers:
        - image: mpi-image
          name: mpimaster
  - replicas: 2
    name: mpiworker
    template:
      spec:
        containers:
        - image: mpi-image
          name: mpiworker
```
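
As an illustration of how a container might consume the index set by the env plugin, here is a minimal sketch in Go. It assumes the value of `VK_TASK_INDEX` is a non-negative decimal integer, which the text above does not state explicitly. The helper name `taskIndex` and the injected lookup function are hypothetical and exist only to keep the example testable.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// taskIndexEnv is the variable name from the env plugin description.
const taskIndexEnv = "VK_TASK_INDEX"

// taskIndex is a hypothetical helper. It reads the task index through the
// given lookup function (os.LookupEnv in a real container, a stub in tests).
// Assumption: the value is a non-negative decimal integer.
func taskIndex(lookup func(string) (string, bool)) (int, error) {
	raw, ok := lookup(taskIndexEnv)
	if !ok {
		return 0, fmt.Errorf("%s is not set; is the env plugin enabled?", taskIndexEnv)
	}
	idx, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s=%q is not an integer: %w", taskIndexEnv, raw, err)
	}
	if idx < 0 {
		return 0, fmt.Errorf("%s=%d must not be negative", taskIndexEnv, idx)
	}
	return idx, nil
}

func main() {
	idx, err := taskIndex(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("running as replica %d\n", idx)
}
```

Passing the lookup function as a parameter keeps the parsing logic independent of the process environment, so the same code can be exercised outside a cluster.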

## Appendix

```go
@@ -584,12 +626,18 @@ const (
	Running JobPhase = "Running"
	// Restarting is the phase in which the Job is being restarted, waiting for pods to be released and recreated
	Restarting JobPhase = "Restarting"
	// Completing is the phase in which the required tasks of the Job have completed and the Job starts to clean up
	Completing JobPhase = "Completing"
	// Completed is the phase in which all tasks of the Job have completed successfully
	Completed JobPhase = "Completed"
	// Terminating is the phase in which the Job is being terminated, waiting for pods to be released
	Terminating JobPhase = "Terminating"
	// Terminated is the phase in which the Job has finished unexpectedly, e.g. because of events
	Terminated JobPhase = "Terminated"
	// Failed is the phase in which the Job has failed after its restarts reached the maximum number of retries
	Failed JobPhase = "Failed"
	// Inqueue is the phase in which the cluster has idle resources to schedule the Job
	Inqueue JobPhase = "Inqueue"
)

// JobState contains details for the current state of the job.
1 change: 1 addition & 0 deletions example/openmpi-hello.yaml
@@ -8,6 +8,7 @@ spec:
  plugins:
    ssh: []
    env: []
    svc: []
  tasks:
  - replicas: 1
    name: mpimaster
3 changes: 2 additions & 1 deletion example/tensorflow-benchmark.yaml
@@ -7,6 +7,7 @@ spec:
  schedulerName: kube-batch
  plugins:
    env: []
    svc: []
  policies:
  - event: PodEvicted
    action: RestartJob
@@ -57,4 +58,4 @@ spec:
name: tfjob-port
resources: {}
workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
restartPolicy: OnFailure
restartPolicy: OnFailure
