Open
Description
If a worker pod is evicted, the entire MPIJob goes into the Failed
state:
status:
  conditions:
  - lastTransitionTime: "2024-08-14T19:45:39Z"
    lastUpdateTime: "2024-08-14T19:45:39Z"
    message: MPIJob xxx is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2024-08-14T19:48:02Z"
    lastUpdateTime: "2024-08-14T19:48:02Z"
    message: MPIJob xxx is running.
    reason: MPIJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2024-08-15T04:01:42Z"
    lastUpdateTime: "2024-08-15T04:01:42Z"
    message: 1/8 workers are evicted
    reason: MPIJobEvicted
    status: "True"
    type: Failed
  replicaStatuses:
    Launcher:
      failed: 1
    Worker:
      active: 7
      failed: 1
  startTime: "2024-08-14T19:45:39Z"
However, the run policy below is not honored once the job fails, and the remaining worker pods are kept in the Running state:
runPolicy:
  backoffLimit: 1
  cleanPodPolicy: Running
  ttlSecondsAfterFinished: 10800
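
To make the expectation concrete, here is a minimal Python sketch of how I would expect cleanPodPolicy to be applied once the job reaches a finished state (Failed here). This is not the operator's actual code: the `Pod` class, the `pods_to_clean` helper, and the pod names are hypothetical, and the policy semantics (None keeps everything, All deletes everything, Running deletes only pods still in the Running phase) are my reading of the policy names.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    # Hypothetical stand-in for a Kubernetes pod; only the fields needed here.
    name: str
    phase: str  # pod phase, e.g. "Running", "Failed", "Succeeded"


def pods_to_clean(pods, clean_pod_policy, job_finished):
    """Return the pods expected to be deleted under the given policy.

    Assumed semantics (not taken from the operator source):
      "None"    -> delete nothing
      "All"     -> delete every pod
      "Running" -> delete only pods still in the Running phase
    """
    if not job_finished or clean_pod_policy == "None":
        return []
    if clean_pod_policy == "All":
        return list(pods)
    if clean_pod_policy == "Running":
        return [p for p in pods if p.phase == "Running"]
    raise ValueError(f"unknown cleanPodPolicy: {clean_pod_policy!r}")


# Mirrors the status above: 7 workers still active, 1 evicted (phase Failed).
workers = [Pod(f"worker-{i}", "Running") for i in range(7)]
workers.append(Pod("worker-7", "Failed"))

for pod in pods_to_clean(workers, "Running", job_finished=True):
    print(pod.name)
```

Under these assumptions, the 7 workers that are still running would be selected for deletion after the MPIJob is marked Failed, which is not what happens in the case reported here.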