Worker pods not cleaned up upon MPIJobEvicted event #647

Open
@shaowei-su

Description

If a worker pod is evicted, the entire MPIJob transitions to the Failed state:

status:
  conditions:
  - lastTransitionTime: "2024-08-14T19:45:39Z"
    lastUpdateTime: "2024-08-14T19:45:39Z"
    message: MPIJob xxx is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2024-08-14T19:48:02Z"
    lastUpdateTime: "2024-08-14T19:48:02Z"
    message: MPIJob xxx is running.
    reason: MPIJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2024-08-15T04:01:42Z"
    lastUpdateTime: "2024-08-15T04:01:42Z"
    message: 1/8 workers are evicted
    reason: MPIJobEvicted
    status: "True"
    type: Failed
  replicaStatuses:
    Launcher:
      failed: 1
    Worker:
      active: 7
      failed: 1
  startTime: "2024-08-14T19:45:39Z"

However, once the job has failed this way, the run policy is not honored: the remaining worker pods are left running instead of being cleaned up.

  runPolicy:
    backoffLimit: 1
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 10800
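
To make the expected behavior concrete, here is a minimal sketch of the cleanup selection I would expect once the job reaches Failed. It is not the operator's actual code: `Pod`, `pods_to_clean`, and the pod names are hypothetical, and it assumes `cleanPodPolicy: Running` means "delete pods that are still active (Pending or Running)".

```python
from dataclasses import dataclass

# Assumption: "active" covers the Pending and Running pod phases.
ACTIVE_PHASES = {"Pending", "Running"}


@dataclass
class Pod:
    name: str   # hypothetical pod name, for illustration only
    phase: str  # Kubernetes pod phase, e.g. "Running" or "Failed"


def pods_to_clean(pods, clean_pod_policy):
    """Return the pods expected to be deleted after the job has finished."""
    if clean_pod_policy == "None":
        return []
    if clean_pod_policy == "All":
        return list(pods)
    if clean_pod_policy == "Running":
        # Keep finished pods (e.g. the evicted one) and delete the rest.
        return [p for p in pods if p.phase in ACTIVE_PHASES]
    raise ValueError(f"unknown cleanPodPolicy: {clean_pod_policy}")


# Mirrors the replicaStatuses above: 7 active workers, 1 failed (evicted).
workers = [Pod(f"xxx-worker-{i}", "Running") for i in range(7)]
workers.append(Pod("xxx-worker-7", "Failed"))

for pod in pods_to_clean(workers, "Running"):
    print("expected to be deleted:", pod.name)
```

Under that reading, the seven active workers should be deleted once the MPIJobEvicted condition is set. In the observed behavior none of them are.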
