Some "TIMED OUT" jobs not restarting properly #425

Open
@brendon-cavainolo

Description

Hello,

I'm seeing a weird bug when Maestro tries to restart jobs.

Say I've launched one study with 8 jobs. When these 8 jobs time out, maybe half of them restart successfully (meaning they are resubmitted to Slurm, the "TIMED OUT" status changes back to "RUNNING", and the number of restarts goes up).

The other half of the jobs never register as "TIMED OUT", and are never resubmitted to Slurm. The maestro status command still shows them as "RUNNING", but the number of restarts does not increase. The jobs also no longer show up in the study.log file.
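One way to confirm the mismatch is to compare what Slurm reports against what maestro status shows. The following is only a rough sketch: it assumes `sacct` is available on the cluster, and the helper names (`parse_sacct`, `stale_jobs`, `query_sacct`) and the hand-filled `maestro_states` dictionary are hypothetical, not part of Maestro.

```python
import subprocess


def parse_sacct(output):
    """Map job ID -> Slurm state from `sacct --parsable2 --noheader` text."""
    states = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        job_id, state = line.split("|")[:2]
        # Skip job steps such as "123.batch" or "123.0".
        if "." in job_id:
            continue
        # States may carry a suffix, e.g. "CANCELLED by 1000".
        parts = state.split()
        states[job_id] = parts[0] if parts else ""
    return states


def stale_jobs(slurm_states, maestro_states):
    """Job IDs Slurm reports as TIMEOUT while the study still lists RUNNING.

    `maestro_states` is a hypothetical dict filled in by hand from the
    output of `maestro status` (job ID -> status string).
    """
    return sorted(
        job_id
        for job_id, status in maestro_states.items()
        if status == "RUNNING" and slurm_states.get(job_id) == "TIMEOUT"
    )


def query_sacct(job_ids):
    """Ask Slurm for the current state of the given job IDs."""
    cmd = [
        "sacct",
        "-j", ",".join(job_ids),
        "--format=JobID,State",
        "--parsable2",
        "--noheader",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `stale_jobs(parse_sacct(query_sacct(ids)), maestro_states)` would list the jobs that Slurm considers timed out but the study never picked up.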

Something to note is that the initial runs of these jobs typically all end within a few minutes of each other.

Hopefully I've provided enough information here to help figure this out.
