Description
Hello,
I'm seeing a weird bug when Maestro tries to restart jobs.
Say I've launched one study with 8 jobs. When these 8 jobs time out, maybe half of them restart successfully (meaning they are resubmitted to Slurm, the "TIMED OUT" status changes back to "RUNNING", and the number of restarts goes up).
The other half of the jobs never register as "TIMED OUT", and are never resubmitted back to Slurm. The maestro status command still shows them as "RUNNING", but does not increment the number of restarts. The jobs also no longer show up in the study.log file.
One thing to note is that the initial runs of these jobs typically all end within a few minutes of each other.
Hopefully I've provided enough information here to help figure this out.