
Largest healpix jobs in Jura crash without logging an error message #2277

Closed

akremin opened this issue Jun 11, 2024 · 1 comment
akremin (Member) commented Jun 11, 2024

When healpix 27256 and 27258 were run in Jura, the jobs failed because they had over 2000 and 4000 input files, respectively. The original scripts:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258.slurm

failed, as did others in Jura with large memory footprints.

We tried reducing the number of MPI ranks, which had worked for some other jobs with large memory footprints:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256-lowrank.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258-lowrank.slurm

but that didn't work either.
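For context, the -lowrank variants amount to lowering the rank count in the srun line of the Slurm script. A minimal sketch of that pattern, with illustrative values and a placeholder command rather than the exact contents of those scripts:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --constraint=cpu
#SBATCH --time=01:00:00

# Illustrative values only (not the actual Jura script contents):
# fewer MPI ranks, each with more cores, leaves more memory per rank.
# A standard run might use something like -n 64 -c 2; a low-rank
# variant trades parallelism for per-rank memory headroom.
srun -n 8 -c 16 <healpix pipeline command>
```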

Running with a single MPI rank on a CPU node also failed. What eventually worked was running the desi_group_spectra command in serial on a CPU node, then running redrock and the afterburners with the normal MPI setup.
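In outline, the successful recipe looked like the following. This is a hedged sketch: the flags for desi_group_spectra and the redrock invocation are assumptions for illustration, not copied from the actual Jura scripts.

```bash
# Step 1: group the input spectra in serial (plain python, no srun/MPI)
# on a CPU node. Flag names here are illustrative assumptions about the
# desi_group_spectra command line, not its exact interface.
desi_group_spectra --healpix 27256 --survey special --program other \
    --outfile spectra-special-other-27256.fits

# Step 2: once the grouped spectra file exists, run redrock (and then
# the afterburners) with the normal MPI launch. Rank counts and
# arguments are illustrative.
srun -n 64 -c 2 rrdesi_mpi spectra-special-other-27256.fits \
    -o redrock-special-other-27256.fits
```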

What made this case harder to debug was that the error messages were not propagated to the logs. Two things should be done to mitigate this:

  1. Identify why the error isn't logged and fix that.
  2. Solve the underlying issue that caused the scripts to fail, likely by reducing ranks or eliminating MPI for these extreme cases.
sbailey (Contributor) commented Sep 9, 2024

Missing log messages were not an issue in Kibo. Some healpix jobs crashed due to OOM (out of memory) errors, but all of them included error messages in their logs. I'll close this ticket and leave #2279 for tracking additional memory fixes.

sbailey closed this as completed Sep 9, 2024