
Largest healpix jobs in Jura crash without logging an error message #2277

Closed

akremin opened this issue Jun 11, 2024 · 1 comment
akremin (Member) commented Jun 11, 2024

When healpix 27256 and 27258 were run in Jura, the jobs failed because they had over 2000 and 4000 input files, respectively. The original scripts:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258.slurm

failed, as did others in Jura with large memory footprints.

We tried reducing the number of MPI ranks, which had worked for some other jobs with large memory footprints:

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256-lowrank.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258-lowrank.slurm

but that didn't work either.
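For context, the -lowrank variants amount to lowering the rank count in the srun line of the Slurm script. A minimal sketch of that pattern, with illustrative values and a placeholder command rather than the exact contents of those scripts:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --constraint=cpu
#SBATCH --time=01:00:00

# Illustrative values only (not the actual Jura script contents):
# fewer MPI ranks, each with more cores, leaves more memory per rank.
# A standard run might use something like -n 64 -c 2; a low-rank
# variant trades parallelism for per-rank memory headroom.
srun -n 8 -c 16 <healpix pipeline command>
```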

Running with a single MPI rank on a CPU node also failed. What eventually worked was running the desi_group_spectra command in serial on a CPU node, then running redrock and the afterburners with the normal MPI setup.
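In outline, the successful recipe looked like the following. This is a hedged sketch: the flags for desi_group_spectra and the redrock invocation are assumptions for illustration, not copied from the actual Jura scripts.

```bash
# Step 1: group the input spectra in serial (plain python, no srun/MPI)
# on a CPU node. Flag names here are illustrative assumptions about the
# desi_group_spectra command line, not its exact interface.
desi_group_spectra --healpix 27256 --survey special --program other \
    --outfile spectra-special-other-27256.fits

# Step 2: once the grouped spectra file exists, run redrock (and then
# the afterburners) with the normal MPI launch. Rank counts and
# arguments are illustrative.
srun -n 64 -c 2 rrdesi_mpi spectra-special-other-27256.fits \
    -o redrock-special-other-27256.fits
```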

What made this case harder to debug was that the error messages were not propagated to the logs. Two things should be done to mitigate this:

  1. Identify why the error isn't logged and fix that.
  2. Solve the underlying issue that caused the scripts to fail, likely by reducing ranks or eliminating MPI for these extreme cases.
sbailey (Contributor) commented Sep 9, 2024

Missing log messages were not an issue in Kibo. Some healpix jobs crashed due to OOM (out of memory) errors, but all of them included error messages in their logs. I'll close this ticket and leave #2279 for tracking additional memory fixes.

sbailey closed this as completed Sep 9, 2024