When tiles 27256 and 27258 were run in Jura, they failed due to having over 2000 and 4000 input files respectively. The original scripts failed, as did others in Jura with large memory footprints. We tried reducing the number of MPI ranks, which had worked on some other tiles with large memory footprints, but that didn't work either. Running with a single MPI rank on a CPU node also failed. What eventually worked was running the explicit python command `desi_group_spectra` in serial on a CPU node, then running with normal MPI for redrock and the afterburners.

What made this case harder to debug was that the error messages were not propagated to the logs. So two things should be done to mitigate this:

1. Identify why the error isn't logged and fix that.
2. Solve the underlying issue that caused the scripts to fail, likely by reducing ranks or eliminating MPI for these extreme cases.
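On the first mitigation point (errors never reaching the logs), a generic pattern is to log the full traceback and flush log handlers before the MPI job is torn down. A minimal sketch, assuming a wrapper around each rank's work — `run_rank_work`, the logger name, and the `work` callable are illustrative, not the desispec API:

```python
import logging
import sys
import traceback

log = logging.getLogger("groupspec")


def run_rank_work(work, comm=None):
    """Run one rank's work; on failure, log the full traceback and flush
    handlers *before* aborting, so the message reaches the log file."""
    try:
        return work()
    except Exception:
        # Log the complete traceback, not just the exception message.
        log.error("rank work failed:\n%s", traceback.format_exc())
        # Flush so buffered log output is written before the abort kills us.
        for handler in log.handlers:
            handler.flush()
        sys.stderr.flush()
        if comm is not None:
            comm.Abort(1)  # mpi4py: terminate all ranks immediately
        raise
```

With `comm=None` this degrades gracefully to a plain re-raise after logging, so the same wrapper works in the serial fallback case.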
Missing log messages were not an issue in Kibo. Some healpix jobs crashed due to OOM (out of memory), but all of them included error messages. I'll close this ticket and leave #2279 for tracking additional memory fixes.
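On the second mitigation point (reducing ranks or eliminating MPI for extreme cases), the rank count could be derived from the number of input files rather than fixed. The heuristic below is hypothetical, not desispec code, and the memory numbers are made-up placeholders:

```python
def choose_num_ranks(n_input_files, max_ranks=64, mem_per_file_gb=0.25,
                     overhead_per_rank_gb=4.0, node_mem_gb=512.0):
    """Pick an MPI rank count that fits in node memory; 1 means run serially.

    Assumed (hypothetical) memory model: the inputs cost a fixed amount
    regardless of rank count, and each rank adds a working-set overhead.
    """
    # Memory needed just to hold the inputs, independent of rank count.
    file_mem = n_input_files * mem_per_file_gb
    # Memory left over for per-rank working copies and buffers.
    headroom = node_mem_gb - file_mem
    if headroom <= overhead_per_rank_gb:
        return 1  # extreme case: eliminate MPI, run serially
    return max(1, min(max_ranks, int(headroom // overhead_per_rank_gb)))
```

With these placeholder numbers, a 4000-file tile like 27258 falls back to serial (1 rank), while typical tiles keep the full rank count.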