redrock out-of-memory on KNL cumulative redshifts #1234

Open
sbailey opened this issue Apr 7, 2021 · 0 comments
sbailey commented Apr 7, 2021

desi_tile_redshifts currently assigns one node per spectrograph, but this runs out of memory on KNL for larger tiles. For example, daily/tiles/cumulative/80738/20210406/spectra-4-80738-thru20210406.fits, which has 3.5 GB of spectra from 17 exposures, fails while loading targets:

RUNNING srun -N 1 -n 68 -c 4 rrdesi_mpi tiles/cumulative/80738/20210406/spectra-4-80738-thru20210406.fits -o tiles/cumulative/80738/20210406/redrock-4-80738-thru20210406.h5 -z tiles/cumulative/80738/20210406/zbest-4-80738-thru20210406.fits
Running with 68 processes
Loading targets...
slurmstepd: error: Detected 1 oom-kill event(s) in step 41435047.3 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid09911: task 21: Out Of Memory
srun: Terminating job step 41435047.3
slurmstepd: error: *** STEP 41435047.3 ON nid09911 CANCELLED AT 2021-04-07T13:11:49 ***

In this case, either srun -N 1 -n 34 -c 8 ... (17 min) or srun -N 2 -n 68 -c 4 ... (13 min) works. Either workaround would require some pipeline logic to identify in advance when there are too many input frames, and then drop down to fewer processes per node.
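
A minimal sketch of what that pre-check could look like, assuming the size of the input spectra is a usable proxy for memory use. The function names and the `threshold_bytes` cutoff are hypothetical and are not part of desi_tile_redshifts. The only data point from this issue is that a 3.5 GB input fails with 68 ranks on one KNL node, so the real cutoff would need to be measured.

```python
def choose_srun_layout(spectra_bytes, threshold_bytes, ranks_full=68, cpus_full=4):
    """Return (nodes, ntasks, cpus_per_task) for one spectrograph.

    threshold_bytes is a hypothetical cutoff above which the full
    68-rank layout is assumed to run out of memory on a KNL node.
    """
    if spectra_bytes <= threshold_bytes:
        # Small enough input: keep the current one-node, 68-rank layout.
        return 1, ranks_full, cpus_full
    # Large input: halve the ranks and double the CPUs per rank, so each
    # process has roughly twice the memory available on the same node.
    return 1, ranks_full // 2, cpus_full * 2


def srun_prefix(nodes, ntasks, cpus_per_task):
    """Format the srun options for the chosen layout."""
    return f"srun -N {nodes} -n {ntasks} -c {cpus_per_task}"


# Example: the 3.5 GB file from this issue, with an assumed 3 GB cutoff.
GB = 1024**3
layout = choose_srun_layout(int(3.5 * GB), threshold_bytes=3 * GB)
print(srun_prefix(*layout))
```

In a pipeline the input size could come from `os.path.getsize()` on the spectra file. The sketch falls back to the single-node `-n 34 -c 8` layout rather than `-N 2`, on the assumption that keeping one node per spectrograph is simpler to schedule. The two-node option was a few minutes faster in the test above.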

@sbailey sbailey added the crash label Apr 7, 2021