Memory issues for large healpix jobs in Jura #2279
Examples from Jura for debugging:

OOM when bundling healpix
Jobs OOMed when running N>1 healpix in the same job, but worked when split out into one job per healpix.
Update: OOM fixed by PR #2290. Might still need to tune job runtimes.
Caveat: some of these timed out when running as a single healpix and needed more walltime, but they didn't OOM.

OOM during spectra creation
e.g.
These required long custom runs in CPU interactive nodes to get the spectra files generated before proceeding.

Slow spectra groupings
v1 dark healpix 27258 spent 21 minutes in groupspec combining 975 frame files, and then only needed 43 seconds for redrock (!)
Kibo report
Kibo was run with bundling 10 healpix per job, with only two jobs having memory problems: zpix-special-dark-26192-27251.slurm with commands like
For these we resorted to generating the spectra and coadd files in a separate interactive job with commands like
Note: "272" was hardcoded and used for healpix 272xx, and similarly for 273 etc. After generating the spectra and coadd files, we then resubmitted the original jobs.

At minimum it would be useful to add a desi_zproc option to process desi_group_spectra one healpix at a time instead of doing 10 healpix in parallel using sub-communicators. That could be used by

The zproc wrapper to emlinefit will need more study for why that was running out of memory (given that other afterburners didn't), and what could be done.
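A minimal sketch of why a one-healpix-at-a-time option would help; the function names, memory numbers, and API below are illustrative placeholders, not the actual desi_zproc interface:

```python
def peak_memory_gb(n_healpix, mem_per_healpix_gb, one_at_a_time=False):
    """Estimated peak memory on a node for the spectral grouping step.

    With sub-communicators, all bundled healpix are grouped concurrently,
    so their working sets coexist in memory; one-at-a-time keeps only a
    single healpix resident. The numbers fed in are assumptions, not
    measured values.
    """
    if one_at_a_time:
        return mem_per_healpix_gb
    return n_healpix * mem_per_healpix_gb


def group_healpix_one_at_a_time(healpixels, group_one):
    """Loop over healpixels, letting all ranks cooperate on each in turn.

    `group_one` is a stand-in for the per-healpix spectral grouping call.
    """
    return [group_one(hpix) for hpix in healpixels]


# e.g. 10 bundled healpix at an assumed ~50 GB each: ~500 GB concurrent
# vs. ~50 GB when processed one at a time
parallel_peak = peak_memory_gb(10, 50.0)
serial_peak = peak_memory_gb(10, 50.0, one_at_a_time=True)
```

The trade-off is wall time: the serial loop gives up inter-healpix concurrency, so runtimes would likely need retuning.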
We ran into crashes, MPI rank OOMs, and timeouts in Jura with healpix jobs that had a large number (N>1000) of inputs. A closely related ticket is issue #2277, but that one is specific to issues in logging rather than crashes/timeouts.
I have pushed a branch that may help in this regard, though it wasn't necessary for Jura: prunespectragroup. That modifies the spectral grouping code to subselect only the fibers in a loaded frame that overlap a given healpixel, rather than loading information for all 500 fibers from all input files before subselecting to a given healpixel. In principle this should reduce the memory footprint: not all fibers will overlap the healpixel, so we expect to retain fewer fibers in memory. I have not yet checked that the new code produces identical results to the old code; for that reason I have not opened a pull request.
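As a rough sketch of the per-frame subselection idea (the branch's actual implementation may differ; `fiber_healpix` is assumed to be precomputed per fiber, e.g. from the frame fibermap's RA/Dec):

```python
import numpy as np

def select_fibers_for_healpix(fiber_healpix, target_hpix):
    """Indices of fibers in one frame that fall in the target healpixel.

    Subselecting per frame at load time, before accumulating across
    frames, keeps only the overlapping fibers in memory instead of all
    500 fibers from every input file.
    """
    return np.flatnonzero(np.asarray(fiber_healpix) == target_hpix)

# e.g. a frame whose fibers straddle a healpixel boundary:
fiber_healpix = np.array([27258, 27259, 27258, 27301])
keep = select_fibers_for_healpix(fiber_healpix, 27258)  # indices 0 and 2
```

Verifying identical output to the old code then amounts to checking that the union of the per-frame selections matches the old global subselection for each healpixel.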
Beyond this, we may want to consider scaling the MPI ranks, number of nodes, or other parallelism for the spectral grouping, while not making the redrock and afterburner processing too inefficient.
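One way to frame that scaling decision is to cap the ranks per node by the per-rank working set. This is a back-of-the-envelope helper; the node size, reserve, and per-rank figure are assumptions, and real numbers would have to come from profiling the grouping step:

```python
def max_ranks_per_node(node_mem_gb, per_rank_gb, reserve_gb=16.0):
    """Largest rank count whose combined working sets fit on one node.

    `reserve_gb` leaves headroom for the OS and shared buffers; all
    values here are illustrative, not measured.
    """
    usable = node_mem_gb - reserve_gb
    return max(1, int(usable // per_rank_gb))

# e.g. on an assumed 512 GB node with ~30 GB per grouping rank,
# this suggests running 16 ranks per node rather than one per core.
ranks = max_ranks_per_node(512.0, 30.0)
```

Fewer, fatter grouping ranks could then hand off to a wider rank layout for redrock and the afterburners, which have smaller per-rank footprints.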