Saving hangs #258
Note that if I just let the job above run, it runs out of memory. Perhaps that is the fundamental issue, and I'll try with more memory.
To diagnose this better, could you show the full repr of your dataset and report the variable total size and chunk size? If you are using dask, the dask dashboard is invaluable for debugging.
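For reference, gathering that information might look something like the sketch below; the file path and chunking are placeholders, not the actual setup from this thread.

```python
import xarray as xr
from dask.distributed import Client

client = Client()                 # starts a local cluster; the dashboard runs alongside it
print(client.dashboard_link)      # open this URL in a browser to watch memory and tasks

ds = xr.open_dataset("output.nc", chunks={"time": 1})  # placeholder file and chunking

print(ds)                                        # the full repr: dims, coords, variables, dtypes
print(f"total size: {ds.nbytes / 1e9:.2f} GB")   # total size of the dataset
for name, var in ds.data_vars.items():
    print(name, var.shape, var.chunks)           # per-variable shape and dask chunk layout
```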
Oh, it's not a context manager? Oops. Upping the memory to 16 GB does fix the problem. However, I am concerned about what will happen with more iterations. I can definitely share the data if this continues to be an issue, but thanks for the suggestions!
Please post the repr.
So each chunk is 614 MB. That's pretty big but should work fine if you have enough memory. Since the data are chunked in time, theoretically the writing should execute in a streaming manner and not require more memory for more timesteps. A couple of suggestions:
OK, I'll look into the dashboard. As an experiment, running with 16 GB completed the write in 17 s. Dropping to 8 GB required 50 s. These memory caps are set by the cluster; I need to request a memory size for shared nodes. If I only write 2 out of 3 files, then it takes 17 s with 8 GB. So, something funny is happening with memory. I'll give zarr a try; I've just been using netcdf because...
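A minimal sketch of the zarr experiment mentioned here, with placeholder paths:

```python
import xarray as xr

ds = xr.open_dataset("state.nc", chunks={"time": 1})  # placeholder input file
ds.to_zarr("state.zarr", mode="w")                     # each chunk is written independently
```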
Another good debugging technique is to just do a reduction, e.g.
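The specific reduction did not survive in this copy of the thread; it was presumably something along these lines (placeholder path; any cheap aggregation will do):

```python
import xarray as xr

ds = xr.open_dataset("state.nc", chunks={"time": 1})  # placeholder path

# Force every read and computation to run, but skip the netCDF write entirely.
print(ds.mean().compute())
```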
OK, I'll make a bigger data set and report whether the memory needs stay the same or grow. zarr seems to work about the same. I can imagine it is a lot better for parallel writes, but as a portable format to transfer somewhere else it seems suboptimal, because I would guess you want a tar first? (Sorry if this is turning into a chat - feel free to return to whatever else you might have been doing; I think I'm OK for now ;-))
Saving as zarr works. Just using xarray, loading from zarr, and then trying to save as netcdf fails on this machine. So I think we can safely rule out xmitgcm. Thanks for the help!
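A sketch of that failing round trip, with placeholder paths:

```python
import xarray as xr

ds = xr.open_zarr("state.zarr")   # loading the zarr store back works fine
ds.to_netcdf("state_copy.nc")     # converting to netCDF is the step that hangs on this machine
```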
NetCDF / HDF5 does not play particularly well with distributed writing. It should be possible, but there are lots of things that can go wrong. That was one of the motivating factors that inspired the creation of Zarr in the first place. If you want to store zarr in a single file, .zip is the way to go. Zarr supports reading / writing directly to zip files (although locking is required for writes).
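A sketch of the single-file zip approach, assuming zarr v2's ZipStore and placeholder paths:

```python
import xarray as xr
import zarr

ds = xr.open_dataset("state.nc", chunks={"time": 1})   # placeholder input

# Write: the ZipStore must be closed explicitly once the write finishes.
store = zarr.ZipStore("state.zarr.zip", mode="w")
ds.to_zarr(store)
store.close()

# Read back directly from the zip file.
ds2 = xr.open_zarr(zarr.ZipStore("state.zarr.zip", mode="r"))
```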
If I try to save a multi-file xmitgcm dataset to disk, it hangs and has to be killed.
The script is reading three time steps:
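The script itself is not reproduced in this copy of the thread; its shape was roughly the following, with placeholder paths and iteration numbers:

```python
import xmitgcm

# Open three iterations of MDS output as a single dataset (values are placeholders).
ds = xmitgcm.open_mdsdataset(
    "./run",                  # directory containing the .data/.meta files
    iters=[72, 144, 216],     # the three time steps
    delta_t=900,              # model time step in seconds
)

# Writing the combined dataset to a single netCDF file is the step that hangs.
ds.to_netcdf("combined.nc")
```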
This routinely hangs, and if I kill it I get:
If I select the three time steps individually and save each one separately, then everything works fine; i.e. a loop like the sketch below
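(a hedged reconstruction; the dimension name time and the output file names are assumptions, and the placeholder values match the sketch above):

```python
import xmitgcm

ds = xmitgcm.open_mdsdataset("./run", iters=[72, 144, 216], delta_t=900)  # placeholders

# Workaround: write each time slice to its own file instead of one combined file.
for i in range(ds.sizes["time"]):
    ds.isel(time=i).to_netcdf(f"slice_{i:02d}.nc")
```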
saves three files, one for each time slice. That is an acceptable workaround, but not what I have done on other clusters.
Note xarray seems to be having a problem with file/thread locks as well: pydata/xarray#3961, so I wonder if this is the same thing. Using pure xarray and netcdf files, if I set lock=False things tend to work better, but I don't see a similar flag for xmitgcm. (Sorry for two issues in one day - new cluster ;-)
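For the pure-xarray case, that workaround looks roughly like the following; the file pattern is a placeholder, and exactly where the lock keyword is accepted depends on the xarray version in use:

```python
import xarray as xr

# Disable xarray's per-file locking when reading (keyword availability varies by version).
ds = xr.open_mfdataset("run/*.nc", combine="by_coords", lock=False)
ds.to_netcdf("combined.nc")
```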