dask.async.RuntimeError: NetCDF: HDF error on xarray to_netcdf #793
Comments
cc @mrocklin |
I should note that serialization also does not appear to be robust under reshaping the data via |
There are a large number of files (1320) where |
1024 might be a common open file handle limit. Some things to try to isolate the issue:
|
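A quick way to check the file handle hypothesis on a Unix system is to compare the number of input files against the per-process limit. The sketch below is only an illustration, not the script used in this thread: the helper name `files_exceed_limit` is hypothetical, and the count of 1320 is taken from the comment above.

```python
import resource  # Unix-only standard library module


def files_exceed_limit(n_files, soft_limit):
    """Return True if opening n_files at once would reach the soft limit.

    Hypothetical helper for illustration; an unlimited soft limit never fails.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return n_files >= soft_limit


# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# 1320 is the file count mentioned earlier in this thread.
if files_exceed_limit(1320, soft):
    print("1320 simultaneously open files would hit the soft limit")
else:
    print("1320 simultaneously open files fit under the soft limit")
```

If the soft limit turns out to be lower than the file count, raising it (for example with `ulimit -n` in the shell) before rerunning would help separate this cause from threading issues.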
Quick question @mrocklin, for 2, are you proposing a script that just opens all the files, e.g., something like this
where |
Sure. I'm not proposing any particular approach. I'm just supporting your previous idea that maybe the problem is having too many open file handles. It would be good to check this before diving into threading or concurrency issues. |
Agreed. I'll let you know what I find out. Thanks @mrocklin. |
Test 2 passed, so it doesn't appear to be due to too many open file handles. |
@mrocklin, For option 1, should the command be |
Yes, my apologies for the typo. |
I'm pretty sure we now have a thread lock around all writes to NetCDF files, but it's possible that isn't aggressive enough (maybe we can't safely read and write a different file at the same time?). If your script works with synchronous execution I'll take another look. |
I can't fully confirm that the above script works with synchronous execution because the job ran out of its 16-hour run time. However, forcing synchronous execution does appear to resolve the issue: previous runs of the script crashed and this one did not. I'll have to try more cases with synchronous execution, especially over the next half week, to see if I encounter more issues, but I suspect this is the problem. @mrocklin and I noted that the netCDF reader has problems when threading is on when we were using distributed, so this appears to be a likely candidate. We got the same I'm suspicious that the netCDF reader is not thread safe and may not have been compiled as such (http://hdf-forum.184993.n3.nabble.com/Activate-thread-safe-and-enable-cxx-in-HDF5-td2993951.html), but there appear to be other potential issues that could be part of the problem, e.g., Unidata/netcdf4-python#279, because I am doing so many reads. It may also be possible, as you note @shoyer, that the thread locks aren't aggressive enough. It would probably be good to come up with some type of testing strategy to better isolate the problem... I'll have to give this more thought. |
To be clear, we ran into the |
I did a little digging into this and I'm pretty sure the issue here is that HDF5 cannot do multi-threading -- at all. Moreover, many HDF5 builds are not thread safe. Right now, we use a single shared lock for all reads with xarray, but for writes we rely on dask.array.store, which uses a different lock for each array it writes. Because @pwolfram's HDF5 file includes multiple variables, each of these gets written with its own thread lock -- which means we end up writing to the same file simultaneously from multiple threads. So what we could really use here is a |
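The difference between per-array locks and one shared lock can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not xarray or dask code: `max_concurrent_writers` is a hypothetical helper, and `time.sleep` stands in for a chunk write to the shared file.

```python
import threading
import time


def max_concurrent_writers(locks, writes_per_var=20):
    """Run one writer thread per lock and return the peak number of
    threads that were inside the simulated write section at once."""
    active = 0
    peak = 0
    counter_guard = threading.Lock()  # protects the two counters only

    def write_variable(lock):
        nonlocal active, peak
        for _ in range(writes_per_var):
            with lock:  # the lock a writer takes before touching the file
                with counter_guard:
                    active += 1
                    peak = max(peak, active)
                time.sleep(0.001)  # stand-in for writing one chunk
                with counter_guard:
                    active -= 1

    threads = [threading.Thread(target=write_variable, args=(lock,))
               for lock in locks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


# One lock shared by all four variables: writes are serialized.
shared = threading.Lock()
peak_shared = max_concurrent_writers([shared] * 4)

# A separate lock per variable: nothing stops simultaneous writes.
peak_per_var = max_concurrent_writers([threading.Lock() for _ in range(4)])

print("peak writers with shared lock:", peak_shared)
print("peak writers with per-variable locks:", peak_per_var)
```

With the shared lock the peak is always one writer, whereas with per-variable locks several threads can be in the write section together, which is the situation a non-thread-safe HDF5 build cannot tolerate.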
@shoyer, I'm assuming there needs to be an xarray PR corresponding to Matt's merged PR, is that correct? Do you think this will be a difficult xarray change? |
This should be pretty easy -- we'll just need to add The only subtlety is that this needs to be done in a way that is dependent on the version of dask, because the keyword argument is new -- something like |
This fixes an error on an asynchronous write for `to_netcdf` resulting in a `dask.async.RuntimeError: NetCDF: HDF error`. Resolves issue pydata#793 following the dask improvement at dask/dask#1053, on the advice of @shoyer.
Thanks @shoyer! I ran into this problem again this morning and, as you note, I had multiple arrays in the file that were being written. PR #800 implements your suggestion and should hopefully resolve the issue, although it is not clear to me how to build a reproducible test case; perhaps write a file with a ton of random arrays to crash it out on the write? Any thoughts or suggestions you have on this would be very helpful. Note that the PR is preliminary until I can verify that it resolves the issue via testing. |
Note, also waiting on |
I'm going to close this for now but will reopen it if the issue arises again following the dask release. |
Dask appears to be failing on serialization following a call to ds.to_netcdf(), with a NetCDF: HDF error.
Excerpted error below:
Script used: https://gist.github.com/98acaa31a4533b490f78
Full output: https://gist.github.com/248efce774ad08cb1dd6