Segmentation fault when writing to netcdf with dask-enabled xarray dataset #1172

Closed
hdail opened this issue Dec 19, 2016 · 6 comments

hdail commented Dec 19, 2016

I have a 4 GB netCDF file and am running on a machine with 32 GB of memory. The following works just fine, without error, on this large-memory machine:

import xarray

ds = xarray.open_dataset('input.nc')
ds.to_netcdf('output.nc')

This dask + pynio approach also works correctly:

ds = xarray.open_dataset('input.nc', chunks={'a': 25, 'b': 25}, engine='pynio')
ds.to_netcdf('output.nc')

But the following dask + default engine (netcdf4, probably?) approach slowly consumes all the system memory, writes out a file twice as large as it should be, with variable values that are extremely large, and then fails with a seg fault, bus error, or other low-level system errors we'd rather not be seeing in Python!

ds = xarray.open_dataset('input.nc', chunks={'a': 25, 'b': 25})
ds.to_netcdf('output.nc')

Adding lock=True to the open_dataset call does not help.
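Concretely, that variant looked roughly like this (lock passed straight through as a keyword to open_dataset):

ds = xarray.open_dataset('input.nc', chunks={'a': 25, 'b': 25}, lock=True)
ds.to_netcdf('output.nc')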

I have two workable solutions to my problem (run without dask, since I have a lot of memory available, or use engine='pynio'), but this error was hard to track down, so I thought you would want to know. I'd also be glad to hear if I missed something in the docs and the all-too-common user error is to blame =)

hdail commented Dec 19, 2016

An important addition -- the following also causes low-level system errors (bus error or seg fault, I can't remember which), so the problem does not originate in to_netcdf per se, but rather in the chunking / loading of the dataset.

ds = xarray.open_dataset('input.nc', chunks={'a': 25, 'b': 25})
ds.load()

shoyer commented Dec 19, 2016

@hdail thanks for the report! Which version of xarray are you using?

We fixed something that sounds pretty similar (#936) in v0.8.2.

hdail commented Dec 19, 2016

I'm using 0.8.2. Thanks for the issue link; I had read through that, but since I am not using open_mfdataset and lock=True did not fix my issue, I figured my problem was subtly different. Perhaps some incompatibility / race condition when using netcdf4 and dask together? This might be a tricky problem to track down, as my code did complete without seg faulting when my dimensions were subtly different (about 10% smaller in space and 20% smaller in time), even on a server with half as much memory. Bleck.

shoyer commented Dec 19, 2016

Yes, this is different.

I think this is a bug in how we write netCDF files. Currently, we always use a new thread lock in ArrayWriter.sync(). To avoid possible concurrency issues with the HDF5 API, we really should be reusing the same _default_lock that we use for reading netCDF files.
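A rough sketch of the idea (illustrative only, not the actual ArrayWriter code; shared_hdf5_lock stands in for the _default_lock mentioned above), using dask.array.store with an explicit lock object:

import threading

import dask.array as da
import numpy as np

# One lock shared by every operation that goes through the HDF5 library.
shared_hdf5_lock = threading.Lock()

source = da.from_array(np.arange(100.0).reshape(10, 10), chunks=(5, 5))
target = np.empty((10, 10))

# Writing with the *same* lock object used for reads serializes all HDF5
# access, instead of creating a fresh threading.Lock() for each write.
da.store(source, target, lock=shared_hdf5_lock)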

hdail commented Dec 19, 2016

Thanks for the info! Given this potential bug, is the engine='pynio' solution acceptable, or is it just working for me for now and might fail with some subtly different configuration / data size?

Another possible solution, suggested by a colleague: add the following at the top to enforce single-threaded reads and writes.

dask.set_options(get=dask.async.get_sync)
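In context, that workaround would look roughly like this (assuming the dask of that era; dask.async was later renamed to dask.local, so the exact spelling below only applies to older dask versions):

import dask
import dask.async  # home of the synchronous scheduler in dask at the time
import xarray

# Run every dask graph on the single-threaded synchronous scheduler, so
# netCDF4/HDF5 calls never execute concurrently.
dask.set_options(get=dask.async.get_sync)

ds = xarray.open_dataset('input.nc', chunks={'a': 25, 'b': 25})
ds.to_netcdf('output.nc')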

shoyer commented Dec 19, 2016

It is possible that pynio is linking to an independent HDF5 installation, which should eliminate the need for a shared lock. But if that's not the case, then you probably just got lucky.

shoyer added the bug label on Dec 20, 2016
shoyer added a commit to shoyer/xarray that referenced this issue on Dec 22, 2016:
Switch to shared Lock (SerializableLock if possible) for reading and writing

Fixes pydata#1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref pydata#798, pydata#1173, among others).
shoyer added a commit that referenced this issue on Jan 4, 2017:
Switch to shared Lock (SerializableLock if possible) for reading and writing (#1179)

* Switch to shared Lock (SerializableLock if possible) for reading and writing

Fixes #1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref #798, #1173, among others).

* Test serializable lock

* Use conda-forge for builds

* remove broken/fragile .test_lock