Use h5netcdf to read and write netcdf data #786

Open
mraspaud opened this issue May 21, 2019 · 12 comments
Labels
component:readers component:writers enhancement code enhancements, features, improvements

Comments

@mraspaud
Member

Feature Request

Is your feature request related to a problem? Please describe.
At the moment, satpy uses two engines for handling io on netcdf files: netCDF4, which is a python interface to the netcdf4 C library, and h5netcdf, which uses h5py to read and write nc files. While both engines seem to be working, it is unnecessary to use both, and a harmonisation within satpy would be nice.

Describe the solution you'd like
Using only one engine for nc I/O would be best. netCDF4 is the official library from Unidata. However, it uses a C library in the background that is known for not interacting well with the C hdf5 library. h5netcdf uses h5py, which in turn uses the hdf5 C library, hence removing the need for the C netcdf library. h5netcdf has been reported to be faster in some cases, but might not be fully mature.

My opinion is that limiting the number of C libraries is a good thing, and relying on only one C library for reading both netcdf and hdf5 is to be preferred. The h5netcdf project seems to be active and responsive, so any problems we might encounter with reading data with it should be fixed rapidly.

Describe any changes to existing user workflow
Hopefully, having only one interface to the netcdf format will simplify the installation of satpy, and should be totally transparent to the user.

Additional context
The h5netcdf project: https://github.com/shoyer/h5netcdf
The netCDF4 project: http://unidata.github.io/netcdf4-python/netCDF4/index.html

@mraspaud mraspaud added component:readers component:writers enhancement code enhancements, features, improvements labels May 21, 2019
@djhoese
Member

djhoese commented May 21, 2019

I'm glad we're discussing this, but sadly I have the opposite opinion and think that netCDF4-python should be the default. Here are some reasons for my preference:

  1. netcdf4 is more common on scientific platforms and is the default engine used by xarray. Given that xarray and h5netcdf are both written by shoyer, I would assume there is a reason for that. I'm not sure we should differ from xarray.
  2. The C NetCDF4 library can be compiled to read HDF4 files as well. This is something I depend on in the geocat reader.
  3. The C NetCDF4 library supports more than on-disk files, for example OpenDAP, and I've heard rumors of support for reading from cloud/block storage systems like Google Cloud Storage. I'm not sure the h5py or HDF5 C libraries are the components implementing this.

Some opinions:

  1. If you are already installing one C library, is it really that hard to install another? With conda/conda-forge this has become less of a pain.
  2. I've never used the "new" API provided by h5netcdf, but if it is much more useable than netcdf4-python then I'd be more open to using it. However, in most cases we should be using xarray by default.

Otherwise, could you enumerate the issues you had with NetCDF4/HDF5 C libraries both being used and whether or not you've experienced them any time recently?

@mraspaud
Member Author

mraspaud commented May 21, 2019

For reference, here is the bug report for the netcdf/hdf5 interaction problem we had a few years ago:
https://groups.google.com/forum/#!topic/h5py/AZQ30pSy-RI

According to this conversation, the issue can't be solved before hdf5 1.10.x. We now have 1.8.12 in operations; however, I can't reproduce the issue.

@sfinkens also had a concern about getting the two C libs to install, right?

@djhoese
Member

djhoese commented May 21, 2019

@mraspaud You are getting HDF5 in your current operations from the system libraries, right? In the future you will be using conda, right (not that this makes this a non-issue, just double checking)?

@mraspaud
Member Author

Yes, it's system packages we have. In operations, we will use conda for satpy mostly, but I can't exclude having to use system packages.

@djhoese
Member

djhoese commented May 21, 2019

A +1 for h5netcdf, in xarray it says that only scipy/h5netcdf backends can read byte streams and file like objects:

        if engine not in [None, 'scipy', 'h5netcdf']:
            raise ValueError("can only read bytes or file-like objects "
                             "with engine='scipy' or 'h5netcdf'")

This is in xarray/backends/api.py:open_dataset.
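
For callers, the check quoted above means that in-memory data has to be opened with one of the listed engines. The following is a minimal sketch under that assumption; `open_netcdf_bytes` and `FILELIKE_ENGINES` are hypothetical names used only for illustration, and the guard simply mirrors the quoted snippet rather than reproducing xarray's actual code path.

```python
import io

# Engines that, per the xarray check quoted above, accept bytes or
# file-like objects. This tuple is an illustration, not an xarray API.
FILELIKE_ENGINES = (None, "scipy", "h5netcdf")


def open_netcdf_bytes(data, engine="h5netcdf"):
    """Open an in-memory NetCDF file with xarray (hypothetical helper)."""
    if engine not in FILELIKE_ENGINES:
        raise ValueError(
            "can only read bytes or file-like objects "
            "with engine='scipy' or 'h5netcdf'"
        )
    # Imported lazily so the engine check itself works without xarray.
    import xarray as xr

    # Wrap the raw bytes in a file-like object and hand it to xarray.
    return xr.open_dataset(io.BytesIO(data), engine=engine)
```

Calling `open_netcdf_bytes(raw_bytes)` would then go through h5netcdf, while `engine="netcdf4"` is rejected up front, matching the behaviour described in the quoted check.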

@djhoese
Member

djhoese commented May 21, 2019

From Ryan May of Unidata (not tagging so that he doesn't get a ton of notifications) when asked on the pangeo-data gitter channel:

The Unidata netCDF team is working on adding official support for better block storage support, starting with Zarr, in the netCDF C library. You'll want to reach out to the mailing list or support email for more information--I don't think there are any published branch/PRs for that support. I think it's still at the spec stage ATM. Open development models are still something we're working on figuring out here.

@mraspaud
Member Author

Looks like (some) cloud storage is available with xarray and h5netcdf: https://gist.github.com/rsignell-usgs/cc2d2d4fe1930bd949119e543b56bce1

@sfinkens
Member

I have a slight tendency towards h5netcdf, because my experience is that linking the netCDF C library correctly against a compatible hdf5 C library is the main difficulty. And that does not only apply to expert users who compile the libraries themselves. Some time ago I had the problem that the netCDF4 and h5py wheels installed by pip were built with incompatible versions of the C libraries: Unidata/netcdf4-python#694. Issues like that could certainly be avoided if we only depended on hdf5.
Furthermore, the attribute access looks prettier in h5netcdf 😄 But as Martin said, it's probably not as mature as netCDF4, yet.

Maybe we should ask shoyer why netCDF4 is the default engine in xarray?

A +1 for h5netcdf, in xarray it says that only scipy/h5netcdf backends can read byte streams and file like objects

@djhoese netCDF4-1.2.8+ supports file-like objects, too. Maybe that hasn't been implemented in xarray, yet.
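
One way to spot the kind of wheel mismatch mentioned above is to compare the HDF5 version each Python binding reports. This is only a diagnostic sketch: `hdf5_versions` is a hypothetical helper, and it assumes the version attributes exposed by h5py (`h5py.version.hdf5_version`) and netCDF4-python (`netCDF4.__hdf5libversion__`).

```python
def hdf5_versions():
    """Report the HDF5 C library version seen by h5py and netCDF4.

    A value of None means the corresponding package is not installed.
    """
    versions = {"h5py": None, "netCDF4": None}
    try:
        import h5py

        versions["h5py"] = h5py.version.hdf5_version
    except ImportError:
        pass
    try:
        import netCDF4

        versions["netCDF4"] = netCDF4.__hdf5libversion__
    except ImportError:
        pass
    return versions


if __name__ == "__main__":
    # Differing version strings hint that the two wheels bundle
    # different HDF5 builds.
    print(hdf5_versions())
```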

@djhoese
Member

djhoese commented May 22, 2019

Some time ago I had the problem that the netCDF4 and h5py wheels installed by pip were built with incompatible versions of the C libraries: Unidata/netcdf4-python#694.

To me this sounds like something that should be reported and coordinated between the two projects.

From everything I'm gathering it sounds like h5netcdf is useful in a few key cases:

  1. You don't want to or can't install the NetCDF C library.
  2. You need better performance in the cases where h5netcdf actually performs better than netcdf4.

As for the API, in what cases are we using h5netcdf's new API directly rather than xarray, and in which of those cases is xarray not an option?

@sfinkens
Member

sfinkens commented May 22, 2019

The CF writer tests mostly use h5netcdf to read the generated files. But I guess that can be replaced with xarray. Unless there was a particular reason not to use xarray?

@djhoese
Member

djhoese commented May 22, 2019

The test environments have netcdf4-python installed so they could use that instead if needed. Reading the NetCDF file for verification with xarray may not be a good idea because of the way that xarray handles coordinates (it ignores the coordinates attribute when determining per-variable coordinates).

@mraspaud
Member Author

I always try to use the legacy api of h5netcdf just to be able to switch to netCDF4 in case of trouble, so I actually haven't really looked at the new features unfortunately.
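
The switch described above can be sketched as an import fallback. This is an illustration, not code from satpy: `first_importable` is a hypothetical helper, and it assumes that `h5netcdf.legacyapi` and `netCDF4` both expose a compatible `Dataset` class, which is the stated purpose of the legacy API.

```python
import importlib


def first_importable(names):
    """Return the first module in names that imports cleanly, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Missing package or unloadable C library: try the next one.
            continue
    return None


# Prefer the h5netcdf legacy API, fall back to netCDF4-python.
nc = first_importable(("h5netcdf.legacyapi", "netCDF4"))

# With either backend, reading code can then be written once, e.g.:
#     with nc.Dataset("some_file.nc", "r") as ds:
#         print(list(ds.variables))
```

Swapping the order of the tuple is then enough to change the preferred engine.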
