
Dataset.encoding and unlimited dimensions for to_netcdf #1170

Merged
jhamman merged 19 commits into pydata:master from feature/unlimited_ncdims on Jan 24, 2017

Conversation

jhamman commented Dec 17, 2016

Add Dataset.encoding attribute and support unlimited dimensions for scipy/netcdf4 backends.

closes #992
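
A minimal sketch of the intended round trip (file name and variable names are illustrative):

import numpy as np
import xarray as xr

ds = xr.Dataset({'x': ('time', np.arange(10.0))})
# mark 'time' as an unlimited (record) dimension when writing
ds.encoding = {'unlimited_dims': ['time']}
ds.to_netcdf('out.nc')

# on re-open, unlimited dimensions are surfaced in Dataset.encoding
print(xr.open_dataset('out.nc').encoding.get('unlimited_dims'))  # a set like {'time'}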

@encoding.setter
def encoding(self, value):
    self._encoding = OrderedDict(value)

Member Author:

I'm getting an error here:

In [3]: import xarray as xr
In [4]: ds = xr.open_dataset('simple.nc')

.../xarray/xarray/core/dataset.py in encoding(self)
    291         """Dictionary of global encoding attributes on this dataset
    292         """
--> 293         if self._encoding is None:
    294             self._encoding = OrderedDict()
    295         return self._encoding

.../xarray/xarray/core/common.py in __getattr__(self, name)
    219             # this avoids an infinite loop when pickle looks for the
    220             # __setstate__ attribute before the xarray object is initialized
--> 221             for source in self._attr_sources:
    222                 with suppress(KeyError):
    223                     return source[name]

.../xarray/xarray/core/dataset.py in _attr_sources(self)
    536     def _attr_sources(self):
    537         """List of places to look-up items for attribute-style access"""
--> 538         return [self, LevelCoordinatesSource(self), self.attrs, self.encoding]
    539 
    540     def __contains__(self, key):

Collaborator:

Why do you need this getter & setter? Why not just have self.encoding = OrderedDict(encoding) in the __init__? Because you need to coerce to an OrderedDict?

Member:

It might help to switch if name != '__setstate__' in AttrAccessMixin.__getattr__ to if self._initialized.
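
A minimal sketch of that suggestion (assuming the mixin sets self._initialized = True at the end of __init__; the body mirrors the __getattr__ shown in the traceback above):

def __getattr__(self, name):
    if self._initialized:  # instead of: if name != '__setstate__'
        for source in self._attr_sources:
            with suppress(KeyError):
                return source[name]
    raise AttributeError("%r object has no attribute %r" %
                         (type(self).__name__, name))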

"""Dictionary of global encoding attributes on this dataset
"""
if self._encoding is None:
    self._encoding = OrderedDict()
Member:

I would just make it a normal dict, like Variable.encoding, unless you have a reason why you feel preserving order would be helpful.

@@ -519,7 +535,7 @@ def __deepcopy__(self, memo=None):
@property
def _attr_sources(self):
"""List of places to look-up items for attribute-style access"""
-        return [self, LevelCoordinatesSource(self), self.attrs]
+        return [self, LevelCoordinatesSource(self), self.attrs, self.encoding]
Member:

I don't think this is a good idea. This lets you write things like dataset.unlimited_dims to pull out unlimited dimensions from encoding, but dataset.encoding['unlimited_dims'] is more explicit and this runs the risk of confusing encoding/attrs.
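
That is, with encoding in _attr_sources, both of the following would resolve, which is where the confusion comes from:

ds.encoding['unlimited_dims']  # explicit lookup
ds.unlimited_dims              # implicit attribute-style lookup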


@@ -209,6 +217,8 @@ def __init__(self, filename, mode='r', format='NETCDF4', group=None,
self._opener = opener
self._filename = filename
self._mode = 'a' if mode == 'w' else mode
self._unlimited_dimensions = set()
self.encoding = None
Member:

I don't think you need an encoding property for datastores. Shouldn't _unlimited_dimensions be enough?


jhamman commented Dec 20, 2016

@shoyer - I'm making progress here, although I still need to cleanup/add tests. Do you have an idea what might be causing this test failure though? I'm sure it's something I broke but I'm having trouble connecting the dots...


shoyer commented Dec 23, 2016

So that is definitely a bug with scipy not handling ... for assignment with unlimited dimensions. We could make that work by making a proxy object that implements __setitem__ for Ellipsis by using a slice object or assignValue for scalars, but it would be nice to fix it upstream in scipy.
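
A rough sketch of that proxy idea (hypothetical class, not what was merged):

class ScipyVariableWrapper(object):
    """Translate `var[...] = value` into assignments that
    scipy.io.netcdf variables accept."""
    def __init__(self, variable):
        self.variable = variable

    def __setitem__(self, key, value):
        if key is Ellipsis:
            if self.variable.shape == ():
                # scalar variables use assignValue in scipy.io.netcdf
                self.variable.assignValue(value)
            else:
                self.variable[:] = value
        else:
            self.variable[key] = value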

try:
    target[...] = source
except TypeError:
    # workaround for GH: scipy/scipy#6880
    target[slice(None, None, None)] = source
Member Author:

@shoyer - I raised the Ellipsis issue with the scipy folks: scipy/scipy#6880. This seems to work although maybe it isn't the most robust solution, I don't know.

Member:

Yes, this seems reasonable to me. I would just write target[:] = source, though -- no need to manually make the slice() object. This should work because variables with unlimited dimensions will never be scalars.
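
That is, the workaround would simplify to:

try:
    target[...] = source
except TypeError:
    # workaround for GH: scipy/scipy#6880
    target[:] = source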


jhamman commented Dec 27, 2016

This is ready for a full review.

shoyer left a comment:

From a user facing API perspective, it could be nice to also add unlimited_dims as a keyword argument to Dataset.to_netcdf
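
For example, the suggested call would look like this (the to_netcdf docstring hunk further down shows the keyword did land later in this PR):

ds.to_netcdf('out.nc', unlimited_dims=['time'])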

self.set_necessary_dimensions(variable)
unlimited_dims = self.encoding.get('unlimited_dims', set())
if len(unlimited_dims) > 0:
    warnings.warn('h5netcdf does not support unlimited dimensions',
Member:

If check_encoding is True, this should raise an error, not just a warning.

Member:

Actually, check_encoding is specific to variable encoding. Raising an error would make sense if you set unlimited_dims via an argument in to_netcdf (which is not yet possible).

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k, v in self.ds.dimensions.items() if v.isunlimited()])
Member:

you can use a set comprehension here, e.g.,

encoding['unlimited_dims'] = {k for k, v in self.ds.dimensions.items()
                              if v.isunlimited()}

@@ -306,7 +306,7 @@ class Dataset(Mapping, ImplementsDatasetReduce, BaseDataObject,
groupby_cls = groupby.DatasetGroupBy

def __init__(self, data_vars=None, coords=None, attrs=None,
-                 compat='broadcast_equals'):
+                 compat='broadcast_equals', encoding=None):
Member:

Needs to be documented in the docstring.

Or we could even remove it entirely from the constructor and require setting the property.

Member Author:

I think I'd like to keep it here. Oftentimes I end up batch-constructing objects with something like xr.Dataset(..., **kwargs) where kwargs = {..., 'encoding': {'unlimited_dims': ['time']}}.

Also, since we include encoding in the DataArray constructor, it seems appropriate to have it here as well.

Member:

I think consistency makes for a good argument here, though I would consider deprecating the encoding argument to DataArray instead. It would also make sense to get rid of the compat argument to Dataset.

These extra arguments are not part of the fundamental xarray data model and thus are a little distracting, especially to new users.

Member Author (Dec 28, 2016):

I'm going to open a separate issue to discuss this. See #1188.

Member Author:

I've removed the encoding keyword arg for now.

"""Dictionary of global encoding attributes on this dataset
"""
if self._encoding is None:
    self._encoding = dict()
Member:

could use self._encoding = {}

jhamman changed the title "WIP: Unlimited dimensions from to_netcdf" → "Dataset.encoding and unlimited dimensions for to_netcdf" on Dec 28, 2016

jhamman commented Dec 28, 2016

@shoyer - comments addressed and green.

shoyer left a comment:

Generally looks fine, but I am concerned about stateful storage of data directly on DataStore objects (see below).

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k in self.ds.dimensions if self.ds.unlimited(k)])
Member:

I don't think dap can represent unlimited dimensions:
http://docs.opendap.org/index.php/DAP4:_Specification_Volume_1#Dimensions

Member Author:

Agreed, but this is pynio which does: https://www.pyngl.ucar.edu/whatsnew.shtml#Version1.4.1

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k, v in self.ds.dimensions.items() if v is None])
Member:

can use the same set comprehension you switched to in netCDF4_.py
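
That is, mirroring the netCDF4 version but with scipy's `v is None` test for unlimited dimensions:

encoding['unlimited_dims'] = {k for k, v in self.ds.dimensions.items()
                              if v is None}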

def get_dimensions(self):
    self._unlimited_dimensions = self._get_unlimited_dimensions()
Member:

you don't use this currently

@@ -914,12 +929,18 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
    Nested dictionary with variable names as keys and dictionaries of
    variable specific encodings as values, e.g.,
    ``{'my_variable': {'dtype': 'int16', 'scale_factor': 0.1, 'zlib': True}, ...}``
unlimited_dims : str or sequence of str, optional
Member:

I think it needs to be a sequence of str, not a str.

@@ -251,6 +251,12 @@ def get_dimensions(self):
return FrozenOrderedDict((k, len(v))
                         for k, v in iteritems(self.ds.dimensions))

def get_encoding(self):
Member:

I would lean slightly toward just creating a get_unlimited_dims method rather than get_encoding, unless we can think of other Dataset wide encodings we might possibly add in the future.

Member Author:

The other encoding value that comes to mind is the dataset format (e.g. NETCDF4 vs. NETCDF3). Maybe there are others as well, but nothing else comes to mind.

@@ -565,8 +565,11 @@ def to_netcdf(dataset, path=None, mode='w', format=None, group=None,
sync = writer is None

store = store_cls(path, mode, format, group, writer)
# Copy dataset encoding to datastore
store.encoding = dataset.encoding
Member:

Do we ever actually use this encoding state on the datastore? If not, let's not bother setting it. I think everything necessary ends up being passed on via set_variables.

Note that as much as possible, I've tried to make DataStore itself stateless, only storing state in the file-like object it points to.

Member Author:

We were using this but I've refactored to avoid it.

@@ -96,8 +98,9 @@ def load(self):
This function will be called anytime variables or attributes
are requested, so care should be taken to make sure it's fast.
"""
variables = FrozenOrderedDict((_decode_variable_name(k), v)
                              for k, v in iteritems(self.get_variables()))
self.encoding = self.get_encoding()
Member:

This is a little dangerous -- .load() needs to be called in order to guarantee a consistent encoding state on a DataStore. I would rather we didn't set such state, and simply pulled this information out of the file linked to the DataStore as necessary.

Member Author:

Fair point, I've removed the encoding attribute on the DataStore.


jhamman commented Jan 21, 2017

@shoyer - all comments addressed and tests are passing.

@@ -62,6 +62,7 @@ class PydapDataStore(AbstractDataStore):
def __init__(self, url):
import pydap.client
self.ds = pydap.client.open_url(url)
self.encoding = {}
Member:

delete

@@ -42,6 +42,7 @@ def __init__(self, filename, mode='r'):
self.ds = opener()
self._opener = opener
self._mode = mode
self.encoding = {}
Member:

delete

@@ -102,6 +102,7 @@ def __init__(self, filename_or_obj, mode='r', format=None, group=None,
self.ds = opener()
self._opener = opener
self._mode = mode
self.encoding = {}
Member:

delete

@@ -21,6 +21,7 @@ class InMemoryDataStore(AbstractWritableDataStore):
def __init__(self, variables=None, attributes=None, writer=None):
self._variables = OrderedDict() if variables is None else variables
self._attributes = OrderedDict() if attributes is None else attributes
self.encoding = {}
Member:

delete

@@ -116,9 +117,18 @@ def get_variables(self):
def get_attrs(self):
    return Frozen(_decode_attrs(self.ds._attributes))

def _get_unlimited_dimensions(self):
Member:

I don't think you use this method anymore

@@ -58,6 +59,7 @@ def __init__(self, filename, mode='r', format=None, group=None,
self._opener = opener
self._filename = filename
self._mode = mode
self.encoding = {}
Member:

This still should go away :)

ds = Dataset({'x': ('y', np.arange(10.0))})
ds.encoding = {'unlimited_dims': ['y']}
with pytest.warns(UserWarning):
    ds.to_netcdf('foo-bar.nc', engine='h5netcdf')
Member:

use the create_tmp_file() context manager to ensure this file gets cleaned up
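
For example (create_tmp_file is the context manager already used elsewhere in xarray's backend tests):

with create_tmp_file() as tmp_file:
    with pytest.warns(UserWarning):
        ds.to_netcdf(tmp_file, engine='h5netcdf')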

jhamman merged commit 6d5ad44 into pydata:master on Jan 24, 2017
jhamman deleted the feature/unlimited_ncdims branch on January 24, 2017 06:38

jhamman commented Jan 24, 2017

In we go. Thanks @shoyer for putting up with my short attention span on this one.

Linked issue: Creating unlimited dimensions with xarray.Dataset.to_netcdf (#992)