
Dataset.encoding and unlimited dimensions for to_netcdf #1170

Merged
jhamman merged 19 commits into pydata:master from feature/unlimited_ncdims on Jan 24, 2017

Conversation

jhamman commented Dec 17, 2016

Add Dataset.encoding attribute and support unlimited dimensions for scipy/netcdf4 backends.

closes #992
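
A minimal sketch of the intended round trip (file name and variable names are illustrative):

import numpy as np
import xarray as xr

ds = xr.Dataset({'x': ('time', np.arange(10.0))})
# mark 'time' as an unlimited (record) dimension when writing
ds.encoding = {'unlimited_dims': ['time']}
ds.to_netcdf('out.nc')

# on re-open, unlimited dimensions are surfaced in Dataset.encoding
print(xr.open_dataset('out.nc').encoding.get('unlimited_dims'))  # a set like {'time'}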

@encoding.setter
def encoding(self, value):
    self._encoding = OrderedDict(value)

Member Author:

I'm getting an error here:

In [3]: import xarray as xr
In [4]: ds = xr.open_dataset('simple.nc')

.../xarray/xarray/core/dataset.py in encoding(self)
    291         """Dictionary of global encoding attributes on this dataset
    292         """
--> 293         if self._encoding is None:
    294             self._encoding = OrderedDict()
    295         return self._encoding

.../xarray/xarray/core/common.py in __getattr__(self, name)
    219             # this avoids an infinite loop when pickle looks for the
    220             # __setstate__ attribute before the xarray object is initialized
--> 221             for source in self._attr_sources:
    222                 with suppress(KeyError):
    223                     return source[name]

.../xarray/xarray/core/dataset.py in _attr_sources(self)
    536     def _attr_sources(self):
    537         """List of places to look-up items for attribute-style access"""
--> 538         return [self, LevelCoordinatesSource(self), self.attrs, self.encoding]
    539 
    540     def __contains__(self, key):

Collaborator:

Why do you need this getter & setter? Why not just have self.encoding = OrderedDict(encoding) in the __init__? Because you need to coerce to an OrderedDict?

Member:

It might help to switch if name != '__setstate__' in AttrAccessMixin.__getattr__ to if self._initialized.
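
A minimal sketch of that suggestion (assuming the mixin sets self._initialized = True at the end of __init__; the body mirrors the __getattr__ shown in the traceback above):

def __getattr__(self, name):
    if self._initialized:  # instead of: if name != '__setstate__'
        for source in self._attr_sources:
            with suppress(KeyError):
                return source[name]
    raise AttributeError("%r object has no attribute %r" %
                         (type(self).__name__, name))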

"""Dictionary of global encoding attributes on this dataset
"""
if self._encoding is None:
    self._encoding = OrderedDict()
Member:

I would just make it a normal dict, like Variable.encoding, unless you have a reason why you feel preserving order would be helpful.

@@ -519,7 +535,7 @@ def __deepcopy__(self, memo=None):
@property
def _attr_sources(self):
"""List of places to look-up items for attribute-style access"""
-        return [self, LevelCoordinatesSource(self), self.attrs]
+        return [self, LevelCoordinatesSource(self), self.attrs, self.encoding]
Member:

I don't think this is a good idea. This lets you write things like dataset.unlimited_dims to pull out unlimited dimensions from encoding, but dataset.encoding['unlimited_dims'] is more explicit and this runs the risk of confusing encoding/attrs.
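
That is, with encoding in _attr_sources, both of the following would resolve, which is where the confusion comes from:

ds.encoding['unlimited_dims']  # explicit lookup
ds.unlimited_dims              # implicit attribute-style lookup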


@@ -209,6 +217,8 @@ def __init__(self, filename, mode='r', format='NETCDF4', group=None,
self._opener = opener
self._filename = filename
self._mode = 'a' if mode == 'w' else mode
self._unlimited_dimensions = set()
self.encoding = None
Member:

I don't think you need an encoding property for datastores. Shouldn't _unlimited_dimensions be enough?


jhamman commented Dec 20, 2016

@shoyer - I'm making progress here, although I still need to cleanup/add tests. Do you have an idea what might be causing this test failure though? I'm sure it's something I broke but I'm having trouble connecting the dots...


shoyer commented Dec 23, 2016

So that is definitely a bug with scipy not handling ... for assignment with unlimited dimensions. We could make that work by making a proxy object that implements __setitem__ for Ellipsis by using a slice object or assignValue for scalars, but it would be nice to fix it upstream in scipy.
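
A rough sketch of that proxy idea (hypothetical class, not what was merged):

class ScipyVariableWrapper(object):
    """Translate `var[...] = value` into assignments that
    scipy.io.netcdf variables accept."""
    def __init__(self, variable):
        self.variable = variable

    def __setitem__(self, key, value):
        if key is Ellipsis:
            if self.variable.shape == ():
                # scalar variables use assignValue in scipy.io.netcdf
                self.variable.assignValue(value)
            else:
                self.variable[:] = value
        else:
            self.variable[key] = value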

try:
    target[...] = source
except TypeError:
    # workaround for GH: scipy/scipy#6880
    target[slice(None, None, None)] = source
Member Author:

@shoyer - I raised the Ellipsis issue with the scipy folks: scipy/scipy#6880. This seems to work although maybe it isn't the most robust solution, I don't know.

Member:

Yes, this seems reasonable to me. I would just write target[:] = source, though -- no need to manually make the slice() object. This should work because variables with unlimited dimensions will never be scalars.
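
That is, the workaround would simplify to:

try:
    target[...] = source
except TypeError:
    # workaround for GH: scipy/scipy#6880
    target[:] = source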


jhamman commented Dec 27, 2016

This is ready for a full review.

shoyer left a comment:

From a user facing API perspective, it could be nice to also add unlimited_dims as a keyword argument to Dataset.to_netcdf
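
For example, the suggested call would look like this (the to_netcdf docstring hunk further down shows the keyword did land later in this PR):

ds.to_netcdf('out.nc', unlimited_dims=['time'])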

self.set_necessary_dimensions(variable)
unlimited_dims = self.encoding.get('unlimited_dims', set())
if len(unlimited_dims) > 0:
    warnings.warn('h5netcdf does not support unlimited dimensions',
Member:

If check_encoding is True, this should raise an error, not just a warning.

Member:

Actually, check_encoding is specific to variable encoding. Raising an error would make sense if you set unlimited_dims via an argument in to_netcdf (which is not yet possible).

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k, v in self.ds.dimensions.items() if v.isunlimited()])
Member:

you can use a set comprehension here, e.g.,

encoding['unlimited_dims'] = {k for k, v in self.ds.dimensions.items()
                              if v.isunlimited()}

@@ -306,7 +306,7 @@ class Dataset(Mapping, ImplementsDatasetReduce, BaseDataObject,
groupby_cls = groupby.DatasetGroupBy

def __init__(self, data_vars=None, coords=None, attrs=None,
-                 compat='broadcast_equals'):
+                 compat='broadcast_equals', encoding=None):
Member:

Needs to be documented in the docstring.

Or we could even remove it entirely from the constructor and require setting the property.

Member Author:

I think I'd like to keep it here. Oftentimes I end up batch-constructing objects with something like xr.Dataset(..., **kwargs) where kwargs = {..., 'encoding': {'unlimited_dims': ['time']}}.

Also, since we include encoding in the DataArray constructor, it seems appropriate to have it here as well.

Member:

I think consistency makes for a good argument here, though I would consider deprecating the encoding argument to DataArray instead. It would also make sense to get rid of the compat argument to Dataset.

These extra arguments are not part of the fundamental xarray data model and thus are a little distracting, especially to new users.

Member Author (Dec 28, 2016):

I'm going to open a separate issue to discuss this. See #1188.

Member Author:

I've removed the encoding keyword arg for now.

"""Dictionary of global encoding attributes on this dataset
"""
if self._encoding is None:
    self._encoding = dict()
Member:

could use self._encoding = {}

jhamman changed the title "WIP: Unlimited dimensions from to_netcdf" → "Dataset.encoding and unlimited dimensions for to_netcdf" on Dec 28, 2016

jhamman commented Dec 28, 2016

@shoyer - comments addressed and green.

shoyer left a comment:

Generally looks fine, but I am concerned about stateful storage of data directly on DataStore objects (see below).

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k in self.ds.dimensions if self.ds.unlimited(k)])
Member:

I don't think dap can represent unlimited dimensions:
http://docs.opendap.org/index.php/DAP4:_Specification_Volume_1#Dimensions

Member Author:

Agreed, but this is pynio which does: https://www.pyngl.ucar.edu/whatsnew.shtml#Version1.4.1

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k, v in self.ds.dimensions.items() if v is None])
Member:

can use the same set comprehension you switched to in netCDF4_.py
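
That is, mirroring the netCDF4 version but with scipy's `v is None` test for unlimited dimensions:

encoding['unlimited_dims'] = {k for k, v in self.ds.dimensions.items()
                              if v is None}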

def get_dimensions(self):
    self._unlimited_dimensions = self._get_unlimited_dimensions()
Member:

you don't use this currently

@@ -914,12 +929,18 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
    Nested dictionary with variable names as keys and dictionaries of
    variable specific encodings as values, e.g.,
    ``{'my_variable': {'dtype': 'int16', 'scale_factor': 0.1, 'zlib': True}, ...}``
unlimited_dims : str or sequence of str, optional
Member:

I think it needs to be a sequence of str, not a str.

@@ -251,6 +251,12 @@ def get_dimensions(self):
return FrozenOrderedDict((k, len(v))
                         for k, v in iteritems(self.ds.dimensions))

def get_encoding(self):
Member:

I would lean slightly toward just creating a get_unlimited_dims method rather than get_encoding, unless we can think of other Dataset wide encodings we might possibly add in the future.

Member Author:

The other encoding value that comes to mind is the dataset format (e.g. NETCDF4 vs. NETCDF3). Maybe there are others as well, but nothing else comes to mind.

@@ -565,8 +565,11 @@ def to_netcdf(dataset, path=None, mode='w', format=None, group=None,
sync = writer is None

store = store_cls(path, mode, format, group, writer)
# Copy dataset encoding to datastore
store.encoding = dataset.encoding
Member:

Do we ever actually use this encoding state on the datastore? If not, let's not bother setting it. I think everything necessary ends up being passed on via set_variables.

Note that as much as possible, I've tried to make DataStore itself stateless, only storing state in the file-like object it points to.

Member Author:

We were using this but I've refactored to avoid it.

@@ -96,8 +98,9 @@ def load(self):
This function will be called anytime variables or attributes
are requested, so care should be taken to make sure it's fast.
"""
variables = FrozenOrderedDict((_decode_variable_name(k), v)
                              for k, v in iteritems(self.get_variables()))
self.encoding = self.get_encoding()
Member:

This is a little dangerous -- .load() needs to be called in order to guarantee a consistent encoding state on a DataStore. I would rather we didn't set such state, and simply pulled this information out of the file linked to the DataStore as necessary.

Member Author:

Fair point, I've removed the encoding attribute on the DataStore.


jhamman commented Jan 21, 2017

@shoyer - all comments addressed and tests are passing.

@@ -62,6 +62,7 @@ class PydapDataStore(AbstractDataStore):
def __init__(self, url):
import pydap.client
self.ds = pydap.client.open_url(url)
self.encoding = {}
Member:

delete

@@ -42,6 +42,7 @@ def __init__(self, filename, mode='r'):
self.ds = opener()
self._opener = opener
self._mode = mode
self.encoding = {}
Member:

delete

@@ -102,6 +102,7 @@ def __init__(self, filename_or_obj, mode='r', format=None, group=None,
self.ds = opener()
self._opener = opener
self._mode = mode
self.encoding = {}
Member:

delete

@@ -21,6 +21,7 @@ class InMemoryDataStore(AbstractWritableDataStore):
def __init__(self, variables=None, attributes=None, writer=None):
self._variables = OrderedDict() if variables is None else variables
self._attributes = OrderedDict() if attributes is None else attributes
self.encoding = {}
Member:

delete

@@ -116,9 +117,18 @@ def get_variables(self):
def get_attrs(self):
    return Frozen(_decode_attrs(self.ds._attributes))

def _get_unlimited_dimensions(self):
Member:

I don't think you use this method anymore

@@ -58,6 +59,7 @@ def __init__(self, filename, mode='r', format=None, group=None,
self._opener = opener
self._filename = filename
self._mode = mode
self.encoding = {}
Member:

This still should go away :)

ds = Dataset({'x': ('y', np.arange(10.0))})
ds.encoding = {'unlimited_dims': ['y']}
with pytest.warns(UserWarning):
    ds.to_netcdf('foo-bar.nc', engine='h5netcdf')
Member:

use the create_tmp_file() context manager to ensure this file gets cleaned up
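
For example (create_tmp_file is the context manager already used elsewhere in xarray's backend tests):

with create_tmp_file() as tmp_file:
    with pytest.warns(UserWarning):
        ds.to_netcdf(tmp_file, engine='h5netcdf')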

jhamman merged commit 6d5ad44 into pydata:master on Jan 24, 2017
jhamman deleted the feature/unlimited_ncdims branch on January 24, 2017 06:38

jhamman commented Jan 24, 2017

In we go. Thanks @shoyer for putting up with my short attention span on this one.

Linked issue: Creating unlimited dimensions with xarray.Dataset.to_netcdf (#992)