Dataset.encoding and unlimited dimensions for to_netcdf #1170
Conversation
@encoding.setter
def encoding(self, value):
    self._encoding = OrderedDict(value)
I'm getting an error here:
In [3]: import xarray as xr
In [4]: ds = xr.open_dataset('simple.nc')
.../xarray/xarray/core/dataset.py in encoding(self)
291 """Dictionary of global encoding attributes on this dataset
292 """
--> 293 if self._encoding is None:
294 self._encoding = OrderedDict()
295 return self._encoding
.../xarray/xarray/core/common.py in __getattr__(self, name)
219 # this avoids an infinite loop when pickle looks for the
220 # __setstate__ attribute before the xarray object is initialized
--> 221 for source in self._attr_sources:
222 with suppress(KeyError):
223 return source[name]
.../xarray/xarray/core/dataset.py in _attr_sources(self)
536 def _attr_sources(self):
537 """List of places to look-up items for attribute-style access"""
--> 538 return [self, LevelCoordinatesSource(self), self.attrs, self.encoding]
539
540 def __contains__(self, key):
Why do you need this getter & setter? Why not just have self.encoding = OrderedDict(encoding) in the __init__? Because you need to coerce to OrderedDict?
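A minimal sketch of the constructor-only alternative being asked about (the class and default handling here are illustrative, not xarray's actual code):

from collections import OrderedDict

class Example(object):
    def __init__(self, encoding=None):
        # coerce once at construction; unlike the property/setter pair,
        # later direct assignments would not be re-coerced
        self.encoding = OrderedDict(encoding or {})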
It might help to switch if name != '__setstate__' in AttrAccessMixin.__getattr__ to if self._initialized.
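A rough sketch of that suggestion, based on the __getattr__ shown in the traceback above (it assumes an _initialized flag set at the end of __init__; this is not the actual committed change):

def __getattr__(self, name):
    # only consult _attr_sources once the object is fully constructed,
    # so early lookups (e.g. pickle probing for __setstate__) fail cleanly
    if self._initialized:
        for source in self._attr_sources:
            with suppress(KeyError):
                return source[name]
    raise AttributeError('%r object has no attribute %r'
                         % (type(self).__name__, name))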
"""Dictionary of global encoding attributes on this dataset | ||
""" | ||
if self._encoding is None: | ||
self._encoding = OrderedDict() |
I would just make it a normal dict, like Variable.encoding, unless you have a reason why you feel preserving order would be helpful.
@@ -519,7 +535,7 @@ def __deepcopy__(self, memo=None):
     @property
     def _attr_sources(self):
         """List of places to look-up items for attribute-style access"""
-        return [self, LevelCoordinatesSource(self), self.attrs]
+        return [self, LevelCoordinatesSource(self), self.attrs, self.encoding]
I don't think this is a good idea. This lets you write things like dataset.unlimited_dims to pull out unlimited dimensions from encoding, but dataset.encoding['unlimited_dims'] is more explicit and this runs the risk of confusing encoding/attrs.
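For example (illustrative):

ds.unlimited_dims               # implicit lookup that falls through to encoding
ds.encoding['unlimited_dims']   # explicit, and not confusable with attrs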
@@ -209,6 +217,8 @@ def __init__(self, filename, mode='r', format='NETCDF4', group=None,
         self._opener = opener
         self._filename = filename
         self._mode = 'a' if mode == 'w' else mode
+        self._unlimited_dimensions = set()
+        self.encoding = None
I don't think you need an encoding property for datastores. Shouldn't _unlimited_dimensions be enough?
So that is definitely a bug with scipy not handling Ellipsis.
try:
    target[...] = source
except TypeError:
    # workaround for GH: scipy/scipy#6880
    target[slice(None, None, None)] = source
@shoyer - I raised the Ellipsis issue with the scipy folks: scipy/scipy#6880. This seems to work, although maybe it isn't the most robust solution, I don't know.
Yes, this seems reasonable to me. I would just write target[:] = source, though -- no need to manually make the slice() object. This should work because variables with unlimited dimensions will never be scalars.
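With that simplification, the workaround reads:

try:
    target[...] = source
except TypeError:
    # workaround for GH: scipy/scipy#6880
    target[:] = source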
This is ready for a full review.
From a user-facing API perspective, it could be nice to also add unlimited_dims as a keyword argument to Dataset.to_netcdf.
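Hypothetical usage, if such a keyword were added:

ds.to_netcdf('output.nc', unlimited_dims=['time'])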
self.set_necessary_dimensions(variable)
unlimited_dims = self.encoding.get('unlimited_dims', set())
if len(unlimited_dims) > 0:
    warnings.warn('h5netcdf does not support unlimited dimensions',
If check_encoding is True, this should raise an error, not just a warning.
Actually, check_encoding is specific to variable encoding. Raising an error would make sense if you set unlimited_dims via an argument in to_netcdf (which is not yet possible).
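A sketch of what raising could look like if an explicit unlimited_dims request ever reached this backend (hypothetical; the code above only warns):

unlimited_dims = self.encoding.get('unlimited_dims', set())
if unlimited_dims:
    raise ValueError('h5netcdf does not support unlimited dimensions')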
def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k, v in self.ds.dimensions.items() if v.isunlimited()])
You can use a set comprehension here, e.g.,

encoding['unlimited_dims'] = {k for k, v in self.ds.dimensions.items()
                              if v.isunlimited()}
@@ -306,7 +306,7 @@ class Dataset(Mapping, ImplementsDatasetReduce, BaseDataObject,
     groupby_cls = groupby.DatasetGroupBy

     def __init__(self, data_vars=None, coords=None, attrs=None,
-                 compat='broadcast_equals'):
+                 compat='broadcast_equals', encoding=None):
Needs to be documented in the docstring.
Or we could even remove it entirely from the constructor and require setting the property.
I think I'd like to keep it here. Oftentimes I end up batch constructing objects with something like xr.Dataset(..., **kwargs), where kwargs = {..., 'encoding': {'unlimited_dims': ['time']}}.
Also, since we include encoding in the DataArray constructor, it seems appropriate to have it here as well.
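For example (illustrative of the batch-construction pattern, assuming the encoding keyword added in this PR):

kwargs = {'attrs': {'title': 'example'},
          'encoding': {'unlimited_dims': ['time']}}
ds = xr.Dataset({'x': ('time', [1.0, 2.0, 3.0])}, **kwargs)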
I think consistency makes for a good argument here, though I would consider deprecating the encoding argument to DataArray instead. It would also make sense to get rid of the compat argument to Dataset.
These extra arguments are not part of the fundamental xarray data model and thus are a little distracting, especially to new users.
I'm going to open a separate issue to discuss this. See #1188.
I've removed the encoding keyword arg for now.
"""Dictionary of global encoding attributes on this dataset | ||
""" | ||
if self._encoding is None: | ||
self._encoding = dict() |
could use self._encoding = {}
@shoyer - comments addressed and green. |
Generally looks fine, but I am concerned about stateful storage of data directly on DataStore objects (see below).
def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k in self.ds.dimensions if self.ds.unlimited(k)])
I don't think DAP can represent unlimited dimensions: http://docs.opendap.org/index.php/DAP4:_Specification_Volume_1#Dimensions
Agreed, but this is pynio, which does: https://www.pyngl.ucar.edu/whatsnew.shtml#Version1.4.1
def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k, v in self.ds.dimensions.items() if v is None])
can use the same set comprehension you switched to in netCDF4_.py
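Adapted to this backend, where unlimited dimensions are stored with a length of None, that would be:

encoding['unlimited_dims'] = {k for k, v in self.ds.dimensions.items()
                              if v is None}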
def get_dimensions(self):
    self._unlimited_dimensions = self._get_unlimited_dimensions()
you don't use this currently
@@ -914,12 +929,18 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
             Nested dictionary with variable names as keys and dictionaries of
             variable specific encodings as values, e.g.,
             ``{'my_variable': {'dtype': 'int16', 'scale_factor': 0.1, 'zlib': True}, ...}``
+        unlimited_dims : str or sequence of str, optional
I think it needs to be a sequence of str, not a str.
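One way to accept both would be to normalize up front (an illustrative sketch, Python 2 string types aside, not the actual fix):

if isinstance(unlimited_dims, str):
    unlimited_dims = [unlimited_dims]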
@@ -251,6 +251,12 @@ def get_dimensions(self):
         return FrozenOrderedDict((k, len(v))
                                  for k, v in iteritems(self.ds.dimensions))

+    def get_encoding(self):
I would lean slightly toward just creating a get_unlimited_dims method rather than get_encoding, unless we can think of other Dataset-wide encodings we might possibly add in the future.
The other encoding value that comes to mind is the dataset format (e.g. NETCDF4 vs. NETCDF3). Maybe there are others as well, but nothing else comes to mind.
@@ -565,8 +565,11 @@ def to_netcdf(dataset, path=None, mode='w', format=None, group=None,
     sync = writer is None

     store = store_cls(path, mode, format, group, writer)
+    # Copy dataset encoding to datastore
+    store.encoding = dataset.encoding
Do we ever actually use this encoding state on the datastore? If not, let's not bother setting it. I think everything necessary ends up being passed on via set_variables.
Note that as much as possible, I've tried to make DataStore itself stateless, only storing state in the file-like object it points to.
We were using this but I've refactored to avoid it.
@@ -96,8 +98,9 @@ def load(self):
         This function will be called anytime variables or attributes
         are requested, so care should be taken to make sure its fast.
         """
         variables = FrozenOrderedDict((_decode_variable_name(k), v)
                                       for k, v in iteritems(self.get_variables()))
+        self.encoding = self.get_encoding()
This is a little dangerous -- .load() needs to be called in order to guarantee a consistent encoding state on a DataStore. I would rather we didn't set such state, and simply pulled this information out of the file linked to the DataStore as necessary.
Fair point, I've removed the encoding attribute on the DataStore.
…tore statespace, respond to a few of @shoyer's comments
@shoyer - all comments addressed and tests are passing.
@@ -62,6 +62,7 @@ class PydapDataStore(AbstractDataStore):
     def __init__(self, url):
         import pydap.client
         self.ds = pydap.client.open_url(url)
+        self.encoding = {}
delete
@@ -42,6 +42,7 @@ def __init__(self, filename, mode='r'):
         self.ds = opener()
         self._opener = opener
         self._mode = mode
+        self.encoding = {}
delete
@@ -102,6 +102,7 @@ def __init__(self, filename_or_obj, mode='r', format=None, group=None,
         self.ds = opener()
         self._opener = opener
         self._mode = mode
+        self.encoding = {}
delete
@@ -21,6 +21,7 @@ class InMemoryDataStore(AbstractWritableDataStore):
     def __init__(self, variables=None, attributes=None, writer=None):
         self._variables = OrderedDict() if variables is None else variables
         self._attributes = OrderedDict() if attributes is None else attributes
+        self.encoding = {}
delete
@@ -116,9 +117,18 @@ def get_variables(self):
     def get_attrs(self):
         return Frozen(_decode_attrs(self.ds._attributes))

+    def _get_unlimited_dimensions(self):
I don't think you use this method anymore
@@ -58,6 +59,7 @@ def __init__(self, filename, mode='r', format=None, group=None,
         self._opener = opener
         self._filename = filename
         self._mode = mode
+        self.encoding = {}
This still should go away :)
ds = Dataset({'x': ('y', np.arange(10.0))})
ds.encoding = {'unlimited_dims': ['y']}
with pytest.warns(UserWarning):
    ds.to_netcdf('foo-bar.nc', engine='h5netcdf')
Use the create_tmp_file() context manager to ensure this file gets cleaned up.
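For example, assuming the existing create_tmp_file helper in xarray's test suite:

with create_tmp_file() as tmp_file:
    with pytest.warns(UserWarning):
        ds.to_netcdf(tmp_file, engine='h5netcdf')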
In we go. Thanks, @shoyer, for putting up with my short attention span on this one.
Add Dataset.encoding attribute and support unlimited dimensions for scipy/netcdf4 backends. Closes #992.