Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better support for subclasses: tests, docs and API #1097

Open
shoyer opened this issue Nov 8, 2016 · 14 comments
Open

Better support for subclasses: tests, docs and API #1097

shoyer opened this issue Nov 8, 2016 · 14 comments

Comments

@shoyer
Copy link
Member

shoyer commented Nov 8, 2016

Given that people do currently subclass xarray objects, it's worth considering making a subclass API like pandas:
http://pandas.pydata.org/pandas-docs/stable/internals.html#subclassing-pandas-data-structures

At the very least, it would be nice to have docs that describe how/when it's safe to subclass, and tests that verify our support for such subclasses.

@lxkain
Copy link

lxkain commented Nov 6, 2017

Agreed. Just purely for information, I made a very simple subclass, but then a simple print statement didn't work:

Traceback (most recent call last):
File "", line 1, in
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/common.py", line 97, in repr
return formatting.array_repr(self)
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/formatting.py", line 392, in array_repr
summary.append(repr(arr.coords))
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/formatting.py", line 63, in repr
return ensure_valid_repr(self.unicode())
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/coordinates.py", line 46, in unicode
return formatting.coords_repr(self)
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/formatting.py", line 319, in coords_repr
col_width = _calculate_col_width(_get_col_items(coords))
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/formatting.py", line 281, in _get_col_items
for k, v in mapping.items():
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/_collections_abc.py", line 744, in iter
yield (key, self._mapping[key])
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/coordinates.py", line 191, in getitem
return self._data._getitem_coord(key)
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/dataarray.py", line 465, in _getitem_coord
return self._replace_maybe_drop_dims(var, name=key)
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/dataarray.py", line 257, in _replace_maybe_drop_dims
return self._replace(variable, coords, name)
File "/Users/kain/CloudStation/CSLU/sci/timeview/miniconda/envs/timeview/lib/python3.6/site-packages/xarray/core/dataarray.py", line 248, in _replace
return type(self)(variable, coords, name=name, fastpath=True)
TypeError: init() got an unexpected keyword argument 'fastpath'

The last statement basically called my subclass, instead of DataArray, but my subclass didn't work well with that because it restricted data to be two-dimensional, but the library code required a dimension of 1 at this point.

@fmaussion
Copy link
Member

Do we have examples of such subclasses? It would be interesting to document the cases where the accessors are not good enough.

@lxkain
Copy link

lxkain commented Nov 6, 2017

Initially, I have tried this:

class Signal(Track):  # 1D wave (no coords) and 1D time-value combined (with coords)
    def __init__(self,
                 data: np.ndarray,
                 coords=None,
                 dims=None,
                 name: str=None,
                 attrs: dict=None):
        assert data.ndim == 2
        assert data.shape[1] == 1 
        if dims is None:
            dims = ('time', 'amplitude')
        if coords is not None:
            assert 'time' in coords
        super().__init__(data, coords=coords, dims=dims, name=name, attrs=attrs)

However, I just know discovered the accessors and will have a look.

@arokem
Copy link

arokem commented Aug 21, 2019

Just to add a puzzling attempt to subclass DataArray (provided by @mbeyeler) :

xarray

Are we doing it wrong? Is there documentation that shows how to do something this dumb "the right way"? Would be great to have that (which is essentially a 👍 for this issue).

@max-sixty
Copy link
Collaborator

I think we should be able improve subclass support. (We use it internally at times, mainly for encapsulating logic on specific objects we inherit from Dataset).

@arokem if you're keen to investigate further, would be interesting to know what's happening there. It's possible it's an issue with DataArray's repr rather than the subclass itself

@arokem
Copy link

arokem commented Aug 22, 2019

Yeah: I'd be happy to dig around a bit. Where should I look?

Also, @mbeyeler might be able to share a bit more about sub-classing woes. I think that he made a commendable

@max-sixty
Copy link
Collaborator

For that case, you could put a breakpoint in and see what's calling it. It is bemusing

For subclass support, you could see whether there are methods that return DataArray / Dataset rather than self.__class__.
Maybe we could add a generalized test that could run over a number of methods and assert they return a subclass

I think we're in an equilibrium where subclassing isn't supported for most operations, so it's not used, so we don't hear about it's failures. A moderate push could move us out of that equilibrium!

@crusaderky
Copy link
Contributor

The biggest problem is with all the Dataset methods and accessors that return a DataArray, and vice versa. Anybody who wants to create a coupled pair of Dataset and DataArray subclasses will need to hunt down all methods and accessors that return the other class in the pair and override them.

May I ask what are the practical use cases for subclassing? In several years worth of day-to-day use of xarray I always found that encapsulation felt much more natural.

@crusaderky
Copy link
Contributor

crusaderky commented Aug 22, 2019

There's also a funny, pickle-friendly hack that allows you to add methods to a Dataset without subclassing it - thanks to the Dataset.__getattr__ magic:

def custom(self: Dataset, ...):
    ...

ds = Dataset(...)
ds.attrs['custom'] = partial(custom, ds)
ds.custom(...)

@crusaderky
Copy link
Contributor

There's also the argument that I would love, at some point, to migrate the xarray objects to use __slots__. From a quick benchmark on my mac, that speeds up every single attribute access by 7ns. I suspect that would very quickly add up. However, any subclass that defines extra attributes (as opposed to just methods) would break on the transition between __dict__ an __slots__.

@crusaderky
Copy link
Contributor

Nevermind __slots__. I just tried and there is no noticeable speedup.
Code at crusaderky@26e0477

@max-sixty
Copy link
Collaborator

max-sixty commented Aug 22, 2019

Nevermind __slots__. I just tried and there is no noticeable speedup.

I had thought the primary saving was memory (and fairly significant with lots of objects)

@crusaderky
Copy link
Contributor

crusaderky commented Aug 22, 2019

Nevermind __slots__. I just tried and there is no noticeable speedup.

I had thought the primary saving was memory (and fairly significant with lots of objects)

It is one of the two savings - the other theoretically being attribute access time. I think I'll go on with a pull request for further discussion.

@max-sixty
Copy link
Collaborator

The biggest problem is with all the Dataset methods and accessors that return a DataArray, and vice versa. Anybody who wants to create a coupled pair of Dataset and DataArray subclasses will need to hunt down all methods and accessors that return the other class in the pair and override them.

This is correct - functions which convert between DataArray & Dataset wouldn't retain type. That said, this can still be helpful without the ability to change type.

There's a bigger piece of work which would solve this too, at the cost of abstraction: have class attributes which define _array_type=DataArray on Dataset and similar for DataArray. pandas used this

May I ask what are the practical use cases for subclassing? In several years worth of day-to-day use of xarray I always found that encapsulation felt much more natural.

Right, good question and we should catch ourselves from adding every abstraction. I have one specific use-case we've already found helpful: we have an object that is mostly a Dataset, with the addition of some behaviors and constructors - for example from_[dataset_x_and_y] or assert_invariants. Alternatives:

  • A class which had the dataset as an attribute - but this means 90%+ of object use is x.data, and the object can't be passed into methods expecting a dataset.
    • Prior to using xarray, we frequently used this pattern to aggregate lots of pandas dataframes (and I've seen this in the wild too)
  • A normal dataset with some associated functions in a module - but makes discovering the functions harder
  • The current state - an object inherited from Dataset with methods

We don't use accessors because the behaviors are specific to a class, rather than every xarray object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants