
sparse and other duck array issues #3245

Closed
dcherian opened this issue Aug 22, 2019 · 33 comments · Fixed by #5568
Labels: topic-arrays (related to flexible array support)

Comments

@dcherian (Contributor)

Issues I've run into while working with @friedrichknuth at the Pangeo meeting, trying to use sparse:

  1. .plot() doesn't work. We need to convert to dense before plotting
  2. .values doesn't work. Raises an error about explicitly calling todense()
  3. align doesn't seem to work.
@dcherian changed the title from "sparse" to "sparse issues" on Aug 22, 2019
@shoyer (Member)

shoyer commented Aug 23, 2019

I think it is intentional for automatic coercion to NumPy arrays to fail. Making plot() and .values work should probably require explicitly writing dataarray.to_dense() first.

@dcherian (Contributor, Author)

A better way to phrase this is:

  1. Should we formalize the convention that .values always returns a numpy array, i.e. it calls todense() implicitly, and that .data always returns the underlying container (sparse/dask etc.)?
  2. Should we call todense() automatically in plot(), the same way we do compute()?

@shoyer (Member)

shoyer commented Aug 23, 2019

  • Should we formalize the convention that .values always returns a numpy array, i.e. it calls todense() implicitly, and that .data always returns the underlying container (sparse/dask etc.)?

I think .values should either return a NumPy array or raise an exception, based on whether the underlying duck array supports coercion with np.array().

We should have a separate API (maybe .to_dense() or .to_numpy_data()?) for explicitly converting into NumPy arrays. This should not be called automatically inside xarray.

Basically, we should leave the decision about whether automatic coercion is safe up to the authors of duck array libraries.
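To illustrate how a library can make that decision itself, here is a minimal sketch of a hypothetical duck array that refuses implicit densification, in the spirit of sparse.COO (the class and its error message are made up; only the `__array__` protocol is real NumPy API):

```python
import numpy as np

class OptOutArray:
    """Hypothetical duck array that refuses implicit densification."""

    def __init__(self, shape):
        self.shape = shape
        self.ndim = len(shape)
        self.dtype = np.dtype("float64")

    def __array__(self, dtype=None, copy=None):
        # raising here means np.asarray() on this object fails,
        # so the library, not xarray, decides whether coercion is safe
        raise RuntimeError(
            "automatic conversion to a dense numpy array is disabled; "
            "call .todense() explicitly"
        )

    def todense(self):
        # explicit, opt-in conversion
        return np.zeros(self.shape, dtype=self.dtype)


arr = OptOutArray((3, 4))
try:
    np.asarray(arr)              # implicit coercion: the library says no
except RuntimeError as err:
    print("refused:", err)
print(arr.todense().shape)       # explicit conversion still works
```

sparse itself takes exactly this stance: `np.asarray` on a COO array raises unless auto-densification is explicitly enabled.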

@dcherian (Contributor, Author)

👍 I think to_numpy() would be good.

@shoyer mentioned this issue on Aug 23, 2019
@shoyer (Member)

shoyer commented Aug 23, 2019

The main downside of to_numpy() is that it isn't obvious whether the result should be a DataArray containing a NumPy array or a raw NumPy array. In particular, pandas uses to_numpy() for converting into a raw NumPy array, which is the opposite of our intent here.

@max-sixty (Collaborator)

I do think the pandas approach is pretty good here: don't break anything relying on .values; instead have .array to get the backing array (in whatever form it might be), and to_numpy() to coerce to a raw numpy array.

ref https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#accessing-the-values-in-a-series-or-index
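The pandas 0.24 convention referenced above can be seen directly with public pandas API (using a nullable extension dtype so the backing array is not a plain ndarray):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")  # nullable integer extension dtype

backing = s.array        # the backing ExtensionArray, whatever it is
coerced = s.to_numpy()   # explicit coercion to a raw numpy array

print(type(backing).__name__)   # IntegerArray
print(type(coerced).__name__)   # ndarray
```

This is the split max-sixty is suggesting xarray mirror: `.array`/`.data` for "whatever backs it", `to_numpy()` for "give me numpy, coercing if needed".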

@shoyer (Member)

shoyer commented Aug 23, 2019 via email

@khaeru

khaeru commented Oct 7, 2019

As far as I can tell, the proposal here will require either

s = pd.Series(...)
xr.DataArray.from_series(s).to_series()

or:

xr.DataArray.from_series(s, sparse=True).to_dense().to_series()

For any code that can't guarantee sparse/non-sparse input, the former will sometimes fail, so it will always be necessary to write the latter everywhere, which IMO is unnecessarily verbose.

@fujiisoup (Member)

Have we arrived at a consensus here on the API for converting to a sparse or numpy array?
xref #3542

To convert to a sparse array, would to_sparse() be better? Or as_sparse()?

  • to_sparse() is probably consistent with the todense() method
  • as_sparse() sounds similar to sparse's functions, e.g. sparse.as_coo

To convert back from a sparse array, would to_dense() be better?
FYI, sparse uses todense().

I personally like as_sparse or as_numpy (or as_dense?), which sound similar to as_type, and which give an xarray object, not the dtype itself.

@dcherian (Contributor, Author)

I weakly prefer following the upstream API: as_sparse and todense.

though to_sparse, to_dense and to_numpy_data would be more consistent?

Also, isn't the function astype, not as_type?

@shoyer (Member)

shoyer commented Nov 19, 2019

I like as_sparse and as_dense, because it makes clear that the objects are still xarray objects. I agree with @fujiisoup that to_sparse/todense could be confused as returning sparse arrays directly.

@dcherian (Contributor, Author)

it makes clear that the objects are still xarray objects

This is a good point. I'm in favour of as_sparse, as_dense.

@shoyer (Member)

shoyer commented Nov 19, 2019

Also isn't the function astype not as_type?

Yes, but I think that mostly works because it's so short. In general the rule is to use underscores when it improves clarity. assparse could be read in at least two different ways, one of which is not really appropriate.

@max-sixty (Collaborator)

assparse could be read in at least two different ways, one of which is not really appropriate.

brilliant

@fujiisoup (Member)

What is the best way to save a sparse array to disk?

One naive way would be stack -> reset_index, but that flattens the coordinates, and if there is another variable that depends on these coordinates it will also be flattened and may consume a lot of space.

@TomNicholas added the topic-arrays (related to flexible array support) label on Apr 6, 2020
@jacobtomlinson (Contributor)

@dcherian pointed me over here after I raised #4234.

I like as_sparse and as_dense, because it makes clear that the objects are still xarray objects. I agree with @fujiisoup that to_sparse/todense could be confused as returning sparse arrays directly.

I very much agree with this reasoning. to_ sounds like it is converting the whole thing to another type. as_ sounds like it is remaining the same but something about it (the underlying array type) is changing.

Basically, we should leave the decision about whether automatic coercion is safe up to the authors of duck array libraries.

I also agree with this reasoning. In cupy many things will break because np.asarray(cupy_array) will raise. So something like cupy_da.plot() will raise. But the exception will be ValueError: object __array__ method not producing an array, which isn't particularly user-friendly in this case.

For cupy I do think that automatic coercion to numpy for .values and plot() is a reasonable thing to do. But I understand that may not be true for all duck type arrays.

One suggestion is to add an accessor to make things explicit. So it would become cupy_da.cupy.plot(). Where this plot function coerces to numpy and then calls the upstream plot method. However this feels less satisfying.

@dcherian (Contributor, Author)

I am unsure about automatic coercion for .values. I think we've trained users to expect numpy arrays from .values so cupy raising an error is a little unfriendly. OTOH it'd be nice to have something that is equivalent to np.asarray(da.data) but maybe we can expect users who want that to just type it out.

But I think we should automatically coerce for .plot using da.as_numpy().data or something similar. We call .compute() on dask arrays anyway so it's consistent with that behaviour.

@jacobtomlinson (Contributor)

Given the current guidance in #4212 is to aim for cupy to mostly be handled via accessors I wonder if xarray could be extended a little further to allow as_ methods to be registered in the same way as accessors.

Currently with the accessor model a user would need to do something like ds.cupy.as_cupy and ds.cupy.as_numpy. When casting a Dataset or DataArray to use cupy under the hood, I expect folks would look for a top-level as_cupy method. And once it has been cast, it would need a cupy-specific as_numpy method to be able to cast back.

Enabling accessors to register as_ methods at the top level would really help neaten this up.

@keewis (Collaborator)

keewis commented Jul 20, 2020

we currently don't really encourage this way of using accessors (especially using too many of these, see #1080 (comment) and #1080 (comment)), but you can use register_*_accessor to register functions:

@xr.register_dataarray_accessor("as_cupy")
def _(da):
    # accessing da.as_cupy calls this function with the DataArray
    # and returns the inner callable, so da.as_cupy() works
    def as_cupy():
        # convert da.data to a cupy array here
        # ...
        return da

    return as_cupy

@jacobtomlinson (Contributor)

Thanks @keewis. I did consider that but I assumed that kind of thing wouldn't be encouraged. Happy to go down that path, but I wonder if life would be easier if there were an extension API for this purpose.

I was thinking of something along these lines which would help keep things in check.

@xr.register_dataarray_as_method("cupy")
class CupyAsMethod:
    def as_cupy(self):
        """Convert all data arrays to cupy."""
        for var in self.data_vars:
            self.data_vars[var].data = cp.asarray(self.data_vars[var].data)
        return self

    def as_numpy(self):
        """Convert all data arrays to numpy."""
        for var in self.data_vars:
            self.data_vars[var].data = self.data_vars[var].data.get()
        return self

There would need to be some logic around dispatching as_numpy to CupyAsMethod.as_numpy, but only if the underlying data is cupy.

Perhaps some is_numpy, is_cupy boolean attributes would be useful?

I could go as you suggest and use the current accessor tooling to test this functionality out. Then upstream it into something more maintainable if it works out ok?

@keewis (Collaborator)

keewis commented Jul 20, 2020

Sounds good. It might be worth discussing the general strategy for the support of this kind of duck array (sparse, cupy and others) in the dev call on Wednesday.

@dcherian (Contributor, Author)

It seems like the general consensus was (please correct me if any of this is wrong):

  1. .values will return np.asarray(obj.data) and respect the underlying library's choices: this will raise an error with cupy and that's OK.
  2. .plot() is a special case and should always coerce to numpy before passing data to matplotlib. plot currently uses .values so we should add a new function that always returns xarray objects with numpy arrays (with cupy as special case) and use that in plot.
  3. for now we are OK with special as_sparse, as_cupy, as_pint methods in xarray as long as all that logic is contained in one place: as_duck_array.py?

@jacobtomlinson (Contributor)

Thanks for writing that up @dcherian. Sounds good to me!

@shoyer (Member)

shoyer commented Jul 22, 2020

3. for now we are OK with special as_sparse, as_cupy, as_pint methods in xarray as long as all that logic is contained in one place: as_duck_array.py?

+1 for as_numpy(), too

One question about as_numpy: should it convert pint arrays into NumPy by stripping units? Or should it convert the arrays underlying pint and keep the units? I guess the first would probably make more sense for DataArray.as_numpy(). The second behavior could be achieved with DataArray.pint.as_numpy().

@dcherian changed the title from "sparse issues" to "sparse and other duck array issues" on Jul 22, 2020
@jthielen (Contributor)

One question about as_numpy: should it convert pint arrays into NumPy by stripping units? Or should it convert the arrays underlying pint and keep the units? I guess the first would probably make more sense for DataArray.as_numpy(). The second behavior could be achieved with DataArray.pint.as_numpy().

This brings up a point I was wondering about as well: how should these as_*() methods relate to multiply-nested arrays? (Not sure if best to discuss that here or new issue.)

@dcherian (Contributor, Author)

IMO a good first pass for as_numpy would be to use np.asarray with a special case for cupy where it would use .get.
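As a sketch of that first pass (the helper name is made up; the `.get()` call is cupy's real device-to-host copy, and the module check stands in for xarray's actual cupy detection):

```python
import numpy as np

def as_numpy_data(data):
    """Hypothetical helper: coerce a duck array's data to numpy.

    cupy arrays cannot be passed to np.asarray (their __array__ raises),
    but they expose .get() for an explicit device-to-host copy.
    """
    if type(data).__module__.split(".")[0] == "cupy":
        return data.get()
    return np.asarray(data)

print(as_numpy_data([1.0, 2.0]))  # non-cupy input falls through to np.asarray
```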

@shoyer (Member)

shoyer commented Jul 23, 2020 via email

@tacaswell

Is there any motion on this? This came up in relation to Awkward Array today (see scikit-hep/awkward#749 ).

If xarray goes with the to_numpy spelling as well, then all of {pandas, xarray, awkward-array} have the same spelling, which is nice from the Matplotlib side of things.

On the as_ vs to_ distinction, I think to_ makes more sense, as it gets you out of xarray land and drops you back into raw numpy land.

@dcherian (Contributor, Author)

dcherian commented Feb 17, 2021

My impression was that the as_* methods would return xarray objects. So we could have DataArray methods

def as_numpy(self):
    # needs cupy special handling
    data = self.data
    if isinstance(data, cupy_array_type):
        raise NotImplementedError
    else:
        return self.copy(data=np.array(data))

def to_numpy(self):
    """Coerces to and returns a numpy.ndarray"""
    return self.as_numpy().data

@keewis (Collaborator)

keewis commented Feb 17, 2021

I agree, this would be useful. Ideally, that would be implemented by deferring to the data, for example by calling the data's to_numpy method (so duck array libraries or duck array integration libraries can customize that), falling back to np.asarray(data).
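A sketch of that deferral (the helper name is made up, and a `to_numpy` hook on duck arrays is only the suggestion above, not settled API):

```python
import numpy as np

def duck_to_numpy(data):
    # defer to the duck array's own conversion hook when it defines one,
    # so the wrapping library controls how (and whether) coercion happens
    hook = getattr(data, "to_numpy", None)
    if callable(hook):
        return hook()
    # otherwise fall back to plain numpy coercion
    return np.asarray(data)

class WithHook:
    """Stand-in duck array that customises its numpy conversion."""
    def to_numpy(self):
        return np.arange(3)

print(duck_to_numpy(WithHook()))   # uses the hook
print(duck_to_numpy([4, 5, 6]))    # falls back to np.asarray
```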

@TomNicholas (Member)

TomNicholas commented Jul 2, 2021

I'm trying to implement this (I wanted pint-aware plotting to work) in #5568, but not 100% sure if I'm doing it right.

1. Do we want these as_*, to_* methods to live in Variable or DataArray?

2. What about multiply-wrapped arrays? At the moment I have essentially
def as_numpy(self) -> T_DataArray:
    """
    Coerces wrapped data into a numpy array, and returns it wrapped in
    a DataArray.
    """
    data = self.data
    if isinstance(data, dask_array_type):
        return self.load()
    elif isinstance(data, cupy_array_type):
        return self.copy(data=data.get())
    elif isinstance(data, pint_array_type):
        return self.copy(data=data.magnitude)
    elif isinstance(data, sparse_array_type):
        return self.copy(data=data.to_dense())
    else:
        return self.copy(data=np.array(data))

but do I actually want multiple passes, like

    data = self.data
    if isinstance(data, dask_array_type):
        data = self.load().data
    if isinstance(data, cupy_array_type):
        data = data.get()
    if isinstance(data, pint_array_type):
        data = data.magnitude
    if isinstance(data, sparse_array_type):
        data = data.to_dense()

    return self.copy(data=np.array(data))
3. @jacobtomlinson currently .values will extract the values of a cupy array by special-casing it, but based on the above discussion do we actually want to break that in favour of getting them via .to_numpy() instead?

4. There are _as_sparse and _to_dense methods already, but they are private and only present on Variable. Should I expose them?
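The multiple-pass idea above can be sketched as a generic unwrap loop. The attribute names (magnitude, todense, get) follow pint, sparse, and cupy conventions, but the duck-typed checks and the helper itself are illustrative, not xarray's actual type-based dispatch:

```python
import numpy as np

def unwrap_to_numpy(data):
    # peel one wrapper layer per iteration until nothing matches,
    # then coerce whatever remains with np.asarray
    while True:
        if hasattr(data, "magnitude"):        # pint.Quantity
            data = data.magnitude
        elif hasattr(data, "todense"):        # sparse.COO
            data = data.todense()
        elif type(data).__module__.split(".")[0] == "cupy":
            data = data.get()                 # cupy.ndarray, device-to-host
        else:
            break
    return np.asarray(data)

class FakeQuantity:
    """Stand-in for a pint.Quantity wrapping another array."""
    def __init__(self, magnitude):
        self.magnitude = magnitude

# a pint-like wrapper around a numpy array unwraps in one pass
print(unwrap_to_numpy(FakeQuantity(np.ones(2))))
```

The loop handles nesting like pint-on-sparse because each iteration only removes the outermost layer.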

@TomNicholas (Member)

Also, I can add a .to_pint() method, but I don't really understand what the purpose would be. To be of any use to users it would need to accept a units argument, but then it's basically just a less powerful version of pint_xarray's .quantify() accessor method. Can you see any reason to add it, @keewis?

@keewis (Collaborator)

keewis commented Jul 3, 2021

Right, I would leave these methods to xarray extension libraries like cupy-xarray or pint-xarray. I assume you meant .as_pint, because adding .to_pint does not make much sense: following the naming convention from above, .to_* should return the duck array, which would make it an alias of either .data or the duck array's constructor called on .data.
