Skip to content

xarray support for categorical data #91

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed

xarray support for categorical data #91

wants to merge 5 commits into from

Conversation

ceholden
Copy link

@ceholden ceholden commented Sep 6, 2016

Not sure if this is out of scope for the project, but this PR tries to smooth over the only issue I've came across so far when using Patsy with xarray.Dataset objects (in place of pd.DataFrame).

As a bit of background, xarray extends the idea of the data frame into an arbitrary number of dimensions, is now integrated into pandas, and is being considered as a replacement for the Panel capabilities in pandas. Xarray has two main data types -- DataArray and Dataset -- that are n-dimensional extensions of Panda's Series and DataFrame (see the docs here). I love using Patsy in my work with geospatial data, so such an integration would be great for me and might help down the road if Pandas deprecates Panel in favor of xarray objects.

To my surprise, feeding xarray.Dataset objects containing purely numeric data as the data into Patsy design matrix calls worked without any hangup. However, the integration breaks down when one tries to create a design matrix involving categorical data:

patsy.PatsyError: Error interpreting categorical data: all items must be hashable

For example, this error crops up in categorical.CategoricalSniffer:sniff. It seems the root of the issue is that the __iter__ for an xarray.DataArray returns a single element still wrapped up in an xarray.DataArray, unlike when iterating over a pd.Series or np.ndarray.

I gave a go at a patch by defining have_xarray in patsy/utils and "unwrapping" data from its xarray.DataArray container before trying to perform any of the checks or conversions Patsy does with categorical data. I'm happy to iterate on design or implementation if you think it's worth integrating.

Thanks for your work!

@coveralls
Copy link

coveralls commented Sep 6, 2016

Coverage Status

Coverage decreased (-0.3%) to 98.362% when pulling 476647a on ceholden:xarray_categorical into 8b6c712 on pydata:master.

@codecov-io
Copy link

codecov-io commented Sep 6, 2016

Current coverage is 99.08% (diff: 100%)

Merging #91 into master will increase coverage by <.01%

@@             master        #91   diff @@
==========================================
  Files            30         30          
  Lines          5557       5591    +34   
  Methods           0          0          
  Messages          0          0          
  Branches        776        783     +7   
==========================================
+ Hits           5506       5540    +34   
  Misses           28         28          
  Partials         23         23          

Powered by Codecov. Last update a5b54c2...c7c63a4

@njsmith
Copy link
Member

njsmith commented Oct 24, 2016

This seems reasonable at a quick skim, but because the travis build isn't installing xarray, none of the new code is actually getting tested. (You can also see this in the coverage bots complaining about lots of untested lines.) Could you add xarray to the list of libraries in .travis.yml?

Numeric data was OK as is, but checks/conversions for
categorical caused errors, largely because __iter__
on an xarray.DataArray returns a DataArray unlike
the behavior of something like pandas.Series
Current dependency is pandas>=0.15
@ceholden
Copy link
Author

ceholden commented Nov 7, 2016

Thanks for the review. I added in xarray to the Travis install list for compatible versions of Pandas as it made sense to me (2nd commit failed on tests with pandas=0.14), but I'm happy to edit to accommodate any preferences you might have.

@shoyer
Copy link
Member

shoyer commented Nov 7, 2016

I added in xarray to the Travis install list for compatible versions of Pandas as it made sense to me (2nd commit failed on tests with pandas=0.14)

xarray requires pandas 0.15 or newer, so indeed that makes sense to me.

@shoyer
Copy link
Member

shoyer commented Nov 7, 2016

Does this only work on 1D xarray.Dataset objects?

It might also be nice to mention xarray compatibility in the patsy docs somewhere (e.g., in the docstring for dmatrix).

@ceholden
Copy link
Author

ceholden commented Nov 7, 2016

Good point about mentioning the support in the docs. I've added some text to the dmatrix docstring and quickstart page as suggested.

Technically it will work for 1- or 2-dimensional data, just like Patsy works with 1- or 2-dimensional NumPy arrays or DataFrames. I say "technically" because I didn't know it worked with 2-dimensional data until just now and I'm not really sure what the use case might be. So, this works for example:

> import numpy as np
> import patsy
> import xarray as xr
> shp = (10, 20)
> ds = xr.Dataset({'a': (['time', 'x'], np.random.rand(*shp))},
                  coords={'time': np.arange(shp[0]),
                          'x': np.arange(shp[1])})
> patsy.dmatrix('1 + a + time', ds)
DesignMatrix with shape (10, 22)
  Columns:
    ['Intercept',
     'a[0]',
     'a[1]',
...
     'a[19]',
     'time']
  Terms:
    'Intercept' (column 0), 'a' (columns 1:21), 'time' (column 21)
  (to view full data, use np.asarray(this_obj))

Passing more dimensions than the 2 dimensions allowed triggers a PatsyError from this function regardless of if you use a NumPy array or xarray object, so I don't think we need special handling for the xarray case.

In case you're interested: my use cases the data is usually only along a single time coordinate with space coordinate squeezed out with a ds.sel(x=x, y=y) sort of call. I use xarray to organize, align, and store time series of remote sensing data, sometimes from different sensors with different pixel resolutions (hooray for reindex_like(other, method='nearest')). I might do some calculations with the xarray object, like derive a "spectral index" that contrasts some of the variables, but the intersection with patsy comes in when trying to statistically model each pixel as an individual. I don't really ever put design matrices back into an xarray object so I wasn't interested in having it supported as a return type.

@shoyer
Copy link
Member

shoyer commented Nov 7, 2016

I've used patsy in the past with xarray, but mostly just by converting into a pandas.DataFrame first (via .to_dataframe). There are then various ways you can pack stuff back into an xarray object.

In principle, it would be nice if patsy had generic support for handling multi-dimensional xarray.Dataset -> xarray.Dataset with new variable names. Could probably make this work (with a few extraneous memory copies) by stacking the dataset with .stack() to make it 1D, passing a dict of 1D numpy arrays into patsy.DesignMatrix, rebuilding an xarray.Dataset with the 1D coordinates and then unstacking back into the original dimensions.

@njsmith
Copy link
Member

njsmith commented Nov 7, 2016

The general rule is that patsy is happy to accept 1d or 2d numerical predictors, but only 1d categorical predictors, mostly because I don't know what use a 2d categorical predictor would be. (Maybe someone will tell me.)

Though I guess for most uses of multi-dimensional xarrays, the multiple dimensions represent something like lat x lon, whereas patsy always works in observations x predictors space which is totally different. If you do give patsy a 2d ndarray, then it interprets it as observations x predictors -- e.g. something like poly(x) returns a matrix where rows correspond to entries in x while columns are the different polynomial orders. So it might make sense to have utilities for flattening xarrays into something like observations x predictors space... I'm not sure why it would be useful to reshape back to xarrays on the output, though? It seems like the next step would be to feed into something like a least squares solver, which also works in observations x predictors space.

@njsmith
Copy link
Member

njsmith commented Nov 7, 2016

Does this code handle all of these cases?

  • "1 + x" where x is in an xarray and contains numerical values
  • "1 + x" where x is in an xarray and contains categorical values
  • "1 + C(x)" where x is in an xarray and contains numerical values (e.g. integers)
  • "1 + C(x)" where x is in an xarray and contains categorical values

@shoyer
Copy link
Member

shoyer commented Nov 7, 2016

So it might make sense to have utilities for flattening xarrays into something like observations x predictors space... I'm not sure why it would be useful to reshape back to xarrays on the output, though? It seems like the next step would be to feed into something like a least squares solver, which also works in observations x predictors space.

The use case would be for fitting a collection of models. For example, you have a number of raw features (e.g., physical variables) defined on a grid (latitude, longitude, time). You combine these raw features with patsy to build predictors, also on the same grid. Then you run a bunch of independent time series models on (time, feature) for each (latitude, longitude) grid point.

This sort of analysis is pretty common in the geosciences, e.g., to understand climate trends.

Anyways, this is definitely beyond the scope of this PR!

@ceholden
Copy link
Author

ceholden commented Nov 7, 2016

@shoyer has basically described my exact use case, though I think for a different set of science questions and datasets in mind :). The advantage of the change he's described seems to be mostly a better user experience and it'd probably be faster to stack and calculate all in one go instead of running the same design matrix calculation inside of a "select pixel at lon/lat". Something to think about adding via another PR if Patsy is amenable to supporting this use.

@njsmith I've updated the test suite to test categorical and numeric data with and without using C() in the formula. Should be a "yes" to your question.

@ceholden ceholden closed this Feb 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants