xarray support for categorical data #91

ceholden · 2016-09-06T18:50:29Z

Not sure if this is out of scope for the project, but this PR tries to smooth over the only issue I've came across so far when using Patsy with xarray.Dataset objects (in place of pd.DataFrame).

As a bit of background, xarray extends the idea of the data frame into an arbitrary number of dimensions, is now integrated into pandas, and is being considered as a replacement for the Panel capabilities in pandas. Xarray has two main data types -- DataArray and Dataset -- that are n-dimensional extensions of Panda's Series and DataFrame (see the docs here). I love using Patsy in my work with geospatial data, so such an integration would be great for me and might help down the road if Pandas deprecates Panel in favor of xarray objects.

To my surprise, feeding xarray.Dataset objects containing purely numeric data as the data into Patsy design matrix calls worked without any hangup. However, the integration breaks down when one tries to create a design matrix involving categorical data:

patsy.PatsyError: Error interpreting categorical data: all items must be hashable

For example, this error crops up in categorical.CategoricalSniffer:sniff. It seems the root of the issue is that the __iter__ for an xarray.DataArray returns a single element still wrapped up in an xarray.DataArray, unlike when iterating over a pd.Series or np.ndarray.

I gave a go at a patch by defining have_xarray in patsy/utils and "unwrapping" data from its xarray.DataArray container before trying to perform any of the checks or conversions Patsy does with categorical data. I'm happy to iterate on design or implementation if you think it's worth integrating.

Thanks for your work!

coveralls · 2016-09-06T20:24:20Z

Coverage decreased (-0.3%) to 98.362% when pulling 476647a on ceholden:xarray_categorical into 8b6c712 on pydata:master.

codecov-io · 2016-09-06T20:26:04Z

Current coverage is 99.08% (diff: 100%)

Merging #91 into master will increase coverage by <.01%

@@             master        #91   diff @@
==========================================
  Files            30         30          
  Lines          5557       5591    +34   
  Methods           0          0          
  Messages          0          0          
  Branches        776        783     +7   
==========================================
+ Hits           5506       5540    +34   
  Misses           28         28          
  Partials         23         23

Powered by Codecov. Last update a5b54c2...c7c63a4

njsmith · 2016-10-24T04:41:58Z

This seems reasonable at a quick skim, but because the travis build isn't installing xarray, none of the new code is actually getting tested. (You can also see this in the coverage bots complaining about lots of untested lines.) Could you add xarray to the list of libraries in .travis.yml?

Numeric data was OK as is, but checks/conversions for categorical caused errors, largely because __iter__ on an xarray.DataArray returns a DataArray unlike the behavior of something like pandas.Series

Current dependency is pandas>=0.15

ceholden · 2016-11-07T16:29:13Z

Thanks for the review. I added in xarray to the Travis install list for compatible versions of Pandas as it made sense to me (2nd commit failed on tests with pandas=0.14), but I'm happy to edit to accommodate any preferences you might have.

shoyer · 2016-11-07T16:33:16Z

I added in xarray to the Travis install list for compatible versions of Pandas as it made sense to me (2nd commit failed on tests with pandas=0.14)

xarray requires pandas 0.15 or newer, so indeed that makes sense to me.

shoyer · 2016-11-07T16:35:40Z

Does this only work on 1D xarray.Dataset objects?

It might also be nice to mention xarray compatibility in the patsy docs somewhere (e.g., in the docstring for dmatrix).

ceholden · 2016-11-07T18:29:59Z

Good point about mentioning the support in the docs. I've added some text to the dmatrix docstring and quickstart page as suggested.

Technically it will work for 1- or 2-dimensional data, just like Patsy works with 1- or 2-dimensional NumPy arrays or DataFrames. I say "technically" because I didn't know it worked with 2-dimensional data until just now and I'm not really sure what the use case might be. So, this works for example:

> import numpy as np
> import patsy
> import xarray as xr
> shp = (10, 20)
> ds = xr.Dataset({'a': (['time', 'x'], np.random.rand(*shp))},
                  coords={'time': np.arange(shp[0]),
                          'x': np.arange(shp[1])})
> patsy.dmatrix('1 + a + time', ds)
DesignMatrix with shape (10, 22)
  Columns:
    ['Intercept',
     'a[0]',
     'a[1]',
...
     'a[19]',
     'time']
  Terms:
    'Intercept' (column 0), 'a' (columns 1:21), 'time' (column 21)
  (to view full data, use np.asarray(this_obj))

Passing more dimensions than the 2 dimensions allowed triggers a PatsyError from this function regardless of if you use a NumPy array or xarray object, so I don't think we need special handling for the xarray case.

In case you're interested: my use cases the data is usually only along a single time coordinate with space coordinate squeezed out with a ds.sel(x=x, y=y) sort of call. I use xarray to organize, align, and store time series of remote sensing data, sometimes from different sensors with different pixel resolutions (hooray for reindex_like(other, method='nearest')). I might do some calculations with the xarray object, like derive a "spectral index" that contrasts some of the variables, but the intersection with patsy comes in when trying to statistically model each pixel as an individual. I don't really ever put design matrices back into an xarray object so I wasn't interested in having it supported as a return type.

shoyer · 2016-11-07T18:39:16Z

I've used patsy in the past with xarray, but mostly just by converting into a pandas.DataFrame first (via .to_dataframe). There are then various ways you can pack stuff back into an xarray object.

In principle, it would be nice if patsy had generic support for handling multi-dimensional xarray.Dataset -> xarray.Dataset with new variable names. Could probably make this work (with a few extraneous memory copies) by stacking the dataset with .stack() to make it 1D, passing a dict of 1D numpy arrays into patsy.DesignMatrix, rebuilding an xarray.Dataset with the 1D coordinates and then unstacking back into the original dimensions.

njsmith · 2016-11-07T20:31:47Z

The general rule is that patsy is happy to accept 1d or 2d numerical predictors, but only 1d categorical predictors, mostly because I don't know what use a 2d categorical predictor would be. (Maybe someone will tell me.)

Though I guess for most uses of multi-dimensional xarrays, the multiple dimensions represent something like lat x lon, whereas patsy always works in observations x predictors space which is totally different. If you do give patsy a 2d ndarray, then it interprets it as observations x predictors -- e.g. something like poly(x) returns a matrix where rows correspond to entries in x while columns are the different polynomial orders. So it might make sense to have utilities for flattening xarrays into something like observations x predictors space... I'm not sure why it would be useful to reshape back to xarrays on the output, though? It seems like the next step would be to feed into something like a least squares solver, which also works in observations x predictors space.

njsmith · 2016-11-07T20:34:42Z

Does this code handle all of these cases?

"1 + x" where x is in an xarray and contains numerical values
"1 + x" where x is in an xarray and contains categorical values
"1 + C(x)" where x is in an xarray and contains numerical values (e.g. integers)
"1 + C(x)" where x is in an xarray and contains categorical values

shoyer · 2016-11-07T20:45:06Z

So it might make sense to have utilities for flattening xarrays into something like observations x predictors space... I'm not sure why it would be useful to reshape back to xarrays on the output, though? It seems like the next step would be to feed into something like a least squares solver, which also works in observations x predictors space.

The use case would be for fitting a collection of models. For example, you have a number of raw features (e.g., physical variables) defined on a grid (latitude, longitude, time). You combine these raw features with patsy to build predictors, also on the same grid. Then you run a bunch of independent time series models on (time, feature) for each (latitude, longitude) grid point.

This sort of analysis is pretty common in the geosciences, e.g., to understand climate trends.

Anyways, this is definitely beyond the scope of this PR!

ceholden · 2016-11-07T21:15:53Z

@shoyer has basically described my exact use case, though I think for a different set of science questions and datasets in mind :). The advantage of the change he's described seems to be mostly a better user experience and it'd probably be faster to stack and calculate all in one go instead of running the same design matrix calculation inside of a "select pixel at lon/lat". Something to think about adding via another PR if Patsy is amenable to supporting this use.

@njsmith I've updated the test suite to test categorical and numeric data with and without using C() in the formula. Should be a "yes" to your question.

thequackdaddy mentioned this pull request Sep 16, 2016

patsy questions/wishlist #93

Open

ceholden added 3 commits November 4, 2016 14:41

Support categorical data in xarray dtypes

9b17384

Numeric data was OK as is, but checks/conversions for categorical caused errors, largely because __iter__ on an xarray.DataArray returns a DataArray unlike the behavior of something like pandas.Series

Test xarray on Travis when Pandas is tested

2f4d39b

Test xarray when pandas is new enough

886ac29

Current dependency is pandas>=0.15

Document xarray compat and allowed # of dims

4d05246

Test scenarios of numeric/categorical w/wo C()

c7c63a4

ceholden closed this Feb 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

xarray support for categorical data #91

xarray support for categorical data #91

Uh oh!

ceholden commented Sep 6, 2016

Uh oh!

coveralls commented Sep 6, 2016 •

edited

Loading

Uh oh!

codecov-io commented Sep 6, 2016 •

edited

Loading

Uh oh!

njsmith commented Oct 24, 2016

Uh oh!

ceholden commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016

Uh oh!

ceholden commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016

Uh oh!

njsmith commented Nov 7, 2016

Uh oh!

njsmith commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016 •

edited

Loading

Uh oh!

ceholden commented Nov 7, 2016

Uh oh!

Uh oh!

xarray support for categorical data #91

xarray support for categorical data #91

Uh oh!

Conversation

ceholden commented Sep 6, 2016

Uh oh!

coveralls commented Sep 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-io commented Sep 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current coverage is 99.08% (diff: 100%)

Uh oh!

njsmith commented Oct 24, 2016

Uh oh!

ceholden commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016

Uh oh!

ceholden commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016

Uh oh!

njsmith commented Nov 7, 2016

Uh oh!

njsmith commented Nov 7, 2016

Uh oh!

shoyer commented Nov 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ceholden commented Nov 7, 2016

Uh oh!

Uh oh!

coveralls commented Sep 6, 2016 •

edited

Loading

codecov-io commented Sep 6, 2016 •

edited

Loading

shoyer commented Nov 7, 2016 •

edited

Loading