xarray support for categorical data #91
Current coverage is 99.08% (diff: 100%)

@@            master     #91    diff @@
==========================================
  Files         30      30
  Lines       5557    5591     +34
  Methods        0       0
  Messages       0       0
  Branches     776     783      +7
==========================================
+ Hits        5506    5540     +34
  Misses        28      28
  Partials      23      23
This seems reasonable at a quick skim, but because the Travis build isn't installing xarray, none of the new code is actually getting tested. (You can also see this in the coverage bots complaining about lots of untested lines.) Could you add xarray to the list of libraries in .travis.yml?
Numeric data was OK as is, but checks/conversions for categorical caused errors, largely because __iter__ on an xarray.DataArray returns a DataArray unlike the behavior of something like pandas.Series
Current dependency is pandas>=0.15
Thanks for the review. I added in |
xarray requires pandas 0.15 or newer, so indeed that makes sense to me.
Does this only work on 1D It might also be nice to mention xarray compatibility in the patsy docs somewhere (e.g., in the docstring for |
Good point about mentioning the support in the docs. I've added some text to the

Technically it will work for 1- or 2-dimensional data, just like Patsy works with 1- or 2-dimensional NumPy arrays or DataFrames. I say "technically" because I didn't know it worked with 2-dimensional data until just now and I'm not really sure what the use case might be.

So, for example, this works:

> import numpy as np
> import patsy
> import xarray as xr
> shp = (10, 20)
> ds = xr.Dataset({'a': (['time', 'x'], np.random.rand(*shp))},
                  coords={'time': np.arange(shp[0]),
                          'x': np.arange(shp[1])})
> patsy.dmatrix('1 + a + time', ds)
DesignMatrix with shape (10, 22)
Columns:
['Intercept',
'a[0]',
'a[1]',
...
'a[19]',
'time']
Terms:
'Intercept' (column 0), 'a' (columns 1:21), 'time' (column 21)
(to view full data, use np.asarray(this_obj))

Passing more dimensions than the 2 dimensions allowed triggers a PatsyError from this function regardless of whether you use a NumPy array or

In case you're interested: my use cases the data is usually only along a single
I've used patsy in the past with xarray, but mostly just by converting into a In principle, it would be nice if patsy had generic support for handling multi-dimensional |
The general rule is that patsy is happy to accept 1d or 2d numerical predictors, but only 1d categorical predictors, mostly because I don't know what use a 2d categorical predictor would be. (Maybe someone will tell me.) Though I guess for most uses of multi-dimensional xarrays, the multiple dimensions represent something like lat x lon, whereas patsy always works in observations x predictors space which is totally different. If you do give patsy a 2d ndarray, then it interprets it as observations x predictors -- e.g. something like |
Does this code handle all of these cases?
The use case would be for fitting a collection of models. For example, you have a number of raw features (e.g., physical variables) defined on a grid (latitude, longitude, time). You combine these raw features with patsy to build predictors, also on the same grid. Then you run a bunch of independent time series models on (time, feature) for each (latitude, longitude) grid point. This sort of analysis is pretty common in the geosciences, e.g., to understand climate trends. Anyways, this is definitely beyond the scope of this PR!
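To make the "collection of models" use case concrete, here is a rough NumPy-only sketch. It assumes every grid point shares one design matrix (built by hand here, standing in for what patsy would produce) and uses synthetic data with a known trend; the variable names and sizes are illustrative, not taken from the discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_lat, n_lon = 50, 3, 4  # illustrative grid size

# Shared design matrix in observations x predictors space:
# an intercept column plus a linear time trend.
time = np.arange(n_time, dtype=float)
X = np.column_stack([np.ones(n_time), time])

# Synthetic gridded response: intercept 2.0, slope 0.5, small noise.
noise = 0.01 * rng.standard_normal((n_time, n_lat, n_lon))
y = 2.0 + 0.5 * time[:, None, None] + noise

# Stack (lat, lon) into one axis so each column is one grid point's series.
Y = y.reshape(n_time, n_lat * n_lon)

# lstsq fits every column independently in a single call.
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Unstack the coefficients back onto the grid: (predictor, lat, lon).
coef_grid = coef.reshape(X.shape[1], n_lat, n_lon)
print(coef_grid.shape)
```

Stacking the grid into a single axis is what lets one design matrix serve every pixel, instead of rebuilding it inside a per-pixel loop.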
@shoyer has basically described my exact use case, though I think for a different set of science questions and datasets in mind :). The advantage of the change he's described seems to be mostly a better user experience and it'd probably be faster to stack and calculate all in one go instead of running the same design matrix calculation inside of a "select pixel at lon/lat". Something to think about adding via another PR if Patsy is amenable to supporting this use. @njsmith I've updated the test suite to test categorical and numeric data with and without using |
Not sure if this is out of scope for the project, but this PR tries to smooth over the only issue I've come across so far when using Patsy with xarray.Dataset objects (in place of pd.DataFrame).

As a bit of background, xarray extends the idea of the data frame into an arbitrary number of dimensions, is now integrated into pandas, and is being considered as a replacement for the Panel capabilities in pandas. Xarray has two main data types, DataArray and Dataset, that are n-dimensional extensions of pandas' Series and DataFrame (see the docs here). I love using Patsy in my work with geospatial data, so such an integration would be great for me and might help down the road if pandas deprecates Panel in favor of xarray objects.

To my surprise, feeding xarray.Dataset objects containing purely numeric data as the data into Patsy design matrix calls worked without any hangup. However, the integration breaks down when one tries to create a design matrix involving categorical data:

For example, this error crops up in categorical.CategoricalSniffer:sniff. It seems the root of the issue is that the __iter__ for an xarray.DataArray returns a single element still wrapped up in an xarray.DataArray, unlike when iterating over a pd.Series or np.ndarray.

I gave a go at a patch by defining have_xarray in patsy/utils and "unwrapping" data from its xarray.DataArray container before trying to perform any of the checks or conversions Patsy does with categorical data. I'm happy to iterate on design or implementation if you think it's worth integrating.

Thanks for your work!
container before trying to perform any of the checks or conversions Patsy does with categorical data. I'm happy to iterate on design or implementation if you think it's worth integrating.Thanks for your work!