Hello,
I fit some (relatively) large-ish GLMs in statsmodels and have been experimenting with using patsy instead of a home-rolled thing. My home-rolled method isn't very good (I tend to underestimate challenges...). I've gotten some better hardware, so some models that previously didn't work with patsy (because of memory constraints) work now. I've run across a few things that might make it easier for me to use patsy more. Happy to work on PRs for them if there's interest.
- Categorical NA logic: Currently, it appears that when a categorical is fed through `patsy`, every individual value is checked against a rather detailed list of rules for how to handle NaNs/missing values/empty whatever. I ran a cProfile on this and it was quite slow. I think the bottleneck is here: https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L341. I know that NaN/NA/None/empty is a mess in general, but it's a fact of life in my line of work (insurance modeling). I'm wondering if we could scope out exactly which scenarios we need to control for and use pandas (maybe?) to do this more elegantly. I'm not sure of the scope, as there are far more players here than me. (A rough sketch of what I mean is at the bottom of this issue.)
- Reading from on-disk data stores: Memory is something of a problem for me; a typical model that I run might eat up 10+ GB of RAM. It works, but is obviously not ideal. As far as (relatively) mature tools go, I've found bcolz's ctables to be pretty good (and fast). HDFStores/dask would be nice too. I'm not sure if xarray support for categorical data #91 relates to this (I don't know how well xarray works as an on-disk storage/data tool).
- Partial predictions: This is partly a statsmodels idea, partly a patsy idea... I'd like the ability to do a so-called partial predict. Essentially I have a model like `y ~ a + b + a:c`. I want to come up with predictions for `y` assuming that just `a` changes or just `b` changes. I think the process would look something like (assuming we're talking about changing only `a`): create a new design matrix with every unique value of `a` as a separate row, hold `b` and `c` constant at their most frequent (or some other innocuous) value for those rows, then feed this dataset through the statsmodels `predict` routine. This is very helpful for GLMs with the log link, which is the bulk of what I work with. (There's a rough sketch of this at the bottom of the issue.)
- Categorical grouping: Suppose I have categories A, B, C, D, and E. The categories aren't really sortable in any logical way, but some could be grouped. Has there been any thought on how (or if) to allow this? (The sketch at the bottom shows the workaround I use today.)
- Weights: For methods like `standardize`, it may make sense to weight the observations. (Really only applicable if you have really skewed data where certain values are more prevalent on higher-weighted records. A sketch of what I mean is below.)
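
Some rough sketches to make the requests above more concrete.

For the categorical NA point, this is the kind of one-pass scan I had in mind. It is not patsy's actual code path; the `per_element` baseline and the sample data are made up purely to contrast an element-by-element check with a single vectorized `pandas.isna` call:

```python
import numpy as np
import pandas as pd

# Object column mixing strings with NA-like values, as my categoricals
# usually do (sample data made up for illustration).
pool = np.array(["A", "B", "C", None, np.nan], dtype=object)
values = np.random.choice(pool, size=1_000_000)

def per_element(vals):
    # Element-by-element check, similar in spirit to testing each value
    # against a list of NA-like things; slow on long object columns.
    return np.array([v is None or (isinstance(v, float) and np.isnan(v))
                     for v in vals])

def vectorized(vals):
    # One pass with pandas; catches None and float NaN in a single call.
    return pd.isna(vals)

assert (per_element(values) == vectorized(values)).all()
```

I haven't checked that a single `pd.isna` pass covers every convention patsy currently supports (e.g. distinguishing None from NaN in `NA_types`), so treat this as a starting point rather than a drop-in replacement.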
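
For partial predictions, this is roughly the process described above, written out. `partial_predict`, the column names, and the usage comment are hypothetical; the only behaviour it leans on is that a formula-built statsmodels results object re-applies the patsy design to a new DataFrame inside `predict`:

```python
import pandas as pd

def partial_predict(result, data, vary, formula_vars):
    """Predict while letting only `vary` change.

    Builds a frame with one row per unique value of `vary`, holds every
    other variable used in the formula at its most frequent value, and
    feeds the frame through the fitted model's predict().
    """
    grid = pd.DataFrame({vary: data[vary].dropna().unique()})
    for col in formula_vars:
        if col != vary:
            grid[col] = data[col].mode().iloc[0]  # most frequent value
    grid["prediction"] = result.predict(grid)
    return grid

# Hypothetical usage with the formula above (df has columns y, a, b, c):
#   import statsmodels.formula.api as smf
#   model = smf.glm("y ~ a + b + a:c", data=df, family=...).fit()
#   partial_predict(model, df, vary="a", formula_vars=["a", "b", "c"])
```

The open question for me is whether something like this belongs in patsy (it is really just building a small design frame) or on the statsmodels side.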
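
For categorical grouping, what I do today is collapse levels with a plain mapping before the formula ever sees the column. The mapping below is made up; the point is only the collapse-then-model pattern:

```python
import pandas as pd

# Hypothetical grouping of the levels A-E into coarser buckets.
groups = {"A": "AB", "B": "AB", "C": "C", "D": "DE", "E": "DE"}

df = pd.DataFrame({"cat": ["A", "B", "C", "D", "E", "B"]})
df["cat_grouped"] = df["cat"].map(groups)

# The formula then references the collapsed column, e.g. "y ~ C(cat_grouped)".
```

Since formulas can call arbitrary Python functions, a mapping function in the namespace can also be called inline; my question is whether first-class support for level grouping makes sense.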
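
For weights, this is the sort of thing I mean for `standardize`. The function name and the frequency-weight convention are just illustrative:

```python
import numpy as np

def weighted_standardize(x, weights, ddof=0):
    """Center and rescale x using a weighted mean and variance; a sketch
    of what a weight-aware standardize() might compute."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = np.average(x, weights=w)
    # Frequency-weight style variance; other weighting conventions would
    # need a different denominator.
    var = np.average((x - mean) ** 2, weights=w) * w.sum() / (w.sum() - ddof)
    return (x - mean) / np.sqrt(var)
```

To be usable at prediction time this would presumably need to memorize the weighted mean/scale the way patsy's existing stateful transforms do; the arithmetic above is the part I care about.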