Fix: Add compat code for pd.Categorical in pandas>=0.15 #47

Closed
wants to merge 1 commit into from

Conversation

jankatins

pandas renamed pd.Categorical.labels to pd.Categorical.codes. It's
also now possible to have Categoricals as blocks, so Series can contain
Categoricals.

@jankatins
Author

@josef-pkt this was the only place I found in statsmodels and patsy where pd.Categorical.labels was used. Let's see how this works...

@josef-pkt

@JanSchulz The test error for statsmodels is in grouputils of statsmodels
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/grouputils.py#L369
which is supposed to be our new/future code for general groups handling, in panel data and similar

@njsmith
Member

njsmith commented Aug 14, 2014

Thanks!

  • There's a merge conflict, not sure why.
  • commas should always be followed by a space
  • I had the impression that in 0.15 it will be possible for Series objects to contain categorical data, while this special case path currently only checks for objects where isinstance(obj, Categorical). Are further changes needed?

@jankatins
Author

rebased, fixed comma and repushed

@coveralls

Coverage Status

Coverage remained the same when pulling 11e2588 on JanSchulz:cat_fixes into 5738c55 on pydata:master.

@jankatins
Author

There are probably other places where it would be good to check for categoricals, but where this is not checked, a Series of type categorical should behave like a string/int/... Series, apart from sorting behaviour, min/max on unordered cats, and attempts to change values to something which is not in levels.

I'll add this to the categorical_to_int function:

[...]
    if getattr(data, "dtype", None) and data.dtype == "category":
        # This is not a Categorical, but a Series(Categorical(...)) in pandas >= 0.15
        data_levels_tuple = tuple(data.cat.levels)
        if not data_levels_tuple == levels:
            raise PatsyError("mismatching levels: expected %r, got %r"
                             % (levels, data_levels_tuple), origin)
        # not sure yet if the boxing is needed or if Series.cat.codes should be a Series
        return pandas.Series(data.cat.codes, index=data.index)
[...]

@njsmith
Member

njsmith commented Aug 14, 2014

Okay, right -- a categorical containing ints should not be treated like
an array of ints (the latter is treated as numeric).

That change can't possibly be correct because you'll need to touch the
categorical sniffer too :-)

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@jankatins
Author

BTW: this is used in Categorical to get ints for levels:

def _get_codes_for_values(values, levels):
    """
    utility routine to turn values into codes given the specified levels
    """

    from pandas.core.algorithms import _get_data_algo, _hashtables
    if values.dtype != levels.dtype:
        values = com._ensure_object(values)
        levels = com._ensure_object(levels)
    (hash_klass, vec_klass), vals = _get_data_algo(values, _hashtables)
    t = hash_klass(len(levels))
    t.map_locations(com._values_from_object(levels))
    return com._ensure_platform_int(t.lookup(values))

I'm not sure how many patsy users do not use pandas...

@njsmith
Member

njsmith commented Aug 14, 2014

(The travis build failure is just my current battles with travis and python 3.2, see mailing list, nothing to do with the patch.)

@jankatins
Author

current patch, just waiting on the update whether to box the codes or not

diff --git a/patsy/categorical.py b/patsy/categorical.py
index ae56de4..b47c867 100644
--- a/patsy/categorical.py
+++ b/patsy/categorical.py
@@ -173,6 +173,11 @@ class CategoricalSniffer(object):
             # second-guess it.
             self._levels = tuple(data.levels)
             return True
+        if hasattr(data, "dtype") and data.dtype = "category":
+            # A Series(Categorical(...)) in pandas >= 0.15
+            self._levels = tuple(data.cat.levels)
+            return True
+
         if isinstance(data, _CategoricalBox):
             if data.levels is not None:
                 self._levels = tuple(data.levels)
@@ -298,6 +303,13 @@ def categorical_to_int(data, levels, NA_action, origin=None
         # Compat code for the labels -> codes change in pandas 0.15
         # FIXME: Remove when we don't want to support pandas < 0.15
         return getattr(data, 'codes', data.labels)
+    if hasattr(data, "dtype") and data.dtype == "category":
+        # This is Series(Categorical(...)) in pandas >= 0.15
+        data_levels_tuple = tuple(data.cat.levels)
+        if not data_levels_tuple == levels:
+            raise PatsyError("mismatching levels: expected %r, got %r"
+                             % (levels, data_levels_tuple), origin)
+        return pandas.Series(data.cat.codes, index=data.index)
     if isinstance(data, _CategoricalBox):
         if data.levels is not None and tuple(data.levels) != levels:
             raise PatsyError("mismatching levels: expected %r, got %r"

@njsmith
Member

njsmith commented Aug 14, 2014

NB you have an = where you mean ==.

Is data.cat a Categorical object? If so then I think it would reduce code
duplication to just do something like (right before the check for
Categorical objects):

if getattr(data, "dtype", None) == "category":
    # A pandas Series containing a Categorical; unpack the Categorical and
    # fall through.
    data = data.cat

@jankatins
Author

Nope, data.cat is the accessor (only shows the public API), but data.values is the Categorical... Will change and repush...
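
The distinction between the Series, its accessor, and the underlying Categorical can be checked with a small sketch. It assumes a reasonably recent pandas, where the attribute called `levels` in this thread is named `categories`; the sample values are made up for illustration.

```python
import pandas as pd

# A Series holding categorical data (illustrative values).
s = pd.Series(["a", "b", "a"], dtype="category")

# The Series itself is not a Categorical, so an isinstance check on it fails.
print(isinstance(s, pd.Categorical))

# .values unwraps the underlying Categorical object.
print(isinstance(s.values, pd.Categorical))

# .cat is an accessor that exposes the public categorical API.
print(list(s.cat.categories), list(s.cat.codes))
```

This is why a special case that only tests `isinstance(obj, Categorical)` never sees a categorical Series unless the Series is unwrapped first.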

@jankatins
Author

I added the code, but I kept the current way to check, as it basically ended up with the same number of checks.

@jankatins jankatins changed the title from "Fix: Add compat code for pd.Categorical.labels -> .codes" to "Fix: Add compat code for pd.Categorical in pandas>=0.15" Aug 14, 2014
@@ -173,6 +173,10 @@ def sniff(self, data):
             # second-guess it.
             self._levels = tuple(data.levels)
             return True
+        if hasattr(data, "dtype") and data.dtype == "category":

@JanSchulz you might want to define something like: is_pandas_cat_support = LooseVersion(pd.__version__) >= '0.15.0'

Author

patsy (and statsmodels) do not unconditionally import pandas, so this would need two additional checks (has_pandas and the version check), and then still a check whether the data has a dtype and whether that dtype is categorical...
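
A minimal sketch of what such a flag could look like when pandas is optional. The names `parse_version`, `have_pandas` and `have_pandas_categorical_dtype` are hypothetical and chosen for illustration only; this is not code from the PR.

```python
def parse_version(version):
    """Turn '0.15.2' into (0, 15, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


try:
    import pandas
except ImportError:
    # pandas is optional, so its absence must not be an error.
    have_pandas = False
    have_pandas_categorical_dtype = False
else:
    have_pandas = True
    have_pandas_categorical_dtype = parse_version(pandas.__version__) >= (0, 15)
```

Even with such a flag, each call site would still need to look at the dtype of the data at hand, which is the extra cost described above.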

@jankatins
Author

redone...

            return data.codes
        else:
            return data.labels
    if hasattr(data, "dtype") and data.dtype == "category":

return getattr(data,'codes',data.labels)

a bit more concise

Author

But that will fail when data.labels is not there (so in current pandas master and after the deprecation period), as data.labels is evaluated before getattr is called.
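
The evaluation-order point can be reproduced without pandas. In this sketch `OldStyle` and `NewStyle` are hypothetical stand-ins for a Categorical before and after the rename, and `get_codes` / `get_codes_eager` are illustrative helpers rather than patsy's actual code.

```python
class OldStyle:
    """Hypothetical stand-in for a pre-0.15 Categorical: only has .labels."""
    labels = [0, 1, 0]


class NewStyle:
    """Hypothetical stand-in for a Categorical once .labels is gone."""
    codes = [0, 1, 0]


def get_codes(data):
    # Explicit check: .labels is only touched when .codes is missing.
    if hasattr(data, "codes"):
        return data.codes
    return data.labels


def get_codes_eager(data):
    # The default argument is evaluated before getattr() runs, so
    # data.labels is looked up even when data.codes exists.
    return getattr(data, "codes", data.labels)


print(get_codes(NewStyle()), get_codes(OldStyle()))
try:
    get_codes_eager(NewStyle())
except AttributeError as exc:
    print("eager default failed:", exc)
```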

Author

I think the only alternative is

try:
    return data.codes
except AttributeError:
    return data.labels

[That's the variant which is not in statsmodels]

But I like explicit checks and not the try-except variants.

Hmm, you're right.
There's nothing special about getattr with respect to short-circuiting.

@hmgaudecker

I just got bitten by this after returning to an older project, having updated Pandas in the meantime.

Any chance to get this merged soonish? Is it possible to help?

@kousu

kousu commented Feb 3, 2015

So just to be clear, I'm not supposed to be getting this "datatype not understood" exception, am I? It only happens when I cast things to pandas.Categorical

code.py

import pandas
import scipy.linalg; import scipy.lib.lapack.calc_lwork; scipy.linalg.calc_lwork = scipy.lib.lapack.calc_lwork #version creep bug: https://github.com/statsmodels/statsmodels/issues/2191
import statsmodels.api as sm #<-- this is the mark of a badly designed API: when you have a special subpackage just for your API

D = pandas.read_csv("surgery.csv", skipinitialspace=True)

def factor(series):
  "coerce series to a categorical variable in one line"
  "meant to be like R's factor() command"
  series = series.astype("category")
  series.cat.set_categories(set(series), inplace=True)
  # ^ from http://pandas-docs.github.io/pandas-docs-travis/categorical.html#getting-data-in-out
  return series

for C in ["treated", "risk"]:
  D[C] = factor(D[C])

With this setup, I either get:

In [10]: M = sm.GLM.from_formula("died ~ C(treated)", D)
---------------------------------------------------------------------------
PatsyError                                Traceback (most recent call last)
<ipython-input-10-27c4b7a47531> in <module>()
----> 1 M = sm.GLM.from_formula("died ~ C(treated)", D)

/usr/lib/python3.4/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, *args, **kwargs)
    145         (endog, exog), missing_idx = handle_formula_data(data, None, formula,
    146                                                          depth=eval_env,
--> 147                                                          missing=missing)
    148         kwargs.update({'missing_idx': missing_idx,
    149                        'missing': missing})

/usr/lib/python3.4/site-packages/statsmodels/formula/formulatools.py in handle_formula_data(Y, X, formula, depth, missing)
     63         if data_util._is_using_pandas(Y, None):
     64             result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65                                NA_action=na_action)
     66         else:
     67             result = dmatrices(formula, Y, depth, return_type='dataframe',

/usr/lib/python3.4/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    295     eval_env = EvalEnvironment.capture(eval_env, reference=1)
    296     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 297                                       NA_action, return_type)
    298     if lhs.shape[1] == 0:
    299         raise PatsyError("model is missing required outcome variables")

/usr/lib/python3.4/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    150         return iter([data])
    151     builders = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 152                                   NA_action)
    153     if builders is not None:
    154         return build_design_matrices(builders, data,

/usr/lib/python3.4/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     55                                        formula_like.rhs_termlist],
     56                                       data_iter_maker,
---> 57                                       NA_action)
     58     else:
     59         return None

/usr/lib/python3.4/site-packages/patsy/build.py in design_matrix_builders(termlists, data_iter_maker, NA_action)
    658                                                    factor_states,
    659                                                    data_iter_maker,
--> 660                                                    NA_action)
    661     # Now we need the factor evaluators, which encapsulate the knowledge of
    662     # how to turn any given factor into a chunk of data:

/usr/lib/python3.4/site-packages/patsy/build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
    422     for data in data_iter_maker():
    423         for factor in list(examine_needed):
--> 424             value = factor.eval(factor_states[factor], data)
    425             if factor in cat_sniffers or guess_categorical(value):
    426                 if factor not in cat_sniffers:

/usr/lib/python3.4/site-packages/patsy/eval.py in eval(self, memorize_state, data)
    483     #    http://nedbatchelder.com/blog/200711/rethrowing_exceptions_in_python.html
    484     def eval(self, memorize_state, data):
--> 485         return self._eval(memorize_state["eval_code"], memorize_state, data)
    486 
    487 def test_EvalFactor_basics():

/usr/lib/python3.4/site-packages/patsy/eval.py in _eval(self, code, memorize_state, data)
    466                                  self,
    467                                  self._eval_env.eval,
--> 468                                  code, inner_namespace=inner_namespace)
    469 
    470     def memorize_chunk(self, state, which_pass, data):

/usr/lib/python3.4/site-packages/patsy/compat.py in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
    122                                  origin)
    123             # Use 'exec' to hide this syntax from the Python 2 parser:
--> 124             exec("raise new_exc from e")
    125         else:
    126             # In python 2, we just let the original exception escape -- better

/usr/lib/python3.4/site-packages/patsy/compat.py in <module>()

PatsyError: Error evaluating factor: TypeError: 'str' object is not callable
    died ~ C(treated)

or

In [11]: M = sm.GLM.from_formula("died ~ treated", D)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-ba4acccf7be6> in <module>()
----> 1 M = sm.GLM.from_formula("died ~ treated", D)

/usr/lib/python3.4/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, *args, **kwargs)
    145         (endog, exog), missing_idx = handle_formula_data(data, None, formula,
    146                                                          depth=eval_env,
--> 147                                                          missing=missing)
    148         kwargs.update({'missing_idx': missing_idx,
    149                        'missing': missing})

/usr/lib/python3.4/site-packages/statsmodels/formula/formulatools.py in handle_formula_data(Y, X, formula, depth, missing)
     63         if data_util._is_using_pandas(Y, None):
     64             result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65                                NA_action=na_action)
     66         else:
     67             result = dmatrices(formula, Y, depth, return_type='dataframe',

/usr/lib/python3.4/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    295     eval_env = EvalEnvironment.capture(eval_env, reference=1)
    296     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 297                                       NA_action, return_type)
    298     if lhs.shape[1] == 0:
    299         raise PatsyError("model is missing required outcome variables")

/usr/lib/python3.4/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    150         return iter([data])
    151     builders = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 152                                   NA_action)
    153     if builders is not None:
    154         return build_design_matrices(builders, data,

/usr/lib/python3.4/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     55                                        formula_like.rhs_termlist],
     56                                       data_iter_maker,
---> 57                                       NA_action)
     58     else:
     59         return None

/usr/lib/python3.4/site-packages/patsy/build.py in design_matrix_builders(termlists, data_iter_maker, NA_action)
    658                                                    factor_states,
    659                                                    data_iter_maker,
--> 660                                                    NA_action)
    661     # Now we need the factor evaluators, which encapsulate the knowledge of
    662     # how to turn any given factor into a chunk of data:

/usr/lib/python3.4/site-packages/patsy/build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
    427                     cat_sniffers[factor] = CategoricalSniffer(NA_action,
    428                                                               factor.origin)
--> 429                 done = cat_sniffers[factor].sniff(value)
    430                 if done:
    431                     examine_needed.remove(factor)

/usr/lib/python3.4/site-packages/patsy/categorical.py in sniff(self, data)
    169         # fastpath to avoid doing an item-by-item iteration over boolean
    170         # arrays, as requested by #44
--> 171         if hasattr(data, "dtype") and np.issubdtype(data.dtype, np.bool_):
    172             self._level_set = set([True, False])
    173             return True

/usr/lib/python3.4/site-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
    761     """
    762     if issubclass_(arg2, generic):
--> 763         return issubclass(dtype(arg1).type, arg2)
    764     mro = dtype(arg2).type.mro()
    765     if len(mro) > 1:

TypeError: data type not understood

surgery.csv:

"survived", "died", "treated", "risk"
12, 3,                "control", "low"
10, 2,                "treated", "low"
 6, 7,                "control", "medium"
10, 3,                "treated", "medium"
 3, 6,                "control", "high"
 8, 2,                "treated", "high"

Versions:

In [16]: patsy.version.__version__
Out[16]: '0.3.0'

In [24]: pandas.version.version
Out[24]: '0.15.2'

In [27]: sm.version.version
Out[27]: '0.6.1'

@jankatins
Author

This probably needs some more work: labels -> codes and levels -> categories...

Should I redo the PR or what's the status?

@njsmith
Member

njsmith commented Mar 3, 2015

Sorry I lost track of this! (Just moved from Edinburgh to Berkeley and have been dropping a lot of balls :-(.)

At this point I am completely lost and confused regarding pandas's categorical changes. @JanSchulz: do you know what to do? Please send help!

pandas renamed pd.Categorical.labels to pd.Categorical.codes and
pd.Categorical.levels to pd.Categorical.categories. The 'codes and
levels' constructor was also removed in favour of the 'values and
categories' one, so use Categorical.from_codes(...) instead.

It's also now possible to have Categoricals as blocks, so Series
can contain Categoricals. Unfortunately numpy dtypes do not compare
to 'category', so use the pandas function 'is_categorical_dtype'
for this.

Added FIXMEs to all sections which can be removed once patsy
only supports pandas >= 0.15.
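
A rough sketch of the two ideas in this commit message: build a Categorical from codes, and detect categorical data before any dtype is handed to NumPy. The helper names `is_categorical` and `is_bool_array` are hypothetical. The commit message names `is_categorical_dtype`; this sketch uses an `isinstance` check against `pd.CategoricalDtype` instead, which assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd


def is_categorical(data):
    """Return True for a Categorical or a Series of category dtype."""
    dtype = getattr(data, "dtype", None)
    return isinstance(dtype, pd.CategoricalDtype)


def is_bool_array(data):
    """Boolean fastpath check that never hands a pandas dtype to NumPy."""
    if is_categorical(data):
        return False
    dtype = getattr(data, "dtype", None)
    return isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.bool_)


# Build a Categorical from integer codes plus the list of categories.
cat = pd.Categorical.from_codes([0, 1, 0], categories=["control", "treated"])
series = pd.Series(cat)

print(is_categorical(cat), is_categorical(series))
print(is_bool_array(series), is_bool_array(np.array([True, False])))
```

Checking for the categorical case first sidesteps the "data type not understood" error reported earlier in the thread, since `np.issubdtype` is only called on genuine NumPy dtypes.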
@jankatins
Author

Updated, fixed all nosetest errors in the file, let's see if travis sees some more...

@jankatins
Author

The above error is probably still present :-(

@coveralls

Coverage Status

Changes Unknown when pulling e636ef0 on JanSchulz:cat_fixes into * on pydata:master*.

@jankatins
Author

I opened an issue in the pandas repo regarding the above problem, using this example:

In[16]: s = pd.Series([1,2,3,1,2,3]).astype("category")
In[18]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
  File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-18-607a91e2a828>", line 1, in <module>
    np.issubdtype(s.dtype, np.bool_)
  File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood

I haven't looked into the first problem (python3)

@njsmith
Member

njsmith commented Mar 4, 2015

Those interested in this PR should check out #59, which I think (hope?) replaces it.

@jankatins
Author

close, as #59 replaces this?

@jankatins
Author

Closing, as this is in #59

@jankatins jankatins closed this Mar 21, 2015