BUG: pivot_table with margins=True fails for categorical dtype, #10989 #10993

jakevdp · 2015-09-04T20:00:17Z

This is a fix for the issue reported in #10989. I suspect this is an example of "fixing the symptom" rather than "fixing the problem", but I think it makes clear what the source of the problem is: to compute margins, the pivot table must add a row and/or column to the result. If the index or column is categorical, a new value cannot be added.

Let me know if you think there are better approaches to this.
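To make the failure mode described above concrete, here is a minimal sketch (not code from this PR) showing that a categorical container rejects a label outside its declared categories, which is what happens when margins tries to add an 'All' entry. The variable names are illustrative, and the exact exception type is an assumption that has varied across pandas versions, so both are caught.

```python
import pandas as pd

# A categorical whose only allowed values are 'a' and 'b'.
values = pd.Categorical(["a", "b"])

# Assigning a label outside the declared categories is rejected.
try:
    values[0] = "All"
    assignment_failed = False
except (TypeError, ValueError):
    # The exception type has differed between pandas versions.
    assignment_failed = True

# Declaring the category first makes the same assignment legal.
widened = values.add_categories(["All"])
widened[0] = "All"
```

The second half shows the alternative to converting to object: widen the categories before adding the new label.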

jreback · 2015-09-04T20:06:09Z

pandas/tools/pivot.py

+    # here we'll convert all categorical indices to object
+    def convert_categorical(ind):
+        _convert = lambda ind: (ind.astype('object')
+                                if ind.dtype.name == 'category' else ind)


this is not quite right, this should end up being a cat level.

can you add the test case and i'll take a look. thx.

The test case is in the linked issue. Or would you like me to add a unit test to the package as part of this PR?

this is too specific; it needs to be more general and/or pushed down into the index itself

jakevdp · 2015-09-04T20:14:05Z

I should add one thing: I decided that rather than adding a new item to the category, it would be cleaner to simply change categories to objects rather than trying to track down all the corner cases of hierarchical indices with categorical levels.

jakevdp · 2015-09-04T20:38:10Z

A much cleaner solution, IMO, would be to add a utility function that does something along the lines of "add a new entry to this index, even if it requires a new category". I imagine there are other places in the package where this sort of thing might happen, and such a utility could be used there as well.
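A rough sketch of what such a utility might look like for flat (non-hierarchical) indexes follows. The name `append_label` is hypothetical and does not exist in pandas; hierarchical indexes with categorical levels, the corner cases mentioned earlier, are deliberately not handled here.

```python
import pandas as pd


def append_label(index, label):
    """Hypothetical helper: return `index` with `label` appended.

    For a CategoricalIndex the label is added as a new category if needed.
    """
    if isinstance(index, pd.CategoricalIndex):
        categories = list(index.categories)
        if label not in categories:
            categories.append(label)
        return pd.CategoricalIndex(
            list(index) + [label],
            categories=categories,
            ordered=index.ordered,
            name=index.name,
        )
    # Non-categorical indexes can simply be extended.
    return index.append(pd.Index([label], name=index.name))


cat_idx = pd.CategoricalIndex(["a", "b"], name="z")
with_margin = append_label(cat_idx, "All")
```

Rebuilding the index from its values keeps the sketch simple; a real implementation would likely work on the codes directly to avoid the round trip.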

TomAugspurger · 2015-09-04T21:08:36Z

I think we should return a regular Index with object dtypes. What happens if the user has a category called 'All'? I suspect that any fix involving categories will break/be fragile (just a hunch, haven't tried).

jakevdp · 2015-09-04T21:18:17Z

@TomAugspurger – I checked – it turns out this is another bug in the current codebase, even if you're not using categories! If one of the index entries is called All, the attempt to compute margins will overwrite it:

In [19]: data = pd.DataFrame({'x': np.arange(99),
                     'y': np.arange(99) // 50,
                     'z': np.arange(99) % 3})

In [20]: data.z = np.array(['Any', 'All', 'None'])[data.z]

In [21]: data.pivot_table('x', 'y', 'z')
Out[21]: 
z   All   Any  None
y                  
0  25.0  24.0  24.5
1  74.5  73.5  74.0

In [22]: data.pivot_table('x', 'y', 'z', margins=True)
Out[22]: 
z     All   Any  None
y                    
0    24.5  24.0  24.5
1    74.0  73.5  74.0
All  49.0  48.0  50.0
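One way to guard against the overwrite shown above is to check for the collision up front and choose a different margin label. The sketch below is an illustration, not part of this PR: `margins_name_conflicts` is a hypothetical helper, and it relies on the `margins_name` keyword of `pivot_table`, which is assumed to be available in the pandas version in use.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "x": np.arange(99),
    "y": np.arange(99) // 50,
    "z": np.array(["Any", "All", "None"])[np.arange(99) % 3],
})


def margins_name_conflicts(frame, keys, margins_name="All"):
    """Hypothetical check: is `margins_name` already a value in a key column?"""
    return any(margins_name in set(frame[key]) for key in keys)


# True here, because 'All' is one of the values in column 'z'.
conflict = margins_name_conflicts(data, ["y", "z"])

# Use a margin label that does not occur in the data.
table = data.pivot_table(
    values="x", index="y", columns="z",
    margins=True, margins_name="Total",
)
```

With a distinct label, the existing 'All' column keeps its own group means and the margins land in a separate 'Total' row and column.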

jreback · 2015-10-18T14:04:44Z

@jakevdp can you update according to comments

jakevdp · 2015-10-18T16:08:31Z

I think I've already addressed all comments above – any others you have in mind? Any reason you closed this without merging?

jreback · 2015-10-18T16:21:55Z

I thought I put the comments before

this needs a more general solution, as it has too much if/then on type determination

jakevdp · 2015-10-18T16:50:06Z

I guess I'm not entirely clear about what you're wanting as a "more general" solution. Any specific ideas?

jreback · 2015-10-18T22:12:48Z

So there is something more going on here; this bug report is a symptom of a different issue. Namely, we allow a Categorical as a level of a MultiIndex. But in fact we should simply convert them directly to an object dtype when creating the MultiIndex in the first place; these are de facto the same.

In [4]: data = pd.DataFrame({'x': np.arange(8),'y': Series(np.arange(8) // 4).astype('category'),'z': Series(np.arange(8) % 2).astype('category')})

In [5]: data
Out[5]: 
   x  y  z
0  0  0  0
1  1  0  1
2  2  0  0
3  3  0  1
4  4  1  0
5  5  1  1
6  6  1  0
7  7  1  1

In [6]: data.dtypes
Out[6]: 
x       int64
y    category
z    category
dtype: object

In [7]: data.groupby(['y','z']).agg('mean')
Out[7]: 
     x
y z   
0 0  1
  1  2
1 0  5
  1  6

In [8]: data.groupby(['y','z']).agg('mean').index.levels[0]
Out[8]: CategoricalIndex([0, 1], categories=[0, 1], ordered=False, name=u'y', dtype='category')

In [9]: data.groupby(['y','z']).agg('mean').index.levels[1]
Out[9]: CategoricalIndex([0, 1], categories=[0, 1], ordered=False, name=u'z', dtype='category')

In [10]: data.groupby(['y','z']).agg('mean').index.values   
Out[10]: array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=object)

So this is the same as the index in [10]

In [15]: idx = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)],names=['y','z'])

In [16]: idx
Out[16]: 
MultiIndex(levels=[[0, 1], [0, 1]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'y', u'z'])

So I think that MultiIndex creation should coerce Categoricals on construction. This can be done in MultiIndex.__init__ (keep all of the existing tests); it probably needs a couple more.
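As an illustration of the coercion being proposed, here is a sketch that applies it after the fact to an existing MultiIndex rather than inside `MultiIndex.__init__`. The helper name `decategorize_levels` is hypothetical, and this is not the change that was eventually merged.

```python
import pandas as pd


def decategorize_levels(mi):
    """Hypothetical helper: replace categorical levels with object-dtype levels."""
    new_levels = [
        level.astype(object) if isinstance(level, pd.CategoricalIndex) else level
        for level in mi.levels
    ]
    # The codes are untouched, so the tuples the index represents do not change.
    return mi.set_levels(new_levels)


mi = pd.MultiIndex.from_arrays(
    [pd.Categorical([0, 0, 1, 1]), pd.Categorical([0, 1, 0, 1])],
    names=["y", "z"],
)
plain = decategorize_levels(mi)
```

Only the level dtype changes; the values and names of the index stay the same, which is the sense in which the two forms are equivalent.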

jakevdp · 2015-10-18T23:27:17Z

I suppose we should close this PR then, and leave the issue open. Hacking into the internals of MultiIndex is well beyond my level of comfort with the library.

jreback · 2015-10-18T23:33:16Z

haha. well, I'll take your tests in any event. So going to leave open for a bit.

jakevdp · 2015-10-19T15:47:37Z

Sounds good – thanks!

BUG: pivot table bug with Categorical indexes, #10993

BUG: quick fix for pandas-dev#10989

2b04d9f

jreback reviewed Sep 4, 2015
TST: add test case from Issue pandas-dev#10989

74cac0e

jreback added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Categorical (Categorical Data Type) labels Sep 5, 2015

jreback changed the title ~~BUG: quick fix for #10989~~ BUG: pivot_table with margins=True fails for categorical dtype, #10989 Sep 10, 2015

jreback closed this Oct 18, 2015

jreback reopened this Oct 18, 2015

jreback added the MultiIndex label Oct 18, 2015

jreback added this to the 0.17.1 milestone Oct 18, 2015

jreback mentioned this pull request Oct 19, 2015

BUG: pivot table bug with Categorical indexes, #10993 #11371

Merged

jreback closed this in #11371 Oct 20, 2015

jreback added a commit that referenced this pull request Oct 20, 2015

Merge pull request #11371 from jreback/jakevdp-pivot-table-categorical

db884d9

BUG: pivot table bug with Categorical indexes, #10993
