Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pivot_table with margins=True fails for categorical dtype, #10989 #10993

Closed
wants to merge 2 commits into from

Conversation

jakevdp
Copy link
Contributor

@jakevdp jakevdp commented Sep 4, 2015

This is a fix for the issue reported in #10989. I suspect this is an example of "fixing the symptom" rather than "fixing the problem", but I think it makes clear what the source of the problem is: to compute margins, the pivot table must add a row and/or column to the result. If the index or column is categorical, a new value cannot be added.

Let me know if you think there are better approaches to this.

# here we'll convert all categorical indices to object
def convert_categorical(ind):
_convert = lambda ind: (ind.astype('object')
if ind.dtype.name == 'category' else ind)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not quite right, this should end of being a cat level.

can you add the tests case and i'll take a look. thx.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case is in the linked issue. Or would you like me to add a unit test to the package as part of this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too specific
need to be more general and/or pushed down into the index itself

@jakevdp
Copy link
Contributor Author

jakevdp commented Sep 4, 2015

I should add one thing: I decided that rather than adding a new item to the category, it would be cleaner to simply change categories to objects rather than trying to track down all the corner cases of hierarchical indices with categorical levels.

@jakevdp
Copy link
Contributor Author

jakevdp commented Sep 4, 2015

A much cleaner solution, IMO, would be to add a utility function that does something along the lines of "add a new entry to this index, even if it requires a new category". I imagine there are other places in the package where this sort of thing might happen, and such a utility could be used there as well.

@TomAugspurger
Copy link
Contributor

I think we should return a regular Index with object dtypes. What happens if the user has a category called 'All'? I suspect that any fix involving categories will break/be fragile (just a hunch, haven't tried).

@jakevdp
Copy link
Contributor Author

jakevdp commented Sep 4, 2015

@TomAugspurger – I checked – it turns out this is another bug in the current codebase, even if you're not using categories! If one of the index entries is called All, the attempt to compute margins will overwrite it:

In [19]: data = pd.DataFrame({'x': np.arange(99),
                     'y': np.arange(99) // 50,
                     'z': np.arange(99) % 3})

In [20]: data.z = np.array(['Any', 'All', 'None'])[data.z]

In [21]: data.pivot_table('x', 'y', 'z')
Out[21]: 
z   All   Any  None
y                  
0  25.0  24.0  24.5
1  74.5  73.5  74.0

In [22]: data.pivot_table('x', 'y', 'z', margins=True)
Out[22]: 
z     All   Any  None
y                    
0    24.5  24.0  24.5
1    74.0  73.5  74.0
All  49.0  48.0  50.0

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Categorical Categorical Data Type labels Sep 5, 2015
@jreback jreback changed the title BUG: quick fix for #10989 BUG: pivot_table with margins=True fails for categorical dtype, #10989 Sep 10, 2015
@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

@jakevdp can you update according to comments

@jreback jreback closed this Oct 18, 2015
@jakevdp
Copy link
Contributor Author

jakevdp commented Oct 18, 2015

I think I've already addressed all comments above – any others you have in mind? Any reason you closed this without merging?

@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

I thought I put the comments before

this needs a more general soln as it too much if/then on type determination

@jreback jreback reopened this Oct 18, 2015
@jakevdp
Copy link
Contributor Author

jakevdp commented Oct 18, 2015

I guess I'm not entirely clear about what you're wanting as a "more general" solution. Any specific ideas?

@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

So there is something more going on here; this bug report is a sympton of a different issue. Namely,
that we allow a Categorical as a level of a MultiIndex. But in fact we should simply convert them directly to an object dtype when creating the multi-index in the first place; these are de-facto the same.

In [4]: data = pd.DataFrame({'x': np.arange(8),'y': Series(np.arange(8) // 4).astype('category'),'z': Series(np.arange(8) % 2).astype('category')})

In [5]: data
Out[5]: 
   x  y  z
0  0  0  0
1  1  0  1
2  2  0  0
3  3  0  1
4  4  1  0
5  5  1  1
6  6  1  0
7  7  1  1

In [6]: data.dtypes
Out[6]: 
x       int64
y    category
z    category
dtype: object

In [7]: data.groupby(['y','z']).agg('mean')
Out[7]: 
     x
y z   
0 0  1
  1  2
1 0  5
  1  6
In [7]: data.groupby(['y','z']).agg('mean')
Out[7]: 
     x
y z   
0 0  1
  1  2
1 0  5
  1  6

In [8]: data.groupby(['y','z']).agg('mean').index.levels[0]
Out[8]: CategoricalIndex([0, 1], categories=[0, 1], ordered=False, name=u'y', dtype='category')

In [9]: data.groupby(['y','z']).agg('mean').index.levels[1]
Out[9]: CategoricalIndex([0, 1], categories=[0, 1], ordered=False, name=u'z', dtype='category')

In [10]: data.groupby(['y','z']).agg('mean').index.values   
Out[10]: array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=object)

So this is the same as the index in [10]

In [15]: idx = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)],names=['y','z'])

In [16]: idx
Out[16]: 
MultiIndex(levels=[[0, 1], [0, 1]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'y', u'z'])

So I think that MultiIndex creation should coerce Categoricals on construction. This can be done in the MultiIndex.__init__ (keep all of these existing test), prob need a couple more.

@jreback jreback added this to the 0.17.1 milestone Oct 18, 2015
@jakevdp
Copy link
Contributor Author

jakevdp commented Oct 18, 2015

I suppose we should close this PR then, and leave the issue open. Hacking into the internals of MultiIndex is well beyond my level of comfort with the library.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

haha. well, I'll take your tests in any event. So going to leave open for a bit.

@jakevdp
Copy link
Contributor Author

jakevdp commented Oct 19, 2015

Sounds good – thanks!

jreback added a commit that referenced this pull request Oct 20, 2015
BUG: pivot table bug with Categorical indexes, #10993
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants