BUG: GroupBy.get_group raises ValueError when group key contains NaT #6996

sinhrks · 2014-04-28T23:44:28Z

Closes #6992. Made GroupBy.get_group works even if the key contains NaT.

NOTE: One issue is that GroupBy.groups returns incorrect key in numpy 1.6. This seems to be caused by _convert_grouper uses grouper.reindex(axis).values to return value. This looks doesn't affect to main functionalities, but is there any expected result?

import pandas as pd
import numpy as np
>>> np.__version__
1.6.2

>>> df = pd.DataFrame({'values': np.random.randn(8), 
                   'dt': [np.nan, pd.Timestamp('2013-01-01'), np.nan, pd.Timestamp('2013-02-01'),
                          np.nan, pd.Timestamp('2013-02-01'), np.nan, pd.Timestamp('2013-01-01')]})
>>> grouped = df.groupby('dt')

>>> grouped.groups
{1970-01-16 48:00:00: [1, 7], 1970-01-16 24:00:00: [3, 5]}

jreback · 2014-04-28T23:57:29Z

pandas/core/groupby.py

@@ -1809,7 +1810,13 @@ def _make_labels(self):
    @property
    def groups(self):
        if self._groups is None:
-            self._groups = self.index.groupby(self.grouper)
+            if isinstance(self.grouper, DatetimeIndex):


I think better to add an argument to Index.groupby and DatetimeIndex.groupby, maybe exclude_missing=False as the default (and you will pass True here), then you can leave groupby alone and do the exlusions inside the index?

OK, I've changed the logic and would like to ask about keyword.

Considering #3729, I preferred to dropna as a keyword, and now it uses dropna. But looks better to use exclude_missing because pivot_table already has dropna for different purpose (to exclude index/columns from result which values are all NaN).

Is it OK to add exclude_missing to Dataframe.groupby and pivot_table as new keyword?

Also, I think default should be dropna=True based on current behavior which exclude NaT as nan.

jreback · 2014-05-02T23:41:44Z

you can't have nan in the int groupby nor datetime

the cython routines are very strict about types

so u don't need to check for a non existsnt type

furthermore the datetimes won't be passed directly but as a i8 dtype

sinhrks · 2014-05-02T23:57:45Z

I understand nan should not be included in values of such indexes, but can be passed as to_groupby argument.

Existing cython logic doesn't care about index instance, but check nan for the argument. I didn't change the core logic, and added to Nat handling condition.

jreback · 2014-05-24T15:34:35Z

@sinhrks haven't really looked at this, but does this solve #7227 as well?

sinhrks · 2014-05-24T21:51:55Z

@jreback Unfortunately No. #7227 seems to be caused by TimeGrouper, and the fix is only for normal groupby. I'll take a deeper look later.

cpcloud · 2014-05-25T01:49:25Z

pandas/algos.pyx

@@ -1924,7 +1925,7 @@ def groupby_indices(ndarray values):
        k = labels[i]

        # was NaN
-        if k == -1:
+        if k == -1 or k is tslib.NaT:


do something like

cdef Py_ssize_t NaT = tslib.NaT # ... k == NaT

no need for any overhead of attribute access.

sinhrks · 2014-06-05T19:48:37Z

In above example to group by datetime column, labels argument for groupby_template will be DatetimeIndex, not i8. So I understand all functions created by groupby_template should have the logic to exclude NaT

index = [0 1 2 3 4 5 6 7]
labels = <class 'pandas.tseries.index.DatetimeIndex'>
[NaT, ..., 2013-01-01]
Length: 8, Freq: None, Timezone: None

jreback · 2014-06-05T19:58:42Z

hmm...DatetimeIndex should never be passed to a cython function, ONLY i8. can you show me where that is the case? (you can pass as object dtype, but i8 is MUCH faster so always pass that way)

sinhrks · 2014-06-06T11:46:22Z

@jreback The behavior seems to be caused by Grouping.
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1881-1884

grouper is converted to datetime in following lines.
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1831-1838

jreback · 2014-06-06T12:08:53Z

@sinhrks that doesn't seem a problem, its when its actually passed to cython that needs conversion (to and from), can you investigate?

jreback · 2014-06-19T11:45:21Z

@sinhrks can you rebase and address comments?

sinhrks · 2014-06-20T22:54:27Z

@jreback You mean to convert to_groupby (group key) just before passed to cython func, correct?
https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L1330-1331

Conversion looks possible if to_gorupby is datetime, but it also can be an object dtype.
Thus I think cython funcs should handle object in this case?

jreback · 2014-06-20T23:28:59Z

I think u should convert to i8 if possible (and reverse after) - evn though I guess it's handled as object now

jreback · 2014-06-24T13:15:15Z

#7560 does this conversion now (actually these are converted to float then coerced back).

So rebase after I merge that and let's take another look

jreback · 2014-06-26T13:10:29Z

can you show groupby benches after this change..thanks

sinhrks · 2014-06-26T19:02:04Z

Thanks, but #7560 doesn't convert group keys, correct? Thus, _algos.groupby_xxx functions are still receives objects as labels argument even after #7560.

I think dtype should be converted in Index.groupby before calling cython funcs. Is it ok to change Index.groupby only to accept ndarray, or is it categorized as public API?

jreback · 2014-06-26T19:03:44Z

Ahh, yes was thinking of something else.

Index.groupby is ONLY used by groupby IIRC, so its not really public (so ok to change).

sinhrks · 2014-06-26T19:12:57Z

OK, let me try. Maybe there is no side-effect, because currently _algos.groupby_xxx only accepts ndarray (thus the point which error occurrs being changed if something passed other than ndarray).

sinhrks · 2014-06-26T19:56:52Z

@jreback I made it work when group-key has datelike dtypes, is this what you mean?

Have to consider object dtype case and better way to coercing dtype back.

jreback · 2014-07-01T15:41:09Z

can you rebase and revisit this...thxs

jreback · 2014-07-06T13:39:02Z

pushing if you get 2 this in next day or 2 lmk

jreback · 2015-05-09T16:19:10Z

@sinhrks let's do this for 0.17.0

sinhrks · 2015-05-15T19:38:03Z

Yes, fixed to work on the latest master.

jreback · 2015-05-15T19:41:34Z

@sinhrks don't add ANOTHER nat checker, use https://github.com/pydata/pandas/blob/master/pandas/src/inference.pyx#L228

sinhrks · 2015-05-15T22:23:13Z

Thanks, I haven't noticed that.

jreback · 2015-05-18T11:55:38Z

pandas/algos.pyx

@@ -2010,7 +2012,7 @@ def groupby_indices(ndarray values):
        k = labels[i]

        # was NaN
-        if k == -1:
+        if k == -1 or k is NaT:


this doesn't look right. these are indices, not actual objects, right?

Correct, actually no need to change algos.pyx. Fixed.

sinhrks · 2015-05-28T22:05:29Z

@jreback pls check if there is anything I should do?

jreback · 2015-05-28T22:08:29Z

lgtm....go ahead and merged

sinhrks · 2015-05-28T22:17:36Z

Thanks, and sorry I found the doc header alignment is incorrect. Will fix it then merge if nothing else.

BUG: GroupBy.get_group raises ValueError when group key contains NaT

jreback reviewed Apr 28, 2014
View reviewed changes

jreback added Groupby labels Apr 29, 2014

jreback added this to the 0.14.1 milestone May 1, 2014

cpcloud reviewed May 25, 2014
View reviewed changes

jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jul 6, 2014

jreback modified the milestones: 0.17.0, Next Major Release May 9, 2015

sinhrks force-pushed the groupnat branch 5 times, most recently from 1731852 to ca22e12 Compare May 15, 2015 17:49

sinhrks force-pushed the groupnat branch 2 times, most recently from f91a28c to fe8860e Compare May 15, 2015 21:56

sinhrks force-pushed the groupnat branch from fe8860e to 3605738 Compare May 16, 2015 15:38

jreback reviewed May 18, 2015
View reviewed changes

sinhrks force-pushed the groupnat branch 3 times, most recently from c423fcd to 6e79756 Compare May 21, 2015 13:31

BUG: GroupBy.get_group raises ValueError when group key contains NaT

0ad0e0c

sinhrks force-pushed the groupnat branch from 6e79756 to 0ad0e0c Compare May 29, 2015 19:41

sinhrks added a commit that referenced this pull request May 30, 2015

Merge pull request #6996 from sinhrks/groupnat

753d66b

BUG: GroupBy.get_group raises ValueError when group key contains NaT

sinhrks merged commit 753d66b into pandas-dev:master May 30, 2015

sinhrks deleted the groupnat branch May 30, 2015 13:49

jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 2, 2015

Uh oh!

BUG: GroupBy.get_group raises ValueError when group key contains NaT #6996

BUG: GroupBy.get_group raises ValueError when group key contains NaT #6996

Uh oh!

Conversation

sinhrks commented Apr 28, 2014

Uh oh!

jreback Apr 28, 2014

Choose a reason for hiding this comment

Uh oh!

sinhrks May 2, 2014

Choose a reason for hiding this comment

Uh oh!

jreback commented May 2, 2014

Uh oh!

sinhrks commented May 2, 2014

Uh oh!

jreback commented May 24, 2014

Uh oh!

sinhrks commented May 24, 2014

Uh oh!

cpcloud May 25, 2014

Choose a reason for hiding this comment

Uh oh!

sinhrks commented Jun 5, 2014

Uh oh!

jreback commented Jun 5, 2014

Uh oh!

sinhrks commented Jun 6, 2014

Uh oh!

jreback commented Jun 6, 2014

Uh oh!

jreback commented Jun 19, 2014

Uh oh!

sinhrks commented Jun 20, 2014

Uh oh!

jreback commented Jun 20, 2014

Uh oh!

jreback commented Jun 24, 2014

Uh oh!

jreback commented Jun 26, 2014

Uh oh!

sinhrks commented Jun 26, 2014

Uh oh!

jreback commented Jun 26, 2014

Uh oh!

sinhrks commented Jun 26, 2014

Uh oh!

sinhrks commented Jun 26, 2014

Uh oh!

jreback commented Jul 1, 2014

Uh oh!

jreback commented Jul 6, 2014

Uh oh!

jreback commented May 9, 2015

Uh oh!

sinhrks commented May 15, 2015

Uh oh!

jreback commented May 15, 2015

Uh oh!

sinhrks commented May 15, 2015

Uh oh!

jreback May 18, 2015

Choose a reason for hiding this comment

Uh oh!

sinhrks May 18, 2015

Choose a reason for hiding this comment

Uh oh!

sinhrks commented May 28, 2015

Uh oh!

jreback commented May 28, 2015

Uh oh!

sinhrks commented May 28, 2015

Uh oh!

Uh oh!