groupby.transform inconsistent behavior when grouping by columns containing NaN #10923

Closed
@briangerke

This is similar to #9697, which was fixed in 0.16.1. I give a (very) slightly modified example here to show some related behavior which is at least inconsistent and should probably be handled cleanly.

It's not entirely clear to me what the desired behavior is in this case; it's possible that transform should not work here at all, since it spits out unexpected values. But at minimum it seems like it should do the same thing no matter how I invoke it below.

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]})
# Let's try grouping on 'col2', which contains a NaN.

# Works and gives arguably reasonable results, with one unpredictable value
df.groupby('col2').transform(sum)['col1']

# Throws an unhelpful error
df.groupby('col2')['col1'].transform(sum)

The error is similar to the one encountered in #9697:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-2d4b83df6487> in <module>()
----> 1 df.groupby('col2')['col1'].transform(sum)

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
   2442         cyfunc = _intercept_cython(func)
   2443         if cyfunc and not args and not kwargs:
-> 2444             return self._transform_fast(cyfunc)
   2445 
   2446         # reg transform

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _transform_fast(self, func)
   2488             values = self._try_cast(values, self._selected_obj)
   2489 
-> 2490         return self._set_result_index_ordered(Series(values))
   2491 
   2492     def filter(self, func, dropna=True, *args, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _set_result_index_ordered(self, result)
    503             result = result.sort_index()
    504 
--> 505         result.index = self.obj.index
    506         return result
    507 

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __setattr__(self, name, value)
   2159         try:
   2160             object.__getattribute__(self, name)
-> 2161             return object.__setattr__(self, name, value)
   2162         except AttributeError:
   2163             pass

/usr/local/lib/python2.7/dist-packages/pandas/lib.so in pandas.lib.AxisProperty.__set__ (pandas/lib.c:42548)()

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in _set_axis(self, axis, labels, fastpath)
    273         object.__setattr__(self, '_index', labels)
    274         if not fastpath:
--> 275             self._data.set_axis(axis, labels)
    276 
    277     def _set_subtyp(self, is_all_dates):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in set_axis(self, axis, new_labels)
   2217         if new_len != old_len:
   2218             raise ValueError('Length mismatch: Expected axis has %d elements, '
-> 2219                              'new values have %d elements' % (old_len, new_len))
   2220 
   2221         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements
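
For anyone hitting this in the meantime, here is a minimal workaround sketch. It is not a fix for the underlying bug. It assumes the desired result is the per-group sum for rows with a valid key and NaN for the row whose key is NaN, which is my assumption and not a statement of intended pandas behavior. It avoids transform entirely by aggregating and then mapping the sums back onto the key column. The names group_sums and transformed are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]})

# Only three of the four rows have a non-NaN key. This is presumably where
# the "3 elements" vs "4 elements" in the length-mismatch error comes from.
n_keyed_rows = int(df['col2'].notna().sum())

# Aggregate per group. groupby drops NaN keys by default, so the result
# has one entry per non-NaN key.
group_sums = df.groupby('col2')['col1'].sum()

# Map each row's key back to its group sum. The NaN key matches no group,
# so that row comes out as NaN (assumed here to be the desired behavior).
transformed = df['col2'].map(group_sums)
print(transformed)
```

The result has the same length and index as df, so it can be assigned back as a new column without the length mismatch.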

Labels: Bug, Groupby, Missing-data
