ENH: use size instead of cythonized count for fallback cases #7055

cpcloud · 2014-05-06T14:02:58Z

Use size instead of the cythonized count for the integer case, since integers cannot hold NaN.
This is not faster performance-wise.
Compilation time is shorter because count has no int templates except for dates and times.
Any precision issues (I don't think there were any) with other integer types are gone, since only int64 is compared with dates, whereas lower-precision integers use size.
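
The reasoning above (an integer column cannot hold NaN, so its per-group count equals the group size) can be checked with a small sketch. The frame and column names below are made up for illustration, and the example only uses the public `groupby(...).size()` and `groupby(...).count()` calls, not the internal code path this PR touches.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one lower-precision integer column, one float column with NaN.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "ints": np.array([1, 2, 3, 4, 5], dtype="int32"),
    "floats": [1.0, np.nan, 3.0, np.nan, 5.0],
})

grouped = df.groupby("key")
sizes = grouped.size()    # rows per group, NaN or not
counts = grouped.count()  # non-null values per group, per column

# Integers have no NaN, so count and size agree for that column.
ints_match = bool((counts["ints"] == sizes).all())
# Floats can hold NaN, so count may be smaller than size.
floats_match = bool((counts["floats"] == sizes).all())
```

For the float column the two differ, which is why the substitution is limited to the integer case.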

Anywho, here are the vbench results for this PR vs master:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_int_count                            |   4.0247 |   4.0337 |   0.9978 |
groupby_multi_count                          |   7.5080 |   7.4813 |   1.0036 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

cpcloud · 2014-05-08T03:16:49Z

@jreback ok to merge?

cpcloud · 2014-05-08T03:17:08Z

actually let me add a test in for lower prec ints

jreback · 2014-05-08T12:49:02Z

looks fine...maybe squash and merge

ENH: use size instead of cythonized count for fallback cases

jreback · 2014-05-09T16:43:20Z

issues on windows:

I think the fallback is not working: it should be x.size, not x.size(), so that path is untested.
I think there is a casting issue somewhere here (e.g. regular floats are not getting upcast to float64); the default on Windows 32 is float32.

>>> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(107)f()
-> result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
(Pdb) l
102             try:
103                 return self._cython_agg_general(alias, numeric_only=numeric_only)
104             except AssertionError as e:
105                 raise SpecificationError(str(e))
106             except Exception:
107  ->             result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
108                 if _convert:
109                     result = result.convert_objects()
110                 return result
111
112         f.__doc__ = "Compute %s of group values" % name
(Pdb) p self._cython_agg_general(alias,numeric_only=numeric_only)
*** ValueError: ValueError("Buffer dtype mismatch, expected 'float64_t' but got 'float'",)
(Pdb) d
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(2405)aggregate()
-> result = self._aggregate_generic(arg, *args, **kwargs)
(Pdb) d
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(2464)_aggregate_generic()
-> return self._aggregate_item_by_item(func, *args, **kwargs)
(Pdb) d
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(2508)_aggregate_item_by_item()
-> raise errors

======================================================================
ERROR: test_mixed_type_join_with_suffix (pandas.tools.tests.test_merge.TestMerge)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\tools\tests\test_merge.py", line 686, in test_mixed_type_join_with_suffix
    cn = grouped.count()
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 729, in count
    return self._count().astype('int64')
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in f
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2405, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2464, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2508, in _aggregate_item_by_item
    raise errors
TypeError: 'int' object is not callable

======================================================================
ERROR: test_count (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py", line 2012, in test_count
    count_as = df.groupby('A').count()
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 729, in count
    return self._count().astype('int64')
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in f
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2405, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2464, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2508, in _aggregate_item_by_item
    raise errors
TypeError: 'int' object is not callable

======================================================================
ERROR: test_count_object (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py", line 2026, in test_count_object
    result = df.groupby('c').a.count()
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 729, in count
    return self._count().astype('int64')
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in f
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2063, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2147, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in <lambda>
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 725, in <lambda>
    _count = _groupby_function('_count', 'count', lambda x, axis=0: x.size(),
TypeError: 'int' object is not callable
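
The `TypeError` in the tracebacks above can be reproduced with plain NumPy, assuming the object reaching the fallback lambda exposes `size` as an integer attribute, as `numpy.ndarray` does. This is a minimal sketch of the failure mode, not the pandas code itself.

```python
import numpy as np

values = np.arange(5)

# `size` is an attribute holding an int, so it is read, not called.
n = values.size

# Calling it evaluates `5()`, which raises the error seen in the tracebacks.
try:
    values.size()
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Under that assumption, the fallback would read `lambda x, axis=0: x.size`, which matches the suggestion at the top of this comment.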

cpcloud · 2014-05-09T16:45:26Z

@jreback small favor: do you have a box or something that i can log into and test this? i don't have a copy of windows anywhere that i can test this stuff on

jreback · 2014-05-09T16:47:28Z

wish I had an easy way for you to do that.

cpcloud · 2014-05-09T16:49:06Z

ok i'll see if i can repro this on linux

jreback · 2014-05-09T16:49:54Z

the first one I think you can 'fake' a test (e.g. force the cython agg to fail), then the rest raises

maybe passing in float32 should fail the other test

lmk
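
One way to "fake" such a test without a Windows box is sketched below. `ToyGrouper` is a hypothetical stand-in that only mirrors the try/except shape from the pdb listing; it is not the pandas `GroupBy` class, and the simulated error message is invented for the example.

```python
import numpy as np


class ToyGrouper:
    """Hypothetical stand-in with a fast path and a generic fallback."""

    def __init__(self, groups, force_fast_path_failure=False):
        self.groups = groups
        self.force_fast_path_failure = force_fast_path_failure
        self.used_fallback = False

    def _fast_count(self):
        # Simulate the compiled routine rejecting its input.
        if self.force_fast_path_failure:
            raise ValueError("Buffer dtype mismatch (simulated)")
        return {k: int(np.count_nonzero(~np.isnan(v)))
                for k, v in self.groups.items()}

    def count(self, fallback=lambda x: x.size):
        try:
            return self._fast_count()
        except Exception:
            # The branch that was untested in the report above.
            self.used_fallback = True
            return {k: int(fallback(v)) for k, v in self.groups.items()}


groups = {
    "a": np.array([1.0, 2.0], dtype=np.float32),
    "b": np.array([3.0, 4.0, 5.0], dtype=np.float32),
}

grouper = ToyGrouper(groups, force_fast_path_failure=True)
result = grouper.count()

# The buggy variant of the fallback fails as soon as the fast path is bypassed.
try:
    ToyGrouper(groups, force_fast_path_failure=True).count(
        fallback=lambda x: x.size())
    buggy_fallback_raised = False
except TypeError:
    buggy_fallback_raised = True
```

Forcing the fast path to fail is what makes the fallback branch run at all, so a test written this way would catch the `x.size()` mistake on any platform.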

jreback · 2014-05-09T16:50:23Z

I can test when you have changes

cpcloud · 2014-05-09T17:54:59Z

@jreback does np.empty(n) default to float32 on Win32?

jreback · 2014-05-09T18:02:46Z

same on 2.7-64 & numpy 1.8.0

C:\Users\Jeff Reback\Documents\GitHub\pandas>c:\python26-32\python.exe
Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.6.2'
>>> np.empty((1,1)).dtype
dtype('float64')
>>> np.empty(1).dtype
dtype('float64')
>>> np.arange(1).dtype
dtype('int32')
>>> np.arange(1.).dtype
dtype('float64')
>>>

cpcloud · 2014-05-09T18:03:19Z

thx
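
The interpreter session above shows that NumPy's default float is float64 on that machine, while the default integer is int32. A small sketch of making dtypes explicit so behaviour does not depend on the platform, assuming the compiled routine expects float64 buffers as the `Buffer dtype mismatch` error suggests; the variable names are illustrative only.

```python
import numpy as np

# Default float dtype for np.empty is float64.
default_float = np.empty(1).dtype

# The default integer dtype depends on platform and NumPy version
# (int32 in the session above), so request it explicitly when it matters.
default_int = np.arange(1).dtype
explicit_int = np.arange(1, dtype=np.int64).dtype

# float32 data must be upcast before it reaches a routine typed for float64.
single = np.array([1.5, 2.5], dtype=np.float32)
upcast = np.asarray(single, dtype=np.float64)
```

Passing a float32 array like `single` into a test is one way to exercise the casting problem on Linux, since it does not rely on any platform default.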

cpcloud added this to the 0.14.0 milestone May 6, 2014

jreback added Performance labels May 6, 2014

cpcloud self-assigned this May 6, 2014

ENH: use size instead of cythonized count for fallback cases

cf6cbb2

cpcloud added a commit that referenced this pull request May 8, 2014

Merge pull request #7055 from cpcloud/groupby-count-with-size

98b4fcf

ENH: use size instead of cythonized count for fallback cases

cpcloud merged commit 98b4fcf into pandas-dev:master May 8, 2014

cpcloud deleted the groupby-count-with-size branch May 8, 2014 17:55

cpcloud mentioned this pull request May 9, 2014

BUG: use size attribute (not method call) #7089

Merged
