ENH: use size instead of cythonized count for fallback cases #7055

Merged
merged 1 commit into from
May 8, 2014

Conversation

Member

cpcloud commented May 6, 2014

  • use size instead of the cythonized count for the integer case, since
    integers cannot hold nan
  • not faster performance-wise
  • compilation time is shorter because count has no int templates except
    those for dates and times
  • any precision issues with other integer types (I don't think there were any) are gone, since only int64 is compared with dates, whereas lower-precision integers use size.
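
The reasoning in the first bullet can be checked with a small sketch. It uses the present-day public pandas API rather than the internals touched by this PR, and the frame and column names are made up for illustration: for an integer column, per-group count and size agree, while for a float column containing nan they can differ.

```python
import numpy as np
import pandas as pd

# Illustrative data only; column names are assumptions, not from the PR.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "ints": np.array([1, 2, 3, 4, 5], dtype="int32"),  # integer dtype cannot hold nan
    "floats": [1.0, np.nan, 3.0, np.nan, 5.0],          # float dtype can
})

grouped = df.groupby("key")

sizes = grouped.size()                    # rows per group, nan or not
int_counts = grouped["ints"].count()      # non-null values per group
float_counts = grouped["floats"].count()  # nan entries are excluded here

# With no nan possible, count and size agree for the integer column.
ints_match = bool((int_counts == sizes).all())

# For the float column the nan rows are dropped from count, so they differ.
floats_match = bool((float_counts == sizes).all())
```

This is why size is a safe substitute for count only in the integer case.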

Anywho, here are the vbench results for this PR vs master:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_int_count                            |   4.0247 |   4.0337 |   0.9978 |
groupby_multi_count                          |   7.5080 |   7.4813 |   1.0036 |
-------------------------------------------------------------------------------

cpcloud added this to the 0.14.0 milestone May 6, 2014
cpcloud self-assigned this May 6, 2014
Member Author

cpcloud commented May 8, 2014

@jreback ok to merge?

Member Author

cpcloud commented May 8, 2014

actually let me add a test in for lower prec ints

Contributor

jreback commented May 8, 2014

looks fine...maybe squash and merge

cpcloud added a commit that referenced this pull request May 8, 2014
ENH: use size instead of cythonized count for fallback cases
cpcloud merged commit 98b4fcf into pandas-dev:master May 8, 2014
cpcloud deleted the groupby-count-with-size branch May 8, 2014 17:55
Contributor

jreback commented May 9, 2014

issues on windows:

  • I think the fallback is not working, e.g. it should be x.size, not x.size(), so that path is untested
  • I think there is a casting issue somewhere here (e.g. regular floats are not getting upcast to float64); default on windows 32 is float32
>>> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(107)f()
-> result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
(Pdb) l
102             try:
103                 return self._cython_agg_general(alias, numeric_only=numeric_only)
104             except AssertionError as e:
105                 raise SpecificationError(str(e))
106             except Exception:
107  ->             result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
108                 if _convert:
109                     result = result.convert_objects()
110                 return result
111
112         f.__doc__ = "Compute %s of group values" % name
(Pdb) p self._cython_agg_general(alias,numeric_only=numeric_only)
*** ValueError: ValueError("Buffer dtype mismatch, expected 'float64_t' but got 'float'",)
(Pdb) d
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(2405)aggregate()
-> result = self._aggregate_generic(arg, *args, **kwargs)
(Pdb) d
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(2464)_aggregate_generic()
-> return self._aggregate_item_by_item(func, *args, **kwargs)
(Pdb) d
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-2.6\pandas\core\groupby.py(2508)_aggregate_item_by_item()
-> raise errors
======================================================================
ERROR: test_mixed_type_join_with_suffix (pandas.tools.tests.test_merge.TestMerge)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\tools\tests\test_merge.py", line 686, in test_mixed_type_join_with_suffix
    cn = grouped.count()
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 729, in count
    return self._count().astype('int64')
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in f
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2405, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2464, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2508, in _aggregate_item_by_item
    raise errors
TypeError: 'int' object is not callable

======================================================================
ERROR: test_count (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py", line 2012, in test_count
    count_as = df.groupby('A').count()
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 729, in count
    return self._count().astype('int64')
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in f
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2405, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2464, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2508, in _aggregate_item_by_item
    raise errors
TypeError: 'int' object is not callable

======================================================================
ERROR: test_count_object (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\tests\test_groupby.py", line 2026, in test_count_object
    result = df.groupby('c').a.count()
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 729, in count
    return self._count().astype('int64')
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in f
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2063, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 2147, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 107, in <lambda>
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-2.6\pandas\core\groupby.py", line 725, in <lambda>
    _count = _groupby_function('_count', 'count', lambda x, axis=0: x.size(),
TypeError: 'int' object is not callable
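
The TypeError at the bottom of these tracebacks can be reproduced in isolation. A minimal sketch, using a current pandas Series as a stand-in for the group passed to the fallback lambda: size is an attribute that returns a plain int, so calling it fails exactly as shown above.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

n = s.size  # attribute access returns a plain Python int

try:
    s.size()  # calling that int is what the fallback lambda does
except TypeError as exc:
    message = str(exc)
```

Dropping the parentheses in the fallback, as suggested at the top of this comment, avoids the error.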

Member Author

cpcloud commented May 9, 2014

@jreback small favor: do you have a box or something that i can log into and test this? i don't have a copy of windows anywhere that i can test this stuff on

Contributor

jreback commented May 9, 2014

wish I had an easy way for you to do that.

Member Author

cpcloud commented May 9, 2014

ok i'll see if i can repro this on linux

Contributor

jreback commented May 9, 2014

the first one I think you can 'fake' a test (e.g. force the cython agg to fail), then the rest raises

maybe passing in float32 should fail the other test

lmk
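
One way to "fake" such a test on any platform is sketched below in plain Python. Every name here (Group, with_fallback, the fast-path and fallback functions) is hypothetical and only mirrors the try/except shape visible in the pdb listing; it is not the pandas implementation. The point is that a broad except hides a broken fallback until the fast path is forced to fail.

```python
class Group:
    """Hypothetical stand-in for a grouped chunk; size is an attribute."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def size(self):
        return len(self.values)


def with_fallback(fast_path, fallback):
    """Try fast_path first and use fallback on any exception."""
    def aggregate(group):
        try:
            return fast_path(group)
        except Exception:
            return fallback(group)
    return aggregate


def working_fast_path(group):
    return len(group.values)


def failing_fast_path(group):
    # Stands in for a compiled routine that rejects its input.
    raise ValueError("forced failure")


def buggy_fallback(group):
    return group.size()  # calls an int


def fixed_fallback(group):
    return group.size


group = Group([1, 2, 3])

# While the fast path succeeds, the buggy fallback is never exercised.
hidden = with_fallback(working_fast_path, buggy_fallback)(group)

# Forcing the fast path to fail exposes the bug.
try:
    with_fallback(failing_fast_path, buggy_fallback)(group)
    exposed = None
except TypeError as exc:
    exposed = str(exc)

# The same forced failure confirms the corrected fallback.
recovered = with_fallback(failing_fast_path, fixed_fallback)(group)
```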

Contributor

jreback commented May 9, 2014

I can test when you have changes

Member Author

cpcloud commented May 9, 2014

@jreback does np.empty(n) default to float32 on Win32?

Contributor

jreback commented May 9, 2014

same on 2.7-64 & numpy 1.8.0

C:\Users\Jeff Reback\Documents\GitHub\pandas>c:\python26-32\python.exe
Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.6.2'
>>> np.empty((1,1)).dtype
dtype('float64')
>>> np.empty(1).dtype
dtype('float64')
>>> np.arange(1).dtype
dtype('int32')
>>> np.arange(1.).dtype
dtype('float64')
>>>
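
The same checks can be written as a script. This is a sketch under the assumption that whatever reaches the float64-only routine may be narrower than float64; the thread does not establish where the narrower buffer comes from, and the explicit upcast shown is one possible guard, not the fix that was applied.

```python
import numpy as np

float_default = np.empty(1).dtype  # float64, matching the session above
int_default = np.arange(1).dtype   # integer width depends on platform and numpy version

# A float32 array, requested explicitly here for illustration.
narrow = np.ones(3, dtype=np.float32)

# Explicit upcast before handing data to code that only accepts float64.
widened = narrow.astype(np.float64)
```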

Member Author

cpcloud commented May 9, 2014

thx

Labels: Missing-data, Performance