Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

def f(group):
    print(group)
    print(group['A'])
    raise TypeError

if __name__ == '__main__':
    pd.DataFrame.from_dict({'A': [1], 'B': [3]}).groupby('A').apply(f)
Issue Description
If a function (f), which is applied via core.groupby.GroupBy.apply, raises a TypeError, the grouping column (A) is dropped and f is executed again. However, if A is accessed in f, this leads to a confusing stack trace, which can mask the actual error:
Traceback (most recent call last):
...
File "sample.py", line 6, in f
raise TypeError
TypeError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'A'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
...
File ".../pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 'A'
Of course, if you read the output top to bottom, you notice the actual TypeError and take care of it first. However, I read the KeyError trace first and was really confused as to how the column had been dropped while I was debugging (the stack trace is very long in this case). Other exceptions are simply raised, so this behavior was unexpected. I'm not sure whether it at least warrants a note in the documentation. However, I assume it is an actual bug because of the code responsible for it, in pandas/core/groupby/groupby.py, lines 1413-1425:
try:
    result = self._python_apply_general(f, self._selected_obj)
except TypeError:
    # gh-20949
    # try again, with .apply acting as a filtering
    # operation, by excluding the grouping column
    # This would normally not be triggered
    # except if the udf is trying an operation that
    # fails on *some* columns, e.g. a numeric operation
    # on a string grouper column
    with self._group_selection_context():
        return self._python_apply_general(f, self._selected_obj)
I assume this references issue #20949, based on the tag, but I couldn't find anything in that issue directly addressing this behavior. The comment assumes that this handling is only triggered "if the udf is trying an operation that fails on some columns", which is not the case: any TypeError raised by the function triggers it. I'm not sure this code works as intended. A better option might be to catch the TypeError more specifically, or to add an option to GroupBy.apply that specifies whether the grouping column should be passed to the applied function.
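To make the masking effect easier to see in isolation, here is a minimal pure-Python sketch of the retry pattern quoted above. It does not use pandas at all: apply_with_fallback and run_demo are hypothetical names, and a plain dict stands in for the group. It only illustrates the control flow as I understand it, not the actual pandas implementation.

```python
def apply_with_fallback(func, group, grouping_key):
    """Hypothetical stand-in for the try/except quoted above."""
    try:
        return func(group)
    except TypeError:
        # Retry without the grouping column, mirroring the fallback.
        reduced = {k: v for k, v in group.items() if k != grouping_key}
        return func(reduced)


def f(group):
    _ = group["A"]  # works on the first call, KeyError on the retry
    raise TypeError("the actual problem")


def run_demo():
    """Return the names of the exception that surfaces and the one it hides."""
    try:
        apply_with_fallback(f, {"A": [1], "B": [3]}, "A")
    except KeyError as err:
        # The original TypeError is only reachable via the implicit chain.
        return type(err).__name__, type(err.__context__).__name__
    return None


if __name__ == "__main__":
    print(run_demo())
```

The exception that reaches the caller is the KeyError from the second call; the TypeError that started it all survives only as the chained context, which is what makes the trace confusing.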
Expected Behavior
Simply raise the TypeError that occurred, as is done with other exceptions.
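In the meantime, one possible user-side workaround is to keep a TypeError from ever reaching the except TypeError clause by converting it to a different exception type inside the applied function. The sketch below is only an assumption about how that could look: UserFunctionError and reraise_type_errors are hypothetical names, not part of pandas, and I have not checked this against every pandas version.

```python
import functools


class UserFunctionError(RuntimeError):
    """Hypothetical wrapper type for a TypeError raised in an applied function."""


def reraise_type_errors(func):
    """Wrap func so that a TypeError surfaces as UserFunctionError instead."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except TypeError as err:
            # Chain explicitly so the original error stays visible in the trace.
            raise UserFunctionError(f"TypeError in {func.__name__}: {err}") from err
    return wrapper


def failing(group):
    raise TypeError("the actual problem")


if __name__ == "__main__":
    try:
        reraise_type_errors(failing)({"A": [1]})
    except UserFunctionError as err:
        print(repr(err), "caused by", repr(err.__cause__))
```

With the example from this report, the call would become .groupby('A').apply(reraise_type_errors(f)). Since UserFunctionError is not a TypeError, the retry without the grouping column should not be triggered, and the original error remains available as the cause.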
Installed Versions
pd.show_versions() failed, so here is the output of pd.__version__ and sys.version:
pd.__version__: 1.4.1
sys.version: 3.10.2 ...