DEPR: Silent dropping of nuisance columns in agg_list_like #43741


Merged 11 commits on Sep 29, 2021
DEPR: Silent dropping of columns in NDFrame.agg with list-like func
rhshadrach committed Sep 25, 2021
commit bc67672031ff1b1b51719ea5af65388a3eb09799
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -337,6 +337,7 @@ Other Deprecations
- Deprecated the ``squeeze`` argument to :meth:`read_csv`, :meth:`read_table`, and :meth:`read_excel`. Users should squeeze the DataFrame afterwards with ``.squeeze("columns")`` instead. (:issue:`43242`)
- Deprecated the ``index`` argument to :class:`SparseArray` construction (:issue:`23089`)
- Deprecated :meth:`.Rolling.validate`, :meth:`.Expanding.validate`, and :meth:`.ExponentialMovingWindow.validate` (:issue:`43665`)
- Deprecated silent dropping of columns that raised a ``TypeError`` or ``DataError`` in :class:`Series.aggregate` and :class:`DataFrame.aggregate` when used with a list (:issue:``)

.. ---------------------------------------------------------------------------

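An illustrative sketch of the behaviour this whatsnew entry describes, using column names that mirror the tests changed below (the exact warning text is taken from this PR):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [1.0, 2.0, 3.0],
        "C": ["foo", "bar", "baz"],
        "D": pd.date_range("20130101", periods=3),
    }
)

# "sum" raises a TypeError on the datetime column. Previously 'D' was dropped
# from the result silently; with this deprecation the same call emits a
# FutureWarning along the lines of "['D'] did not aggregate successfully"
# before dropping it.
result = df.agg(["sum"])
print(result)
```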
21 changes: 18 additions & 3 deletions pandas/core/apply.py
@@ -33,6 +33,7 @@
FrameOrSeries,
)
from pandas.util._decorators import cache_readonly
from pandas.util._exceptions import find_stack_level

from pandas.core.dtypes.cast import is_nested_object
from pandas.core.dtypes.common import (
@@ -335,6 +336,7 @@ def agg_list_like(self) -> DataFrame | Series:

results = []
keys = []
failed_names = []

# degenerate case
if selected_obj.ndim == 1:
@@ -344,7 +346,7 @@ def agg_list_like(self) -> DataFrame | Series:
new_res = colg.aggregate(a)

except TypeError:
pass
failed_names.append(com.get_callable_name(a) or a)
else:
results.append(new_res)

@@ -358,10 +360,14 @@ def agg_list_like(self) -> DataFrame | Series:
for index, col in enumerate(selected_obj):
colg = obj._gotitem(col, ndim=1, subset=selected_obj.iloc[:, index])
try:
new_res = colg.aggregate(arg)
with warnings.catch_warnings(record=True) as w:
new_res = colg.aggregate(arg)
if len(w) > 0:

Member: Are all warnings that happen here going to represent a failure?

rhshadrach (Member Author): Woah - thanks. No, probably not. Will inspect the warnings themselves.

rhshadrach (Member Author), Sep 25, 2021: This got a bit hairy; I wasn't able to find a good way to capture and suppress only our warnings while passing through any other warnings and still being able to determine whether we raised. Supplying a filter within the catch_warnings context manager means we can't tell if we raised; using record=True there means all user warnings would also be suppressed.

If there are any suggestions for a better implementation, it would be appreciated.
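A minimal standalone sketch of the trade-off described here; `user_func` is a made-up stand-in for a user-supplied aggregation:

```python
import warnings

def user_func(values):
    # A user-supplied aggregation that legitimately warns.
    warnings.warn("a warning the caller probably wants to see", UserWarning)
    return sum(values)

with warnings.catch_warnings(record=True) as captured:
    warnings.simplefilter("always")
    user_func([1, 2, 3])

# The user's warning was captured (and therefore silenced) along with any
# pandas-internal ones; nothing was shown to the caller.
print([str(w.message) for w in captured])
```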

Member: What happens if you don't catch warnings at all here?

rhshadrach (Member Author): Running

import pandas as pd

def foo(x):
    return x.prod()

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': ['z', 'w'], 'd': ['a', 'b']})
df.agg([foo, lambda x: x.sum()])

gives

FutureWarning: ['foo'] did not aggregate successfully.
FutureWarning: ['foo'] did not aggregate successfully.
FutureWarning: ['foo'] did not aggregate successfully.
Member: I think there's something like a filter="once" to prevent exactly that (and I think it's the default).
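For reference, a small standalone check of the deduplication being referred to; under Python's stock filters an identical warning from the same call site is only displayed once:

```python
import warnings

def emit():
    warnings.warn("did not aggregate successfully", FutureWarning)

# Only the first call prints under the default filters; repeated messages
# therefore suggest something is resetting the warning registry.
for _ in range(3):
    emit()
```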

rhshadrach (Member Author): I see - thanks. Digging into this, the only explanation I can find for getting multiple warnings is https://bugs.python.org/issue29672. I haven't confirmed yet whether the code within agg ever hits a separate catch_warnings; if it does, that will cause repeated messages.

rhshadrach (Member Author): I'm still going to verify that the use of another catch_warnings (in a different part of the code) is the cause of the multiple messages I see without the special handling in this PR.

rhshadrach (Member Author), Sep 30, 2021: I've confirmed it is because of the Python bug mentioned above. The code in the sample hits pandas.core.indexes.base.Index._with_infer, which uses the catch_warnings context manager. Here is an MWE to demonstrate the bug (mirroring the structure of agg_list_like):

import warnings

def foo():
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", ".*the Index constructor", FutureWarning)


def bar(ndim):
    if ndim == 1:
        pass
    else:
        for _ in range(3):
            bar(1)

    warnings.warn('test warning', FutureWarning)
    foo()

bar(2)

With the call to foo from within bar, three warning messages are printed. When the call to foo is removed, only one is printed.

With this, are we okay with catching the warnings as was merged @jbrockmendel?

Member: I'm fine with it. A comment to this effect might be worthwhile.
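A self-contained sketch of the pattern being merged, with a comment of the sort suggested above; `aggregate_one_column` is a made-up stand-in for `colg.aggregate(arg)`:

```python
import warnings

def aggregate_one_column():
    # Stand-in for colg.aggregate(arg): a nested aggregation may itself warn
    # and may use catch_warnings internally, which can defeat the usual
    # once-per-location deduplication (https://bugs.python.org/issue29672).
    warnings.warn("['foo'] did not aggregate successfully", FutureWarning)
    return 42

failed_names = []
col = "B"

# record=True both captures and suppresses warnings from the nested call, so
# duplicated messages never reach the user; a non-empty record marks the
# column as (partially) failed.
with warnings.catch_warnings(record=True) as record:
    new_res = aggregate_one_column()
if len(record) > 0:
    failed_names.append(col)

print(new_res, failed_names)  # 42 ['B']
```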

failed_names.append(col)
except (TypeError, DataError):
pass
failed_names.append(col)
except ValueError as err:
failed_names.append(col)
# cannot aggregate
if "Must produce aggregated value" in str(err):
# raised directly in _aggregate_named
@@ -384,6 +390,15 @@ def agg_list_like(self) -> DataFrame | Series:
if not len(results):
raise ValueError("no results")

if len(failed_names) > 0:
warnings.warn(
f"{failed_names} did not aggregate successfully. If any error is "
f"raised this will raise in a future version of pandas. "
f"Drop these columns/ops to avoid this warning.",
FutureWarning,
stacklevel=find_stack_level(),
)

try:
concatenated = concat(results, keys=keys, axis=1, sort=False)
except TypeError as err:
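A hedged example of the degenerate (ndim == 1) branch above; the warning text is taken from this PR, the data is illustrative:

```python
import pandas as pd

def foo(x):
    return x.prod()

s = pd.Series(["x", "y", "z"])

# foo cannot take the product of strings, while "sum" concatenates them.
# The failing entry is reported by callable name: expect a FutureWarning
# roughly like "['foo'] did not aggregate successfully", and a result that
# still contains the entries that worked.
result = s.agg([foo, "sum"])
print(result)
```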
18 changes: 14 additions & 4 deletions pandas/tests/apply/test_frame_apply.py
@@ -1087,12 +1087,16 @@ def test_agg_multiple_mixed_no_warning():
index=["min", "sum"],
)
# sorted index
with tm.assert_produces_warning(None):
with tm.assert_produces_warning(
FutureWarning, match=r"\['D'\] did not aggregate successfully"
):
result = mdf.agg(["min", "sum"])

tm.assert_frame_equal(result, expected)

with tm.assert_produces_warning(None):
with tm.assert_produces_warning(
FutureWarning, match=r"\['D'\] did not aggregate successfully"
):
result = mdf[["D", "C", "B", "A"]].agg(["sum", "min"])

# GH40420: the result of .agg should have an index that is sorted
@@ -1201,7 +1205,10 @@ def test_nuiscance_columns():
expected = Series([6, 6.0, "foobarbaz"], index=["A", "B", "C"])
tm.assert_series_equal(result, expected)

result = df.agg(["sum"])
with tm.assert_produces_warning(
FutureWarning, match=r"\['D'\] did not aggregate successfully"
):
result = df.agg(["sum"])
expected = DataFrame(
[[6, 6.0, "foobarbaz"]], index=["sum"], columns=["A", "B", "C"]
)
@@ -1433,7 +1440,10 @@ def foo(s):
return s.sum() / 2

aggs = ["sum", foo, "count", "min"]
result = df.agg(aggs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['item'\] did not aggregate successfully"
):
result = df.agg(aggs)
expected = DataFrame(
{
"item": ["123456", np.nan, 6, "1"],
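For downstream code that cannot immediately drop the failing columns/ops, one possible way to silence just this warning (a sketch; the message pattern is assumed from the text introduced in this PR):

```python
import warnings

# Ignore only the new aggregation deprecation warning; other FutureWarnings
# are still shown.
warnings.filterwarnings(
    "ignore",
    message=r".*did not aggregate successfully",
    category=FutureWarning,
)
```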
10 changes: 8 additions & 2 deletions pandas/tests/groupby/aggregate/test_aggregate.py
@@ -339,8 +339,14 @@ def test_multiple_functions_tuples_and_non_tuples(df):
expected = df.groupby("A")["C"].agg(ex_funcs)
tm.assert_frame_equal(result, expected)

result = df.groupby("A").agg(funcs)
expected = df.groupby("A").agg(ex_funcs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['B'\] did not aggregate successfully"
):
result = df.groupby("A").agg(funcs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['B'\] did not aggregate successfully"
):
expected = df.groupby("A").agg(ex_funcs)
tm.assert_frame_equal(result, expected)


8 changes: 6 additions & 2 deletions pandas/tests/groupby/aggregate/test_other.py
@@ -45,13 +45,17 @@ def peak_to_peak(arr):
return arr.max() - arr.min()

with tm.assert_produces_warning(
FutureWarning, match="Dropping invalid", check_stacklevel=False
FutureWarning,
match=r"\['key2'\] did not aggregate successfully",
check_stacklevel=False,
):
expected = grouped.agg([peak_to_peak])
expected.columns = ["data1", "data2"]

with tm.assert_produces_warning(
FutureWarning, match="Dropping invalid", check_stacklevel=False
FutureWarning,
match=r"\['key2'\] did not aggregate successfully",
check_stacklevel=False,
):
result = grouped.agg(peak_to_peak)
tm.assert_frame_equal(result, expected)
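A hedged groupby example mirroring the peak_to_peak test above (the data is illustrative; the warning text comes from this PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key1": ["a", "a", "b", "b"],
        "key2": ["one", "two", "one", "two"],
        "data1": np.arange(4.0),
        "data2": np.arange(4.0, 8.0),
    }
)

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Subtracting strings raises on 'key2', so the aggregation now warns
# (roughly "['key2'] did not aggregate successfully") while still returning
# peak-to-peak values for the numeric columns.
result = df.groupby("key1").agg([peak_to_peak])
print(result)
```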
5 changes: 4 additions & 1 deletion pandas/tests/groupby/test_groupby.py
@@ -583,7 +583,10 @@ def test_frame_multi_key_function_list():

grouped = data.groupby(["A", "B"])
funcs = [np.mean, np.std]
agged = grouped.agg(funcs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['C'\] did not aggregate successfully"
):
agged = grouped.agg(funcs)
expected = pd.concat(
[grouped["D"].agg(funcs), grouped["E"].agg(funcs), grouped["F"].agg(funcs)],
keys=["D", "E", "F"],
4 changes: 3 additions & 1 deletion pandas/tests/resample/test_resample_api.py
@@ -352,7 +352,9 @@ def test_agg():
for t in cases:
warn = FutureWarning if t in cases[1:3] else None
with tm.assert_produces_warning(
warn, match="Dropping invalid columns", check_stacklevel=False
warn,
match=r"\['date'\] did not aggregate successfully",
check_stacklevel=False,
):
# .var on dt64 column raises and is dropped
result = t.aggregate([np.mean, np.std])
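A hedged resample example corresponding to the warning case above (frequency and data are illustrative; the warning text comes from this PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-01", periods=10, freq="D"),
        "a": np.arange(10.0),
    }
)

# np.std raises on the datetime column during the resampled aggregation, so
# the call now warns (roughly "['date'] did not aggregate successfully")
# instead of dropping that column silently.
result = df.resample("A", on="date").agg([np.mean, np.std])
print(result)
```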