REGR: 1.3 invalid exclusion of nuisance columns with groupby aggregation #43380

Closed
@joseph-wakeling-frequenz

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Adapted from "Automatic exclusion of nuisance columns" in the User Guide "Group by" docs:

from decimal import Decimal

import pandas as pd


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

print('\ndf_dec.groupby(["id"])[["dec_column"]].sum()')
print('According to docs this should sum correctly')
print('It works for pandas 1.2.x but generates an empty dataframe for 1.3.x')
print(df_dec.groupby(["id"])[["dec_column"]].sum())

print('\ndf_dec.groupby(["id"])[["int_column", "dec_column"]].sum()')
print('This drops `dec_column` as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"])[["int_column", "dec_column"]].sum())

print('\ndf_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})')
print('This aggregates everything correctly as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"}))

Problem description

The User Guide "Group by" docs provide a code example that shows when nuisance columns will be excluded from aggregation. According to this doc, the case:

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce a valid aggregation, but for pandas >= 1.3.0 it results in an empty dataframe.
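The behaviour described above can be turned into a small check, which may help when comparing pandas versions. This is only a sketch: the function name `single_decimal_column_is_dropped` is hypothetical, and the result depends on the installed pandas version (per this report, `True` would be expected on 1.3.x and `False` on 1.2.x).

```python
from decimal import Decimal

import pandas as pd


def single_decimal_column_is_dropped() -> bool:
    """Return True if summing a lone Decimal column loses that column.

    Hypothetical helper for illustration; not part of pandas.
    """
    df = pd.DataFrame(
        {
            "id": [1, 2, 1, 2],
            "dec_column": [
                Decimal("0.50"),
                Decimal("0.15"),
                Decimal("0.25"),
                Decimal("0.40"),
            ],
        }
    )
    # Same selection as in the report: a single object-dtype column.
    result = df.groupby(["id"])[["dec_column"]].sum()
    # The regression shows up as a result with no columns at all.
    return "dec_column" not in result.columns


print(pd.__version__, single_decimal_column_is_dropped())
```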

The impact of the regression can even be seen in the published docs. If we look at an archive.org 2021-02-25 snapshot of the "Automatic exclusion of nuisance columns" section, we can see that the example produces correct output (see Out[170] in the code example):
https://web.archive.org/web/20210225195813/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

By contrast the 2021-08-24 snapshot displays an empty dataframe for the Out[170] example:
https://web.archive.org/web/20210824151314/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

However, note that the docs in both cases indicate that the example should produce a correct aggregation.

Expected Output

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce the result:

   dec_column
id           
1        0.75
2        0.55

For pandas 1.2.x it does so as expected. For pandas >= 1.3.0 it instead produces the incorrect output:

Empty DataFrame
Columns: []
Index: [1, 2]
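A possible workaround, based on the third case in the code sample (the dictionary form of `agg`, which the sample reports as working on both 1.2.x and 1.3.x), is sketched below. It assumes the dictionary form behaves the same way when only `dec_column` is requested, which the sample above does not test directly; the helper name `sum_decimal_by_group` is hypothetical.

```python
from decimal import Decimal

import pandas as pd


def sum_decimal_by_group(df: pd.DataFrame, key: str, column: str) -> pd.DataFrame:
    """Sum one column per group via the dict form of agg.

    Hypothetical helper for illustration; assumes the per-column
    aggregation path does not treat the column as a nuisance column.
    """
    return df.groupby([key]).agg({column: "sum"})


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

result = sum_decimal_by_group(df_dec, "id", "dec_column")
print(result)
```

The result keeps the `Decimal` objects, so comparisons against the expected values should also use `Decimal` rather than floats.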

Metadata

    Labels

    Bug, Docs, Duplicate Report, Groupby, Nuisance Columns, Reduction Operations, Regression
