DEPR: observed=False default in groupby #51811

Merged: 10 commits, Mar 17, 2023
4 changes: 2 additions & 2 deletions doc/source/user_guide/10min.rst
Original file line number Diff line number Diff line change
Expand Up @@ -702,11 +702,11 @@ Sorting is per order in the categories, not lexical order:

df.sort_values(by="grade")

Grouping by a categorical column also shows empty categories:
Grouping by a categorical column with ``observed=False`` also shows empty categories:

.. ipython:: python

df.groupby("grade").size()
df.groupby("grade", observed=False).size()


Plotting
Expand Down
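The doc change above makes the empty-category behavior explicit rather than relying on the default. A minimal sketch of the difference, assuming a small hypothetical frame (the `id` column and the sample grades are made up for illustration, not taken from the user guide):

```python
import pandas as pd

# Hypothetical data: five declared categories, only two of them used.
grades = pd.Categorical(
    ["good", "good", "very bad"],
    categories=["very bad", "bad", "medium", "good", "very good"],
)
df = pd.DataFrame({"id": [1, 2, 3], "grade": grades})

# observed=False keeps every declared category, including empty ones.
all_categories = df.groupby("grade", observed=False).size()

# observed=True keeps only the categories that actually occur in the data.
observed_only = df.groupby("grade", observed=True).size()

print(all_categories)
print(observed_only)
```

Passing the keyword explicitly in either form keeps the result stable regardless of which default a given pandas version uses.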
4 changes: 2 additions & 2 deletions doc/source/user_guide/advanced.rst
Original file line number Diff line number Diff line change
Expand Up @@ -800,8 +800,8 @@ Groupby operations on the index will preserve the index nature as well.

.. ipython:: python

df2.groupby(level=0).sum()
df2.groupby(level=0).sum().index
df2.groupby(level=0, observed=True).sum()
df2.groupby(level=0, observed=True).sum().index

Reindexing operations will return a resulting index based on the type of the passed
indexer. Passing a list will return a plain-old ``Index``; indexing with
Expand Down
10 changes: 5 additions & 5 deletions doc/source/user_guide/categorical.rst
Original file line number Diff line number Diff line change
Expand Up @@ -607,7 +607,7 @@ even if some categories are not present in the data:
s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
s.value_counts()

``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories.
``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories when ``observed=False``.

.. ipython:: python

Expand All @@ -618,17 +618,17 @@ even if some categories are not present in the data:
data=[[1, 2, 3], [4, 5, 6]],
columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
).T
df.groupby(level=1).sum()
df.groupby(level=1, observed=False).sum()

Groupby will also show "unused" categories:
Groupby will also show "unused" categories when ``observed=False``:

.. ipython:: python

cats = pd.Categorical(
["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
df.groupby("cats").mean()
df.groupby("cats", observed=False).mean()

cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame(
Expand All @@ -638,7 +638,7 @@ Groupby will also show "unused" categories:
"values": [1, 2, 3, 4],
}
)
df2.groupby(["cats", "B"]).mean()
df2.groupby(["cats", "B"], observed=False).mean()


Pivot tables:
Expand Down
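With more than one grouping key, `observed=False` fills in every combination of the declared categories and the other key's values. The following is a sketch under assumed data: the non-categorical column name `key` and its values are hypothetical, since the corresponding lines are elided in the hunk above.

```python
import pandas as pd

# Category "c" is declared but never used.
cats = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
frame = pd.DataFrame(
    {
        "cats": cats,
        "key": ["x", "y", "x", "y"],  # hypothetical second grouper
        "values": [1, 2, 3, 4],
    }
)

# All 3 categories x 2 key values appear; the unused ones hold NaN.
full = frame.groupby(["cats", "key"], observed=False)["values"].mean()

# Only the 4 combinations present in the data appear.
seen = frame.groupby(["cats", "key"], observed=True)["values"].mean()

print(full)
print(seen)
```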
2 changes: 1 addition & 1 deletion doc/source/user_guide/groupby.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1401,7 +1401,7 @@ can be used as group keys. If so, the order of the levels will be preserved:

factor = pd.qcut(data, [0, 0.25, 0.5, 0.75, 1.0])

data.groupby(factor).mean()
data.groupby(factor, observed=False).mean()

.. _groupby.specify:

Expand Down
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.15.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ For full docs, see the :ref:`categorical introduction <categorical>` and the
"medium", "good", "very good"])
df["grade"]
df.sort_values("grade")
df.groupby("grade").size()
df.groupby("grade", observed=False).size()

- ``pandas.core.group_agg`` and ``pandas.core.factor_agg`` were removed. As an alternative, construct
a dataframe and use ``df.groupby(<group>).agg(<func>)``.
Expand Down
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.19.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1134,7 +1134,7 @@ As a consequence, ``groupby`` and ``set_index`` also preserve categorical dtypes
.. ipython:: python

df = pd.DataFrame({"A": [0, 1], "B": [10, 11], "C": cat})
df_grouped = df.groupby(by=["A", "C"]).first()
df_grouped = df.groupby(by=["A", "C"], observed=False).first()
df_set_idx = df.set_index(["A", "C"])

**Previous behavior**:
Expand Down
4 changes: 2 additions & 2 deletions doc/source/whatsnew/v0.20.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -289,15 +289,15 @@ In previous versions, ``.groupby(..., sort=False)`` would fail with a ``ValueErr

.. code-block:: ipython

In [3]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
In [3]: df[df.chromosomes != '1'].groupby('chromosomes', observed=False, sort=False).sum()
---------------------------------------------------------------------------
ValueError: items in new_categories are not the same as in old categories

**New behavior**:

.. ipython:: python

df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
df[df.chromosomes != '1'].groupby('chromosomes', observed=False, sort=False).sum()

.. _whatsnew_0200.enhancements.table_schema:

Expand Down
6 changes: 3 additions & 3 deletions doc/source/whatsnew/v0.22.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ instead of ``NaN``.

In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
In [9]: pd.Series([1, 2]).groupby(grouper, observed=False).sum()
Out[9]:
a 3.0
b NaN
Expand All @@ -120,14 +120,14 @@ instead of ``NaN``.
.. ipython:: python

grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
pd.Series([1, 2]).groupby(grouper).sum()
pd.Series([1, 2]).groupby(grouper, observed=False).sum()

To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
use ``min_count>=1``.

.. ipython:: python

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
pd.Series([1, 2]).groupby(grouper, observed=False).sum(min_count=1)

Resample
^^^^^^^^
Expand Down
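The interaction between unobserved categories and `min_count` described in the v0.22.0 notes can be sketched as follows. The variable names `values`, `default_sum`, and `strict_sum` are illustrative only.

```python
import pandas as pd

grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
values = pd.Series([1, 2])

# The unobserved group "b" sums to 0 by default.
default_sum = values.groupby(grouper, observed=False).sum()

# Requiring at least one value turns the empty group into NaN.
strict_sum = values.groupby(grouper, observed=False).sum(min_count=1)

print(default_sum)
print(strict_sum)
```

With `observed=True` the group "b" would not appear at all, so `min_count` only matters here because unobserved categories are kept.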
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ Deprecations
- Deprecated silently dropping unrecognized timezones when parsing strings to datetimes (:issue:`18702`)
- Deprecated :meth:`DataFrame._data` and :meth:`Series._data`, use public APIs instead (:issue:`33333`)
- Deprecating pinning ``group.name`` to each group in :meth:`SeriesGroupBy.aggregate` aggregations; if your operation requires utilizing the groupby keys, iterate over the groupby object instead (:issue:`41090`)
- Deprecated the default of ``observed=False`` in :meth:`DataFrame.groupby` and :meth:`Series.groupby`; this will default to ``True`` in a future version (:issue:`43999`)
- Deprecated ``axis=1`` in :meth:`DataFrame.groupby` and in :class:`Grouper` constructor, do ``frame.T.groupby(...)`` instead (:issue:`51203`)
- Deprecated passing a :class:`DataFrame` to :meth:`DataFrame.from_records`, use :meth:`DataFrame.set_index` or :meth:`DataFrame.drop` instead (:issue:`51353`)
- Deprecated accepting slices in :meth:`DataFrame.take`, call ``obj[slicer]`` or pass a sequence of integers instead (:issue:`51539`)
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -8677,7 +8677,7 @@ def groupby(
as_index: bool = True,
sort: bool = True,
group_keys: bool = True,
observed: bool = False,
observed: bool | lib.NoDefault = lib.no_default,
dropna: bool = True,
) -> DataFrameGroupBy:
if axis is not lib.no_default:
Expand Down
21 changes: 18 additions & 3 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,7 @@ class providing the base-class of operations.
cache_readonly,
doc,
)
from pandas.util._exceptions import find_stack_level

from pandas.core.dtypes.cast import ensure_dtype_can_hold_na
from pandas.core.dtypes.common import (
Expand Down Expand Up @@ -905,7 +906,7 @@ def __init__(
as_index: bool = True,
sort: bool = True,
group_keys: bool | lib.NoDefault = True,
observed: bool = False,
observed: bool | lib.NoDefault = lib.no_default,
dropna: bool = True,
) -> None:
self._selection = selection
Expand All @@ -922,7 +923,6 @@ def __init__(
self.keys = keys
self.sort = sort
self.group_keys = group_keys
self.observed = observed
self.dropna = dropna

if grouper is None:
Expand All @@ -932,10 +932,23 @@ def __init__(
axis=axis,
level=level,
sort=sort,
observed=observed,
observed=False if observed is lib.no_default else observed,
dropna=self.dropna,
)

if observed is lib.no_default:
if any(ping._passed_categorical for ping in grouper.groupings):
warnings.warn(
"The default of observed=False is deprecated and will be changed "
"to True in a future version of pandas. Pass observed=False to "
"retain current behavior or observed=True to adopt the future "
"default and silence this warning.",
FutureWarning,
stacklevel=find_stack_level(),
)
observed = False
self.observed = observed

self.obj = obj
self.axis = obj._get_axis_number(axis)
self.grouper = grouper
Expand Down Expand Up @@ -2125,6 +2138,8 @@ def _value_counts(
result_series.index.droplevel(levels),
sort=self.sort,
dropna=self.dropna,
# GH#43999 - deprecation of observed=False
observed=False,
).transform("sum")
result_series /= indexed_group_size

Expand Down
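The `groupby.py` change uses a sentinel default so that "argument not passed" can be told apart from an explicit `observed=False`. A standalone sketch of that pattern in plain Python follows; `NoDefault`, `no_default`, and `resolve_observed` are hypothetical stand-ins written for illustration, not the pandas internals.

```python
import warnings
from enum import Enum


class NoDefault(Enum):
    # Hypothetical stand-in for the lib.no_default sentinel in the diff.
    no_default = "NO_DEFAULT"


no_default = NoDefault.no_default


def resolve_observed(observed=no_default, has_categorical_key=False):
    """Return the effective ``observed`` value.

    Warn only when the caller relied on the default and at least one
    grouper is categorical, mirroring the logic in the hunk above.
    """
    if observed is no_default:
        if has_categorical_key:
            warnings.warn(
                "The default of observed=False is deprecated.",
                FutureWarning,
                stacklevel=2,
            )
        return False
    return observed


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = resolve_observed(has_categorical_key=True)  # warns
    explicit = resolve_observed(observed=False, has_categorical_key=True)  # silent
    plain = resolve_observed(has_categorical_key=False)  # silent

future_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]
print(len(future_warnings))
```

A plain `False` default could not support this, because the function would have no way to know whether the caller chose `False` deliberately.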
6 changes: 5 additions & 1 deletion pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -723,7 +723,11 @@ def _format_duplicate_message(self) -> DataFrame:
duplicates = self[self.duplicated(keep="first")].unique()
assert len(duplicates)

out = Series(np.arange(len(self))).groupby(self).agg(list)[duplicates]
out = (
Series(np.arange(len(self)))
.groupby(self, observed=False)
.agg(list)[duplicates]
)
if self._is_multi:
# test_format_duplicate_labels_message_multi
# error: "Type[Index]" has no attribute "from_tuples" [attr-defined]
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -1999,7 +1999,7 @@ def groupby(
as_index: bool = True,
sort: bool = True,
group_keys: bool = True,
observed: bool = False,
observed: bool | lib.NoDefault = lib.no_default,
dropna: bool = True,
) -> SeriesGroupBy:
from pandas.core.groupby.generic import SeriesGroupBy
Expand Down
5 changes: 5 additions & 0 deletions pandas/core/shared_docs.py
Original file line number Diff line number Diff line change
Expand Up @@ -154,6 +154,11 @@
This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.

.. deprecated:: 2.1.0

The default value will change to True in a future version of pandas.

dropna : bool, default True
If True, and if group keys contain NA values, NA values together
with row/column will be dropped.
Expand Down
2 changes: 1 addition & 1 deletion pandas/plotting/_matplotlib/boxplot.py
Original file line number Diff line number Diff line change
Expand Up @@ -254,7 +254,7 @@ def _grouped_plot_by_column(
return_type=None,
**kwargs,
):
grouped = data.groupby(by)
grouped = data.groupby(by, observed=False)
if columns is None:
if not isinstance(by, (list, tuple)):
by = [by]
Expand Down
4 changes: 2 additions & 2 deletions pandas/tests/groupby/aggregate/test_aggregate.py
Original file line number Diff line number Diff line change
Expand Up @@ -1250,7 +1250,7 @@ def test_groupby_single_agg_cat_cols(grp_col_dict, exp_data):

input_df = input_df.astype({"cat": "category", "cat_ord": "category"})
input_df["cat_ord"] = input_df["cat_ord"].cat.as_ordered()
result_df = input_df.groupby("cat").agg(grp_col_dict)
result_df = input_df.groupby("cat", observed=False).agg(grp_col_dict)

# create expected dataframe
cat_index = pd.CategoricalIndex(
Expand Down Expand Up @@ -1289,7 +1289,7 @@ def test_groupby_combined_aggs_cat_cols(grp_col_dict, exp_data):

input_df = input_df.astype({"cat": "category", "cat_ord": "category"})
input_df["cat_ord"] = input_df["cat_ord"].cat.as_ordered()
result_df = input_df.groupby("cat").agg(grp_col_dict)
result_df = input_df.groupby("cat", observed=False).agg(grp_col_dict)

# create expected dataframe
cat_index = pd.CategoricalIndex(
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/groupby/test_apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -883,7 +883,7 @@ def test_apply_multi_level_name(category):
df = DataFrame(
{"A": np.arange(10), "B": b, "C": list(range(10)), "D": list(range(10))}
).set_index(["A", "B"])
result = df.groupby("B").apply(lambda x: x.sum())
result = df.groupby("B", observed=False).apply(lambda x: x.sum())
tm.assert_frame_equal(result, expected)
assert df.index.names == ["A", "B"]

Expand Down
32 changes: 22 additions & 10 deletions pandas/tests/groupby/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -739,7 +739,7 @@ def test_categorical_series(series, data):
# Group the given series by a series with categorical data type such that group A
# takes indices 0 and 3 and group B indices 1 and 2, obtaining the values mapped in
# the given data.
groupby = series.groupby(Series(list("ABBA"), dtype="category"))
groupby = series.groupby(Series(list("ABBA"), dtype="category"), observed=False)
result = groupby.aggregate(list)
expected = Series(data, index=CategoricalIndex(data.keys()))
tm.assert_series_equal(result, expected)
Expand Down Expand Up @@ -1115,7 +1115,7 @@ def test_groupby_multiindex_categorical_datetime():
"values": np.arange(9),
}
)
result = df.groupby(["key1", "key2"]).mean()
result = df.groupby(["key1", "key2"], observed=False).mean()

idx = MultiIndex.from_product(
[
Expand Down Expand Up @@ -1291,8 +1291,8 @@ def test_seriesgroupby_observed_apply_dict(df_cat, observed, index, data):

def test_groupby_categorical_series_dataframe_consistent(df_cat):
# GH 20416
expected = df_cat.groupby(["A", "B"])["C"].mean()
result = df_cat.groupby(["A", "B"]).mean()["C"]
expected = df_cat.groupby(["A", "B"], observed=False)["C"].mean()
result = df_cat.groupby(["A", "B"], observed=False).mean()["C"]
tm.assert_series_equal(result, expected)


Expand All @@ -1303,11 +1303,11 @@ def test_groupby_categorical_axis_1(code):
cat = Categorical.from_codes(code, categories=list("abc"))
msg = "DataFrame.groupby with axis=1 is deprecated"
with tm.assert_produces_warning(FutureWarning, match=msg):
gb = df.groupby(cat, axis=1)
gb = df.groupby(cat, axis=1, observed=False)
result = gb.mean()
msg = "The 'axis' keyword in DataFrame.groupby is deprecated"
with tm.assert_produces_warning(FutureWarning, match=msg):
gb2 = df.T.groupby(cat, axis=0)
gb2 = df.T.groupby(cat, axis=0, observed=False)
expected = gb2.mean().T
tm.assert_frame_equal(result, expected)

Expand Down Expand Up @@ -1478,7 +1478,7 @@ def test_series_groupby_categorical_aggregation_getitem():
df = DataFrame(d)
cat = pd.cut(df["foo"], np.linspace(0, 20, 5))
df["range"] = cat
groups = df.groupby(["range", "baz"], as_index=True, sort=True)
groups = df.groupby(["range", "baz"], as_index=True, sort=True, observed=False)
result = groups["foo"].agg("mean")
expected = groups.agg("mean")["foo"]
tm.assert_series_equal(result, expected)
Expand Down Expand Up @@ -1539,7 +1539,7 @@ def test_read_only_category_no_sort():
{"a": [1, 3, 5, 7], "b": Categorical([1, 1, 2, 2], categories=Index(cats))}
)
expected = DataFrame(data={"a": [2.0, 6.0]}, index=CategoricalIndex(cats, name="b"))
result = df.groupby("b", sort=False).mean()
result = df.groupby("b", sort=False, observed=False).mean()
tm.assert_frame_equal(result, expected)


Expand Down Expand Up @@ -1583,7 +1583,7 @@ def test_sorted_missing_category_values():
dtype="category",
)

result = df.groupby(["bar", "foo"]).size().unstack()
result = df.groupby(["bar", "foo"], observed=False).size().unstack()

tm.assert_frame_equal(result, expected)

Expand Down Expand Up @@ -1748,7 +1748,7 @@ def test_groupby_categorical_indices_unused_categories():
"col": range(3),
}
)
grouped = df.groupby("key", sort=False)
grouped = df.groupby("key", sort=False, observed=False)
result = grouped.indices
expected = {
"b": np.array([0, 1], dtype="intp"),
Expand Down Expand Up @@ -2013,3 +2013,15 @@ def test_many_categories(as_index, sort, index_kind, ordered):
expected = DataFrame({"a": Series(index), "b": data})

tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("cat_columns", ["a", "b", ["a", "b"]])
@pytest.mark.parametrize("keys", ["a", "b", ["a", "b"]])
def test_groupby_default_depr(cat_columns, keys):
# GH#43999
df = DataFrame({"a": [1, 1, 2, 3], "b": [4, 5, 6, 7]})
df[cat_columns] = df[cat_columns].astype("category")
msg = "The default of observed=False is deprecated"
klass = FutureWarning if set(cat_columns) & set(keys) else None
with tm.assert_produces_warning(klass, match=msg):
df.groupby(keys)
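Outside the pandas test suite, the behavior exercised by `test_groupby_default_depr` can be checked with the standard `warnings` module. This is a sketch, not part of the PR; the helper `observed_warnings` is hypothetical. Whether the implicit call warns depends on the installed pandas version, so only the explicit calls are expected to be silent everywhere.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 3], "b": [4, 5, 6, 7]})
df["a"] = df["a"].astype("category")


def observed_warnings(**kwargs):
    """Collect FutureWarnings about ``observed`` raised by df.groupby("a")."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        df.groupby("a", **kwargs)
    return [
        w
        for w in caught
        if issubclass(w.category, FutureWarning) and "observed" in str(w.message)
    ]


# Passing the keyword explicitly, with either value, silences the warning.
explicit_false = observed_warnings(observed=False)
explicit_true = observed_warnings(observed=True)

# Relying on the default warns only on versions carrying this deprecation.
implicit = observed_warnings()
print(len(implicit))
```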
4 changes: 2 additions & 2 deletions pandas/tests/groupby/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1926,7 +1926,7 @@ def test_empty_groupby(

df = df.iloc[:0]

gb = df.groupby(keys, group_keys=False, dropna=dropna)[columns]
gb = df.groupby(keys, group_keys=False, dropna=dropna, observed=False)[columns]

def get_result(**kwargs):
if method == "attr":
Expand Down Expand Up @@ -2638,7 +2638,7 @@ def test_datetime_categorical_multikey_groupby_indices():
"c": Categorical.from_codes([-1, 0, 1], categories=[0, 1]),
}
)
result = df.groupby(["a", "b"]).indices
result = df.groupby(["a", "b"], observed=False).indices
expected = {
("a", Timestamp("2018-01-01 00:00:00")): np.array([0]),
("b", Timestamp("2018-02-01 00:00:00")): np.array([1]),
Expand Down