PDEP-11: Change default of dropna to False #53094

Open: wants to merge 4 commits into base: main
78 changes: 78 additions & 0 deletions web/pandas/pdeps/0011-dropna-default.md
@@ -0,0 +1,78 @@
# PDEP-11: dropna default in pandas

- Created: 4 May 2023
- Status: Under discussion
- Discussion: [PR #53094](https://github.com/pandas-dev/pandas/pull/53094)
- Authors: [Richard Shadrach](https://github.com/rhshadrach)
- Revision: 1

## Abstract

Throughout pandas, almost all of the methods that have a `dropna` argument default
to `True`. Because this is the default, NA values can be silently dropped.
This PDEP proposes to deprecate the current default value of `True` and change it
to `False` in the next major release of pandas.

## Motivation and Scope

Upon seeing the output for a Series `ser`:

```python
print(ser.value_counts())

1    3
2    1
dtype: Int64
```

users may be surprised that the Series can contain NA values. Operating on the
data under the assumption that NA values are not present can then produce
erroneous results. The same issue can occur with `groupby`, which can also be used to produce
detailed summary statistics of data. We think it is not unreasonable that an
experienced pandas user seeing the code

    df[["a", "b"]].groupby("a").sum()

would describe this operation as something like the following.

> For each unique value in column `a`, compute the sum of corresponding values
> in column `b` and return the results in a DataFrame indexed by the unique
> values of `a`.

This is correct, except that NA values in the column `a` will be dropped from
the computation. That pandas is taking this additional step in the computation
is not apparent from the code, and can surprise users.
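
To make the surprise concrete, here is a minimal sketch comparing the current default with `dropna=False`. The data values are made up for illustration and are not taken from the PDEP.

```python
import pandas as pd

# Hypothetical data: one NA hidden among the values.
ser = pd.Series([1, 1, 1, 2, pd.NA], dtype="Int64")

default_counts = ser.value_counts()             # dropna=True: the NA row is omitted
all_counts = ser.value_counts(dropna=False)     # the NA row gets its own count

# Hypothetical frame where the grouping column "a" has a missing key.
df = pd.DataFrame({"a": [1, 1, None], "b": [10, 20, 30]})

default_sum = df[["a", "b"]].groupby("a").sum()               # row with NA key is dropped
full_sum = df[["a", "b"]].groupby("a", dropna=False).sum()    # NA key forms its own group

print(default_counts)
print(all_counts)
print(default_sum)
print(full_sum)
```

With the default, the counts and the group sums no longer add up to the totals of the original data, and nothing in the calling code signals that rows were left out.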

## Detailed Description

We propose to deprecate the current default of `dropna` and change it to
`False` across all applicable methods. The following methods have a `dropna`
argument; those marked with a `*` already default to `False`.

```python
Series.groupby
Series.mode
Series.nunique
Series.to_hdf*
Series.value_counts
DataFrame.groupby
DataFrame.mode
DataFrame.nunique
DataFrame.pivot_table
DataFrame.stack
DataFrame.to_hdf*
DataFrame.value_counts
SeriesGroupBy.nunique
SeriesGroupBy.value_counts
DataFrameGroupBy.nunique
DataFrameGroupBy.value_counts
```

Review comment (Member):

I think you might be missing a couple of functions here.
This is the complete list (made using the keyword inspector script I posted on Slack a while back).

<class 'pandas.core.arrays.categorical.Categorical'>.value_counts
<class 'pandas.core.indexes.category.CategoricalIndex'>.nunique
<class 'pandas.core.indexes.category.CategoricalIndex'>.value_counts
<class 'pandas.core.frame.DataFrame'>.groupby
<class 'pandas.core.frame.DataFrame'>.mode
<class 'pandas.core.frame.DataFrame'>.nunique
<class 'pandas.core.frame.DataFrame'>.pivot_table
<class 'pandas.core.frame.DataFrame'>.stack
<class 'pandas.core.frame.DataFrame'>.to_hdf
<class 'pandas.core.frame.DataFrame'>.value_counts
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.nunique
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.value_counts
<class 'pandas.io.pytables.HDFStore'>.append
<class 'pandas.io.pytables.HDFStore'>.append_to_multiple
<class 'pandas.io.pytables.HDFStore'>.put
<class 'pandas.core.indexes.base.Index'>.nunique
<class 'pandas.core.indexes.base.Index'>.value_counts
<class 'pandas.core.indexes.interval.IntervalIndex'>.nunique
<class 'pandas.core.indexes.interval.IntervalIndex'>.value_counts
<class 'pandas.core.indexes.multi.MultiIndex'>.nunique
<class 'pandas.core.indexes.multi.MultiIndex'>.value_counts
<class 'pandas.core.indexes.period.PeriodIndex'>.nunique
<class 'pandas.core.indexes.period.PeriodIndex'>.value_counts
<class 'pandas.core.indexes.range.RangeIndex'>.nunique
<class 'pandas.core.indexes.range.RangeIndex'>.value_counts
<class 'pandas.core.series.Series'>.groupby
<class 'pandas.core.series.Series'>.mode
<class 'pandas.core.series.Series'>.nunique
<class 'pandas.core.series.Series'>.to_hdf
<class 'pandas.core.series.Series'>.value_counts
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.nunique
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.value_counts
crosstab
lreshape
pivot_table
value_counts

I think the missing ones are
crosstab, lreshape, HDFStore.put|append|append_to_multiple.
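
A list like the one above could be generated by inspecting signatures. The sketch below is not the reviewer's script (which is not shown in this thread); the helper name `methods_with_param` and the choice of classes and functions to scan are assumptions made for illustration.

```python
import inspect

import pandas as pd


def methods_with_param(obj, param="dropna"):
    """Map public callable names on `obj` to the default of `param`, if accepted."""
    found = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            member = getattr(obj, name)
        except Exception:
            continue  # some attributes cannot be resolved on the class itself
        if not callable(member):
            continue
        try:
            sig = inspect.signature(member)
        except (TypeError, ValueError):
            continue  # no introspectable signature
        if param in sig.parameters:
            found[name] = sig.parameters[param].default
    return found


series_methods = methods_with_param(pd.Series)
frame_methods = methods_with_param(pd.DataFrame)

# Top-level functions are checked the same way, by name.
top_level = [
    name
    for name in ("crosstab", "lreshape", "pivot_table")
    if "dropna" in inspect.signature(getattr(pd, name)).parameters
]

for name, default in sorted(series_methods.items()):
    print(f"Series.{name}: default={default!r}")
```

The exact output depends on the installed pandas version, so the printed defaults should be checked against the release being targeted.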


## Timeline

If accepted, the current `dropna` default would be deprecated as part of pandas
2.x and this deprecation would be enforced in pandas 3.0.

Review comment (Contributor):

How would users find out about this deprecation? I'm concerned it will create noisy messages. For example, if you were to do df[["a", "b"]].groupby("a").sum(), would you always get a deprecation message? Would you only get a message if the result would change because the column "a" had NA values?

So can you be more specific about how the deprecation would work?

Reply (Member Author):

Added. A warning would only be emitted when dropna is unspecified and an NA value is encountered.
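
One way the warning behavior described in this reply might look is sketched below. This is an assumption about the mechanism, not the pandas implementation: the wrapper `value_counts_sketch`, the sentinel `_UNSET`, and the warning text are all hypothetical.

```python
import warnings

import pandas as pd

_UNSET = object()  # hypothetical sentinel meaning "caller did not pass dropna"


def value_counts_sketch(ser, dropna=_UNSET):
    """Warn only when dropna is unspecified AND the data actually contains NA."""
    if dropna is _UNSET:
        if ser.isna().any():
            warnings.warn(
                "The default of dropna will change from True to False; "
                "pass dropna explicitly to silence this warning.",
                FutureWarning,
                stacklevel=2,
            )
        dropna = True  # keep the current behavior during the deprecation period
    return ser.value_counts(dropna=dropna)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value_counts_sketch(pd.Series([1, 1, 2]))                 # no NA: silent
    value_counts_sketch(pd.Series([1, None, 2]), dropna=True)  # explicit: silent
    value_counts_sketch(pd.Series([1, None, 2]))              # unspecified + NA: warns

dropna_warnings = [w for w in caught if "dropna" in str(w.message)]
print(len(dropna_warnings))
```

Under this scheme, code whose results would not change stays silent, which addresses the concern about noisy messages at the cost of an extra NA check on each call.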


## PDEP History

- 4 May 2023: Initial draft