Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDEP-11: Change default of dropna to False #53094

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

rhshadrach
Copy link
Member

cc @pandas-dev/pandas-core

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Deprecate Functionality to remove in pandas PDEP pandas enhancement proposal labels May 5, 2023
@jbrockmendel
Copy link
Member

Will need to give this some thought. The "even further" possibility is to deprecate the argument entirely and tell users to do obj.whatever().dropna() to get the old behavior

@attack68
Copy link
Contributor

attack68 commented May 5, 2023

Will need to give this some thought. The "even further" possibility is to deprecate the argument entirely and tell users to do obj.whatever().dropna() to get the old behavior

Wouldn't this not work for certain functions. If you sum over values with na and then drop you get nothing, whilst if the sum function knows to drop you will return a value. This would be important in the groupby.

@rhshadrach
Copy link
Member Author

rhshadrach commented May 6, 2023

@attack68 (#53094 (comment)): Consider also obj.dropna(...).whatever()

The "even further" possibility is to deprecate the argument entirely and tell users to do obj.whatever().dropna() to get the old behavior

@jbrockmendel - I also think further deprecation might be desirable for some of these methods. However they would then act as dropna=False, and deprecating an argument and changing the behavior to the non-default option is not a great user experience. So I think we should first change the default and then deprecate if desired. But even if we don't deprecate the argument, I would like to see the default change. So these are independent.

@attack68
Copy link
Contributor

attack68 commented May 6, 2023

My point being that this doesn't work:
Screenshot 2023-05-06 at 16 09 33

If you want to get the answer 3, you can currently do df.sum(skipna=True).sum(), where the dropna is within the method. Or, in this trivial case you could wrangle it, df.unstack().dropna().sum(), but that might not work generally?

@phofl
Copy link
Member

phofl commented May 6, 2023

This should take the skipna keyword into consideration, right now we are ignoring NAs consistently, this PDEP would make things kind of inconsistent (not that this is a bad thing, but it should at least get mentioned)

@rhshadrach
Copy link
Member Author

@attack68 - skipna is not part of the PDEP, I'll make this more clear in an update. And I don't think anyone is thinking it should be deprecated in methods like sum.

Also, deprecating any dropna is not part of the PDEP.

@attack68
Copy link
Contributor

attack68 commented May 7, 2023

I am generally in support of this PDEP. I think NA information is valuable in many processes and should not be dropped by default.
I understand skipna is not part of this PDEP and nothing is being deprecated, but it is worth mentioning as @phofl says.

I've just raised the points to try to discover if there are any processes / calculation chains that might be made more difficult by the change, including any user defined functions and/or in groupbys.

## Timeline

If accepted, the current `dropna` default would be deprecated as part of pandas
2.x and this deprecation would be enforced in pandas 3.0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would users find out about this deprecation? I'm concerned it will create noisy messages. For example, if you were to do df[["a", "b"]].groupby("a").sum(), would you always get a deprecation message? Would you only get a message if the result would change because the column "a" had NA values?

So can you be more specific about how the deprecation would work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. A warning would only be emitted when dropna is unspecified and an NA value is encountered.

@rhshadrach
Copy link
Member Author

Thanks @jbrockmendel @attack68 @mroeschke @phofl @Dr-Irv - I've updated the PDEP based on your feedback.

@topper-123
Copy link
Contributor

I'm +1 on this change.

Do we have a rough timeline for Pandas 3.0? I'm a bit worried about noise from this when working interactively, could we consider implementing the deprecation late in the v2.x release cycle rather than early? Or is the 2.x cycle so short that timing this will not matter?

@mroeschke
Copy link
Member

I am +1 on this proposal as well.

Do we have a rough timeline for Pandas 3.0?

We would like to release 3.0 in 1 year from 2.0, so April 3rd 2024 ideally. 2.1 should be released in August 2.1 where this deprecation could be introduced

@rhshadrach
Copy link
Member Author

@mroeschke: Is there then not going to be a pandas 2.2?

@topper-123: I'm happy to wait as long as possible in 2.x. The deprecation will be straightforward.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very much in favour, I've done value_counts in the past and thought "oh, no missing values, all good", only to later be bitten by nonsense forecasts

+1 (sorry it's taken me some time to get round to this)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 30, 2023

Thanks for the PDEP!

I am +1 on the proposal for the groupby-related methods (the variations of groupby and value_counts).

I am less sure that we should also do this for other methods, or at least I think those merit some specific case-by-case discussion (AFAIK the linked previous discussions were all about the groupby/value_counts-related cases?)

  • mode() is a tricky one, because on the one hand it is an aggregation (and all our other aggregations skip NAs by default, although the keyword name here is dropna and not skipna), while on the other hand it is very much related to "counting values" (and for value_counts I think it makes sense to change the default to dropna=False)

  • nunique() is a bit similar as mode(), since it is also an aggregation function, but also related to counts. But for example, SQL COUNT(DISTINCT col) also skips nulls.

  • pivot_table(): it seems that the dropna keyword is used for multiple things? It is passed through to the underlying groupby call (and here it of course makes sense to follow the change in default), but it is also used to post-process the result and to calculate the margins (the docstring says: "If True, rows with a NaN value in any column will be omitted before computing margins."). Have we evaluated what the impact of the change is here?

  • For stack() I was going to say that I thought this was only about dropping NAs that were introduced because of the reshaping operations (and not NAs that already existed before), but it seems to be a bit more complicated than just that. The last example in the docstring illustrates the effect of dropna, and it's also dropping at least one pre-existing NA value. In that case, it seems this is a good change to preserve those original NAs in the output.
    But if you are stacking multiple levels at once, you can write an example where the NAs that you get with dropna=False are actually new NAs, and for this case I am not sure this is necessarily the best behaviour:

    >>> df = pd.DataFrame(np.arange(6).reshape(2,3), columns=pd.MultiIndex.from_tuples([('A','x'), ('A','y'), ('B','z')], names=['Upper', 'Lower']))
    >>> df
    Upper  A     B
    Lower  x  y  z
    0      0  1  2
    1      3  4  5
    >>> df.stack(level=[0,1])
       Upper  Lower
    0  A      x        0.0
              y        1.0
       B      z        2.0
    1  A      x        3.0
              y        4.0
       B      z        5.0
    dtype: float64
    
    >>> df.stack(level=[0,1], dropna=False)
       Upper  Lower
    0  A      x        0.0
              y        1.0
              z        NaN
       B      x        NaN
              y        NaN
              z        2.0
    1  A      x        3.0
              y        4.0
              z        NaN
       B      x        NaN
              y        NaN
              z        5.0
    dtype: float64

@mroeschke
Copy link
Member

@mroeschke: Is there then not going to be a pandas 2.2?

I think there should be a 2.2 release in December with the every-4-months for a minor release

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this proposal

SeriesGroupBy.value_counts
DataFrameGroupBy.nunique
DataFrameGroupBy.value_counts
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be missing a couple functions here.
This is the complete list (made using the keyword inspector script I posted on slack a while back).

<class 'pandas.core.arrays.categorical.Categorical'>.value_counts
<class 'pandas.core.indexes.category.CategoricalIndex'>.nunique
<class 'pandas.core.indexes.category.CategoricalIndex'>.value_counts
<class 'pandas.core.frame.DataFrame'>.groupby
<class 'pandas.core.frame.DataFrame'>.mode
<class 'pandas.core.frame.DataFrame'>.nunique
<class 'pandas.core.frame.DataFrame'>.pivot_table
<class 'pandas.core.frame.DataFrame'>.stack
<class 'pandas.core.frame.DataFrame'>.to_hdf
<class 'pandas.core.frame.DataFrame'>.value_counts
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.nunique
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.value_counts
<class 'pandas.io.pytables.HDFStore'>.append
<class 'pandas.io.pytables.HDFStore'>.append_to_multiple
<class 'pandas.io.pytables.HDFStore'>.put
<class 'pandas.core.indexes.base.Index'>.nunique
<class 'pandas.core.indexes.base.Index'>.value_counts
<class 'pandas.core.indexes.interval.IntervalIndex'>.nunique
<class 'pandas.core.indexes.interval.IntervalIndex'>.value_counts
<class 'pandas.core.indexes.multi.MultiIndex'>.nunique
<class 'pandas.core.indexes.multi.MultiIndex'>.value_counts
<class 'pandas.core.indexes.period.PeriodIndex'>.nunique
<class 'pandas.core.indexes.period.PeriodIndex'>.value_counts
<class 'pandas.core.indexes.range.RangeIndex'>.nunique
<class 'pandas.core.indexes.range.RangeIndex'>.value_counts
<class 'pandas.core.series.Series'>.groupby
<class 'pandas.core.series.Series'>.mode
<class 'pandas.core.series.Series'>.nunique
<class 'pandas.core.series.Series'>.to_hdf
<class 'pandas.core.series.Series'>.value_counts
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.nunique
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.value_counts
crosstab
lreshape
pivot_table
value_counts

I think the missing ones are
crosstab, lreshape, HDFStore.put|append|append_to_multiple.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it explicit with all the people starting to approve: I am +1 on changing it for groupby/value_counts, but -1 on the proposal as it is currently written.
I think all the other group of methods (other than groupby/value_counts) need a separate motivation (and maybe discussion), see my comment at #53094 (comment)

@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 6, 2023

Thanks @jorisvandenbossche - for now I'd like to move discussions on the behavior of dropna for stack (#53515) and pivot_table (#53521) to those issues as I think they will require more lengthy discussion. If at the end they do not sufficiently address your concerns, we can bring them back here.

I am less sure that we should also do this for other methods, or at least I think those merit some specific case-by-case discussion

Emphasis mine. Certainly no objection to such discussions, but I do think there is high value in having a consistent default for dropna across pandas. It leads to less surprises and is less taxing on our users to have a consistent default.

mode() is a tricky one, because on the one hand it is an aggregation... It might be worth checking some other data tools what their behaviour is for "mode" with nulls.

I agree this could either have dropna defaulting to False or skipna defaulting to True. I'm okay either way here and can see arguments for both. At the end of the day I think we just have to make a call as there doesn't seem to be a standard. I'd lean toward changing to skipna so as not to change default behavior but have consistency in default arguments. Other languages/libraries:

  • Those that do not implement mode: R, SQL, GSL (C/C++)
  • Those that implement mode and include nulls by default: polars, SciPy
  • Those that implement mode and skip nulls: pyarrow, pyspark, scikit-learn

nunique() is a bit similar as mode(), since it is also an aggregation function, but also related to counts. But for example, SQL COUNT(DISTINCT col) also skips nulls.

pandas impelments count and this skips NA values (with no option to not skip). That is the method I think should be compared to SQL COUNT, not nunique. In addition, I think it is important that ser.nunique() == len(ser.unique()) and that is currently not the case.

Thanks @lithomas1! I indeed missed checking the top namespace - will add them to the PDEP. Some thoughts on each:

  • crosstab: The dropna argument is only used via passing to pivot_table, so I think we can combine this with the discussion on pivot_table (DEPR: Some dropna behaviors in DataFrame.pivot_table #53521).
  • lreshape: The docs here on dropna are wrong; this is equivalent to calling DataFrame.dropna(how='any') prior to the call to lreshape. I think we can deprecate dropna here outright.
  • HDFStore.put|append|append_to_multiple: These already default to dropna=False.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 7, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 7, 2023
@rhshadrach rhshadrach removed the Stale label Jul 7, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Aug 7, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Aug 7, 2023
@simonjayhawkins
Copy link
Member

@rhshadrach what's the status here? was this announced to solicit feedback from the community?

@rhshadrach
Copy link
Member Author

Thanks for the ping here @simonjayhawkins - I've been meaning to post an update. I put the brakes on this from the feedback in #53094 (comment); namely the behavior of dropna in pivot_table and crosstab. I tried to summarize these issues in #53521.

In addition, I think we should decide to have mode with the default dropna=False or skipna=True. I've argued for the latter, but would be happy with either option. Finally, some opposition was voiced against nunique defaulting to dropna=False; I've argued against this in #53094 (comment) but have not heard if others find my arguments compelling.

In my mind, there are two ways forward:

  • change the default to dropna in all methods except for pivot_table and crosstab and clean those methods up in the future; or
  • wait to change the default of dropna until pivot_table and crosstab are fixed.

I am okay with either way, but would lean toward the first.

cc @jorisvandenbossche

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate PDEP pandas enhancement proposal Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.