-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-11: Change default of dropna to False #53094
base: main
Are you sure you want to change the base?
Conversation
Will need to give this some thought. The "even further" possibility is to deprecate the argument entirely and tell users to do |
Wouldn't this not work for certain functions. If you sum over values with na and then drop you get nothing, whilst if the sum function knows to drop you will return a value. This would be important in the groupby. |
@attack68 (#53094 (comment)): Consider also
@jbrockmendel - I also think further deprecation might be desirable for some of these methods. However they would then act as |
This should take the |
@attack68 - Also, deprecating any |
I am generally in support of this PDEP. I think NA information is valuable in many processes and should not be dropped by default. I've just raised the points to try to discover if there are any processes / calculation chains that might be made more difficult by the change, including any user defined functions and/or in groupbys. |
## Timeline | ||
|
||
If accepted, the current `dropna` default would be deprecated as part of pandas | ||
2.x and this deprecation would be enforced in pandas 3.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would users find out about this deprecation? I'm concerned it will create noisy messages. For example, if you were to do df[["a", "b"]].groupby("a").sum()
, would you always get a deprecation message? Would you only get a message if the result would change because the column "a"
had NA
values?
So can you be more specific about how the deprecation would work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. A warning would only be emitted when dropna
is unspecified and an NA value is encountered.
Thanks @jbrockmendel @attack68 @mroeschke @phofl @Dr-Irv - I've updated the PDEP based on your feedback. |
I'm +1 on this change. Do we have a rough timeline for Pandas 3.0? I'm a bit worried about noise from this when working interactively, could we consider implementing the deprecation late in the v2.x release cycle rather than early? Or is the 2.x cycle so short that timing this will not matter? |
I am +1 on this proposal as well.
We would like to release 3.0 in 1 year from 2.0, so April 3rd 2024 ideally. 2.1 should be released in August 2.1 where this deprecation could be introduced |
@mroeschke: Is there then not going to be a pandas 2.2? @topper-123: I'm happy to wait as long as possible in 2.x. The deprecation will be straightforward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very much in favour, I've done value_counts
in the past and thought "oh, no missing values, all good", only to later be bitten by nonsense forecasts
+1 (sorry it's taken me some time to get round to this)
Thanks for the PDEP! I am +1 on the proposal for the groupby-related methods (the variations of I am less sure that we should also do this for other methods, or at least I think those merit some specific case-by-case discussion (AFAIK the linked previous discussions were all about the groupby/value_counts-related cases?)
|
I think there should be a 2.2 release in December with the every-4-months for a minor release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this proposal
SeriesGroupBy.value_counts | ||
DataFrameGroupBy.nunique | ||
DataFrameGroupBy.value_counts | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might be missing a couple functions here.
This is the complete list (made using the keyword inspector script I posted on slack a while back).
<class 'pandas.core.arrays.categorical.Categorical'>.value_counts
<class 'pandas.core.indexes.category.CategoricalIndex'>.nunique
<class 'pandas.core.indexes.category.CategoricalIndex'>.value_counts
<class 'pandas.core.frame.DataFrame'>.groupby
<class 'pandas.core.frame.DataFrame'>.mode
<class 'pandas.core.frame.DataFrame'>.nunique
<class 'pandas.core.frame.DataFrame'>.pivot_table
<class 'pandas.core.frame.DataFrame'>.stack
<class 'pandas.core.frame.DataFrame'>.to_hdf
<class 'pandas.core.frame.DataFrame'>.value_counts
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.nunique
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.value_counts
<class 'pandas.io.pytables.HDFStore'>.append
<class 'pandas.io.pytables.HDFStore'>.append_to_multiple
<class 'pandas.io.pytables.HDFStore'>.put
<class 'pandas.core.indexes.base.Index'>.nunique
<class 'pandas.core.indexes.base.Index'>.value_counts
<class 'pandas.core.indexes.interval.IntervalIndex'>.nunique
<class 'pandas.core.indexes.interval.IntervalIndex'>.value_counts
<class 'pandas.core.indexes.multi.MultiIndex'>.nunique
<class 'pandas.core.indexes.multi.MultiIndex'>.value_counts
<class 'pandas.core.indexes.period.PeriodIndex'>.nunique
<class 'pandas.core.indexes.period.PeriodIndex'>.value_counts
<class 'pandas.core.indexes.range.RangeIndex'>.nunique
<class 'pandas.core.indexes.range.RangeIndex'>.value_counts
<class 'pandas.core.series.Series'>.groupby
<class 'pandas.core.series.Series'>.mode
<class 'pandas.core.series.Series'>.nunique
<class 'pandas.core.series.Series'>.to_hdf
<class 'pandas.core.series.Series'>.value_counts
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.nunique
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.value_counts
crosstab
lreshape
pivot_table
value_counts
I think the missing ones are
crosstab
, lreshape
, HDFStore.put|append|append_to_multiple
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it explicit with all the people starting to approve: I am +1 on changing it for groupby/value_counts, but -1 on the proposal as it is currently written.
I think all the other group of methods (other than groupby/value_counts) need a separate motivation (and maybe discussion), see my comment at #53094 (comment)
Thanks @jorisvandenbossche - for now I'd like to move discussions on the behavior of
Emphasis mine. Certainly no objection to such discussions, but I do think there is high value in having a consistent default for
I agree this could either have
pandas impelments Thanks @lithomas1! I indeed missed checking the top namespace - will add them to the PDEP. Some thoughts on each:
|
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@rhshadrach what's the status here? was this announced to solicit feedback from the community? |
Thanks for the ping here @simonjayhawkins - I've been meaning to post an update. I put the brakes on this from the feedback in #53094 (comment); namely the behavior of In addition, I think we should decide to have In my mind, there are two ways forward:
I am okay with either way, but would lean toward the first. |
cc @pandas-dev/pandas-core