Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) #59616

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 26, 2024

This ensures to propagate NaN as False for all string predicate methods (i.e. the ones that typically (absent of NA values) return a boolean result) if using the NaN-variant string dtype.
Additionally, for the few methods that have an na keyword that controls the value to propagate, I updated the default to no_default. There is some related discussion about how this na keyword should be treated if it is passed a non-boolean value in #59561

Still have to update docstrings and type annotations.

xref #54792

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 27, 2024
@WillAyd
Copy link
Member

WillAyd commented Aug 27, 2024

Does this modify categorical behavior as well? I assume not, but the reason I ask is having come across #36241 when looking for motivation for the current design

I'm somewhat unsure about changing this, mostly from a fear of changing things without comprehensively solving them. On one hand, this aligns behavior with floating point and temporal types, but moves away from the nullable extension type behavior. I assume in its current state it also introduces a discrepancy between string and categorical, so whether this is a net positive or not is hard to say

@jorisvandenbossche
Copy link
Member Author

Does this modify categorical behavior as well? I assume not, but the reason I ask is having come across #36241 when looking for motivation for the current design

Ah, good question. It seems indeed not:

>>> pd.Series(["a", None], dtype=pd.StringDtype(na_value=np.nan)).str.startswith("a") 
0     True
1    False
dtype: bool

>>> pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=np.nan)).astype("category").str.startswith("a")
0     True
1      NaN
dtype: object

I updated the tests also covering category, but seems I did that a bit blindly without actually considering the category case.
Now, given that the result is a plain bool array (or object if there are missing values, when starting with object dtype categories), and not a categorical of bool categories, I think the end result should just be the same for both examples above?

@WillAyd
Copy link
Member

WillAyd commented Aug 27, 2024

I think the end result should just be the same for both examples above?

Really hard to say; regardless of which path we choose, there is going to be some inconsistency in missing value handling (until PDEP-16)

I normally prefer punting fixes like this until they can be done more comprehensively, but overall I'm +/0 on this one.

Worth getting input from the wider team since this might be a breaking change. I think @rhshadrach has expressed interest in maintaining the status quo for the object-dtype with NaN in the past (sorry if misquoting!)

@jorisvandenbossche
Copy link
Member Author

maintaining the status quo for the object-dtype with NaN in the past

To be clear, this PR is not changing the behaviour if you have object dtype, only for the NaN-variant of StringDtype (I know that many cases of object dtype will of course become that new str dtype in 3.0, but just to be explicit about the scope of the current changes)

I think the end result should just be the same for both examples above?

Really hard to say; regardless of which path we choose, there is going to be some inconsistency in missing value handling (until PDEP-16)

FWIW, there are certainly good arguments either way for propagating NaNs vs using False for the new default string dtype, but if we choose one or the other behaviour, I personally don't really see a reason to not do that consistently for category[str].
(meaning, if going through with this PR, I think I should also update string categories to follow the same change)

@WillAyd
Copy link
Member

WillAyd commented Aug 27, 2024

(meaning, if going through with this PR, I think I should also update string categories to follow the same change)

Definitely agree on that point

@jorisvandenbossche
Copy link
Member Author

Updated this after some of the recent refactor PRs, and to align the categorical[str] behaviour to also propagate NaNs as False.

@rhshadrach
Copy link
Member

this [PR] aligns behavior with floating point and temporal types

What is an example operation with floating point types that is comparable?

I think @rhshadrach has expressed interest in maintaining the status quo for the object-dtype with NaN in the past (sorry if misquoting!)

No - only that we be diligent in understanding the impact on users when making changes, and evaluate them carefully.

@rhshadrach
Copy link
Member

I think I see, from the linked issue

"String columns will follow NaN-semantics for missing values, where NaN gives False in boolean operations such as comparisons or predicates", i.e. ser == "a"

So a comparable operation would be ser > 0.

I'm onboard with the two phase approach to strings:

  1. Introduce NaN-semantic strings in a consistent manner
  2. Transition to NA-semantics (across all dtypes)

and I see this as being a part of that two phase approach.

@jorisvandenbossche
Copy link
Member Author

@WillAyd @jbrockmendel this is ready for review

@WillAyd
Copy link
Member

WillAyd commented Sep 9, 2024

Should we include the object dtype in this? I do worry that there is going to be an intermediate phase where users have a lot of .astype(object) calls laying around that now have different NA-semantics than what they get by default.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Sep 10, 2024

For object dtype the situation is a bit more complex, because object dtype can contain anything, not just strings and missing values. And in that case, we default to return NaN for anything that is not a string, and do this in all methods and not just boolean predicates:

>>> pd.Series(["string", 10, np.nan], dtype=object).str.startswith("str")
0    True
1     NaN
2     NaN
dtype: object

>>> pd.Series(["string", 10, np.nan], dtype=object).str.upper()
0    STRING
1       NaN
2       NaN
dtype: object

Of course we can say that for boolean predicates, we also return False for anything that is not a string and so would have [True, False, False] in the case above.

But I would personally start with doing that for just StringDtype. It's certainly a good point and we can discuss that more (maybe on the issue?), but in any case it will be easier to do that in a separate PR, because changing it for object dtype is not something we want to backport to 2.3

@WillAyd
Copy link
Member

WillAyd commented Sep 10, 2024

Sounds good. I don't have a strong preference - just trying to think through what inconsistencies this will surface.

On board with the change if the rest of the team is

@jorisvandenbossche jorisvandenbossche force-pushed the string-dtype-predicates-nan-propagation branch from 452e992 to e401b55 Compare September 16, 2024 09:01
@jorisvandenbossche
Copy link
Member Author

Another ping here. I have to update this to resolve conflicts, but the diff itself should still be reviewable.

@jorisvandenbossche jorisvandenbossche merged commit 88554d0 into pandas-dev:main Oct 10, 2024
51 checks passed
Copy link

lumberbot-app bot commented Oct 10, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 88554d0ca77c7b80605a34f9ece838b834db8720
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #59616: String dtype: propagate NaNs as False in predicate methods (eg .str.startswith)'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-59616-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #59616 on branch 2.3.x (String dtype: propagate NaNs as False in predicate methods (eg .str.startswith))"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@jorisvandenbossche
Copy link
Member Author

Manual backport -> #60014

jorisvandenbossche added a commit that referenced this pull request Oct 11, 2024
…ethods (eg .str.startswith) (#59616) (#60014)

* String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) (#59616)

(cherry picked from commit 88554d0)

* ignore object dtype inference warnings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants