Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: behaviour for the "str.center()" string method for the pyarrow-backed string dtype #54807

Closed
Tracked by #54792
jorisvandenbossche opened this issue Aug 28, 2023 · 6 comments · Fixed by #59624
Closed
Tracked by #54792
Assignees
Labels
API Design Arrow pyarrow functionality Strings String extension data type and string data
Milestone

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 28, 2023

From #54533 (comment) (and relevant for new String dtype #54792)

In general the pyarrow backed string dtype is meant to behave the same as the existing object-dtype or non-pyarrow backed StringDtype, so users in generally should have a smooth way of upgrading. In general most string compute functions in pyarrow have similar behaviour, or we still have a few places where we fall back to the original object-based methods.

One specific method that currently has a different behaviour is str.center:

>>> pd.Series(["a", "bb", "cccc"], dtype="object").str.center(5, fillchar="X")
0    XXaXX
1    XXbbX
2    Xcccc
dtype: object

>>> pd.Series(["a", "bb", "cccc"], dtype="string[python]").str.center(5, fillchar="X")
0    XXaXX
1    XXbbX
2    Xcccc
dtype: string

>>> pd.Series(["a", "bb", "cccc"], dtype="string[pyarrow]").str.center(5, fillchar="X")
0    XXaXX
1    XXbbX
2    Xcccc
dtype: string

>>> pd.Series(["a", "bb", "cccc"], dtype="string[pyarrow_numpy]").str.center(5, fillchar="X")
0    XXaXX
1    XbbXX    # <-- aligned left instead of right
2    ccccX    # <-- aligned left instead of right
dtype: string

The first three dtypes all give the same result, but for the last one using StringDtype(storage="pyarrow_numpy"), the strings are "aligned" to the left instead of right in case an uneven fill chars have to be used.

This is a behaviour we inherited from the Python str center method, but the pyarrow pyarrow.compute.ascii_center function apparently made the opposite choice (I don't think was necessarily an intentional choice to deviate from this; there was no discussion about this on the PR (apache/arrow#10586), which did add the comment "If odd number of spaces, put the extra space on the left" to the relevant place in the code that handles this).

This is a potentially breaking change for users, so to start with should decide if we want to keep the current behaviour or not for the future string dtype. On the short term, we can fall back to the python string objects method (like we do for the other string dtypes). But we can also ask pyarrow to add a keyword to control this aspect of the compute function (that should be an easy change), so that in the future we can again use the faster pyarrow function.

@mroeschke
Copy link
Member

xref apache/arrow#15053

@jorisvandenbossche jorisvandenbossche added this to the 3.0 milestone Feb 28, 2024
@jorisvandenbossche
Copy link
Member Author

I added a lean_left_on_odd_padding keyword upstream in pyarrow to control this behaviour, and this is released with pyarrow 17.0. So I would propose that we fallback to the object dtype based implementation for older versions, and only use pyarrow for 17.0+ with this new keyword, to ensure we can give consistent (and not breaking) behaviour.

(at least for the new NaN variant of the String dtype under PDEP 14)

@jorisvandenbossche jorisvandenbossche self-assigned this Aug 12, 2024
@jbrockmendel
Copy link
Member

i just came across this in trying to see if ArrowStringArray._str_pad could re-use ArrowEA._str_pad. The pd.ArrowDtype(pa.string()) behavior matches the pyarrow_numpy behavior from the OP. Does the proposal to fall back to the object behavior for older pyarrow and use the new keyword for newer pyarrow apply to the ArrowEA._str_pad method too? If so I'm +1. (i guess im +1 either way, but more enthusiastic if we can get across-the-board consistency and de-duplication)

@jorisvandenbossche
Copy link
Member Author

I am fine with doing that across the board, but up to @mroeschke to decide for ArrowExtensionArray

@mroeschke
Copy link
Member

I'm not super keen on having ArrowEA falling back in general, but if this would just be temporary until pyarrow 17 is our minimum version (where it sounds like we no longer need the fallback) then I would be OK with the logic sharing for code-deduplication.

@jbrockmendel
Copy link
Member

jbrockmendel commented Aug 27, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Arrow pyarrow functionality Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants