Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: implement sum reduction #59853

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Sep 20, 2024

Based on the feedback in #59328, implementing sum() for the string dtype

xref #54792

@jorisvandenbossche jorisvandenbossche added Strings String extension data type and string data Reduction Operations sum, mean, min, max, etc. labels Sep 20, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Sep 20, 2024
@jorisvandenbossche jorisvandenbossche changed the title String dtype: implemen sum reduction String dtype: implement sum reduction Sep 20, 2024
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review September 21, 2024 13:51
Comment on lines +1732 to +1734
if pa.types.is_large_string(data.type):
# binary_join only supports string, not large_string
data = data.cast(pa.string())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too familiar here, can this cause unexpected results? If so, should it be documented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can cause overflow error if a single chunk doesn't fit into the string dtype. I suppose this will be very rare, because we are summing here, and that would mean that the single scalar string as result of the sum would be bigger than 2GB (I am not fully sure how well Python will handle such a large str object).

We can indeed document it.
In theory it could also be circumvented by splitting the chunk into multiple chunks (although I have to verify that pyarrow then does not actually concatenate that again under the hood in the binary_join implementation).

@WillAyd
Copy link
Member

WillAyd commented Sep 22, 2024

Hmm wouldn't it be better to just make users cast to object if they want the legacy behavior on this? I think it was intentional to not implement sum on the string dtype; while there may be some niche use cases, more often than not I think its a mistake (and huge performance hit) to have string types summed up

@rhshadrach
Copy link
Member

while there may be some niche use cases, more often than not I think its a mistake (and huge performance hit)

This is not my personal experience.

@jorisvandenbossche
Copy link
Member Author

@WillAyd it was indeed intentional in the past to not implement it, but see the thread in #59328, and if you want to discuss this, let's do that there ;) (I personally don't care too much about this specifically, but can follow some of the arguments given there, and just implementing what has been decided for now)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants