Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: avoid surfacing pyarrow exception in binary operations #59610

Merged

Conversation

jorisvandenbossche
Copy link
Member

For the pyarrow storage, in the generic ArrowExtensionArray implementation, we currently just call the pyarrow compute kernel, regardless of the input types are supported or not, potentially resulting in pyarrow errors surfacing to the user (typically pyarrow.lib.ArrowNotImplementedError).

This has several issues: 1) it's a bit annoying for testing (the tests currently import pyarrow to test the expected error message, which then fails if pyarrow is not installed), but 2) moreover this is a bad user experience IMO: it gives error messages that can be confusing (including "implementation details" from pyarrow) and inconsistent (compared to non-pyarrow backed dtypes), and it actually also gives a wrong exception type (as many of those operations are simply not supported for strings, and are expected to raise a TypeError, instead of a NotIplementedError which gives the impression it might be implemented in the future).

I am currently doing this for all arrow types, but I could also limit this to only those cases where it is used by the ArrowStringArray, and not for other ArrowDtype cases.

xref #54792

@jorisvandenbossche jorisvandenbossche added Error Reporting Incorrect or improved errors from pandas Strings String extension data type and string data Arrow pyarrow functionality labels Aug 26, 2024
Comment on lines 807 to 808
except pa.lib.ArrowNotImplementedError as err:
raise TypeError(self._op_method_error_message(other_original, op)) from err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am doing from err here, so the original pyarrow message like "ArrowNotImplementedError: Function 'binary_join_element_wise' has no kernel matching input types (large_string, int64, large_string)" is still visible up in the traceback.

For the new string dtype, I would actually be fine with suppressing this (using from None) to limit the length of the error traceback. But for usage of ArrowDtype in general, I suppose this information is still very useful to see.

But as I mentioned in the top post, I could also only do this try/except in case we are in ArrowStringArray, and leave the generic ArrowExtensionArray as is (raising the pyarrow error directly)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the from err - I don't see a huge harm in including all of that information

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 27, 2024
@jorisvandenbossche jorisvandenbossche changed the title String dtype: avoid surfacing pyarrow excetion in binary operations String dtype: avoid surfacing pyarrow exception in binary operations Aug 27, 2024
@jorisvandenbossche
Copy link
Member Author

OK, this should be fully ready for review now

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice change - great work!

pytest.mark.xfail(
raises=TypeError, reason="Can only string multiply by an integer."
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice to see these go!

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@WillAyd WillAyd merged commit 67bec1f into pandas-dev:main Aug 27, 2024
47 checks passed
@WillAyd
Copy link
Member

WillAyd commented Aug 27, 2024

thanks @jorisvandenbossche

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality backported Error Reporting Incorrect or improved errors from pandas Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants