Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
The utf8_is_digit kernel in pyarrow.compute does not fully replicate Python's str.isdigit() behavior, especially with certain Unicode digit characters.
For example, the character '³' (U+00B3 SUPERSCRIPT THREE) returns True with Python’s str.isdigit() but returns False when passed to pyarrow.compute.utf8_is_digit.
This divergence leads to downstream inconsistencies, particularly in pandas when using StringDtype(storage="pyarrow").
Reproduction
import pyarrow as pa
import pyarrow.compute as pc
arr = pa.array(['3', '٣', '५', '123', '³'])
print(pc.utf8_is_digit(arr).to_pylist())
Output:
[True, True, True, True, False] # <-- '³' incorrectly returns False
Expected Output (matches str.isdigit()):
[True, True, True, True, True]
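To check other inputs against the same expectation, a small helper can list every value where a predicate disagrees with `str.isdigit()`. This is only a sketch: `find_mismatches` and `arrow_is_digit` are hypothetical names introduced here, and the PyArrow part is skipped if PyArrow is not installed.

```python
def find_mismatches(values, predicate):
    """Return (value, expected, actual) for each value where `predicate`
    disagrees with Python's str.isdigit(). Hypothetical helper."""
    actual = predicate(values)
    return [
        (value, value.isdigit(), got)
        for value, got in zip(values, actual)
        if value.isdigit() != got
    ]


try:
    import pyarrow as pa
    import pyarrow.compute as pc
except ImportError:  # PyArrow is optional for this sketch
    pa = pc = None

if pc is not None:
    def arrow_is_digit(values):
        # Wrap the kernel so it takes and returns plain Python lists.
        return pc.utf8_is_digit(pa.array(values)).to_pylist()

    # On an affected build this should list '³'; on a fixed build it
    # should be empty.
    print(find_mismatches(["3", "٣", "५", "123", "³"], arrow_is_digit))
```

Passing the predicate as a parameter keeps the comparison logic independent of PyArrow, so the same helper could be pointed at a pandas-backed implementation as well.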
Notes
- The issue seems to stem from the implementation of IsDigitUnicode::PredicateCharacterAll, which does not accept characters in the Unicode "No" (Number, Other) category that Python treats as digits, such as the superscript digits (³, ², etc.).
- Python's behavior can be verified as:
print("³".isdigit()) # True
Impact
This affects pandas string operations such as .str.isdigit() when using pyarrow storage. With Python-backed string storage the result matches str.isdigit(), but with pyarrow-backed storage it returns False for characters like '³'.
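To gauge which characters are affected, the standard-library `unicodedata` module can show how Python classifies them. The sketch below is illustrative only (`describe` is a hypothetical helper, and the sample list is an assumption). It suggests that general category "No" alone does not decide the result: '³' is "No" and counts as a digit, while '½' is also "No" and does not.

```python
import unicodedata


def describe(ch):
    """Summarize how Python classifies a single character."""
    return {
        "char": ch,
        "codepoint": f"U+{ord(ch):04X}",
        "category": unicodedata.category(ch),  # e.g. 'Nd' or 'No'
        # unicodedata.digit() returns the default when no digit value exists
        "has_digit_value": unicodedata.digit(ch, None) is not None,
        "isdigit": ch.isdigit(),
        "isdecimal": ch.isdecimal(),
    }


samples = ["3", "٣", "५", "³", "²", "½"]
rows = [describe(ch) for ch in samples]
for row in rows:
    print(row)
```

For these samples, `isdigit` tracks whether the character has a digit value rather than its general category, which is the distinction a kernel aiming to match `str.isdigit()` would presumably need to reproduce.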
System Info
Tested with:
- PyArrow 20.0.0 (pip-installed)
- PyArrow main 0.1.dev17578+g218c886
- Python 3.12
- Debian-based Linux (Ubuntu)
Component(s)
Python