Describe the enhancement requested
The type inference for schema detection that's implemented here
does not distinguish between signed and unsigned integer types. This leads to the following behaviour, and I think it would be nice if it were more consistent.
import pandas as pd
import pyarrow as pa

# 2**63 is one more than the int64 maximum, so it only fits in an unsigned 64-bit integer.
data_1 = [{"a": pow(2, 63)}]
schema_1 = pa.Schema.from_pandas(pd.DataFrame(data_1))
print(schema_1)  # takes a different codepath, correctly infers uint64

# Same value, but nested in a list.
data_2 = [{"a": [pow(2, 63)]}]
schema_2 = pa.Schema.from_pandas(pd.DataFrame(data_2))  # crashes with OverflowError
Here's the backtrace that you get when trying to compute schema_2:
Traceback (most recent call last):
  File "/work/arrow/foo.py", line 5, in <module>
    schema = pa.Schema.from_pandas(pd.DataFrame(data))
  File "pyarrow/types.pxi", line 3104, in pyarrow.lib.Schema.from_pandas
  File "/work/arrow/pyarrow-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 562, in dataframe_to_types
    type_ = pa.array(c, from_pandas=True).type
  File "pyarrow/array.pxi", line 360, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 87, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long
Is that something that can be changed or would that likely have too many unintended consequences?
I've tested this with pyarrow version 19.0.0 on Ubuntu 24.04.
Component(s)
C++