
Support inference of unsigned integer types #45312

Open
@ben-freist

Description

Describe the enhancement requested

The type inference for schema detection implemented here:

int_count_ += numpy_dtype_count_;

does not distinguish between signed and unsigned integer types.
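To illustrate, the same value converts fine when the unsigned dtype is already known, and only fails when it has to go through inference on an object array. This comparison is my own sketch, not part of the original report; the np.empty construction just builds the 1-D object array that a pandas column of lists turns into.

import numpy as np
import pyarrow as pa

# With an explicit unsigned dtype there is nothing to infer: works.
print(pa.array(np.array([pow(2, 63)], dtype=np.uint64)).type)  # uint64

# A 1-D object array holding a list has to go through inference,
# where the unsignedness is lost.
obj = np.empty(1, dtype=object)
obj[0] = [pow(2, 63)]
pa.array(obj, from_pandas=True)  # OverflowError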

This leads to the following behaviour; I think it would be nice if it were more consistent.

import pyarrow as pa
import pandas as pd

data_1 = [{"a": pow(2, 63) - 1}]
schema_1 = pa.Schema.from_pandas(pd.DataFrame(data_1))
print(schema_1) # takes a different codepath, correctly infers uint64
data_2 = [{"a": [pow(2, 63) - 1]}]
schema_2 = pa.Schema.from_pandas(pd.DataFrame(data_2)) # crashes

Here's the backtrace that you get when trying to compute schema_2.

Traceback (most recent call last):
  File "/work/arrow/foo.py", line 5, in <module>
    schema = pa.Schema.from_pandas(pd.DataFrame(data))
  File "pyarrow/types.pxi", line 3104, in pyarrow.lib.Schema.from_pandas
  File "/work/arrow/pyarrow-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 562, in dataframe_to_types
    type_ = pa.array(c, from_pandas=True).type
  File "pyarrow/array.pxi", line 360, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 87, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long
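For what it's worth, the conversion itself handles the value once the type is given explicitly, so this seems to be purely an inference problem. A possible workaround in the meantime; the explicit pa.list_(pa.uint64()) type here is my suggestion, not something from the report above.

import pyarrow as pa
import pandas as pd

df = pd.DataFrame([{"a": [pow(2, 63)]}])
# Bypassing inference with an explicit type succeeds.
arr = pa.array(df["a"], type=pa.list_(pa.uint64()), from_pandas=True)
print(arr.type)  # list<item: uint64>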

Is this something that can be changed, or would that likely have too many unintended consequences?

I've tested this with pyarrow 19.0.0 on Ubuntu 24.04.

Component(s)

C++
