
[Python] Unable to import arrow table to pandas if it has categorical columns with index types of unsigned ints #47022

@dweih

Describe the bug, including details regarding any error messages, version, and platform.

Our code primarily uses Polars, but some external tools we rely on use pandas. When those tools import Parquet files containing categorical columns whose dictionary index types are unsigned integers (uint16 and uint32), the conversion to pandas fails with the error

ArrowTypeError: Converting unsigned dictionary indices to pandas not yet supported, index type: uint32

Simple repro below.

import polars as pl
import pyarrow as pa

n = 100
cat_values = [f"cat_{i}" for i in range(n)]
df = pl.DataFrame({
    "cat": cat_values,
    "val": list(range(n))
})
arrow_table = df.to_arrow()

dict_type = pa.dictionary(index_type=pa.uint16(), value_type=pa.string())
arrow_table = arrow_table.set_column(
    arrow_table.schema.get_field_index("cat"),
    "cat",
    arrow_table.column("cat").cast(dict_type)
)

print("Arrow schema:", arrow_table.schema)


try:
    # Convert the Arrow table directly; this is the call that raises ArrowTypeError.
    pdf = arrow_table.to_pandas()
    print("Loaded into pandas successfully.")
except Exception as e:
    print("Failed to load into pandas:")
    print(e)

try:
    pol_df = pl.from_arrow(arrow_table)
    print("Loaded into Polars successfully.")
except Exception as e:
    print("Failed to load into Polars:")
    print(e)

Finally, I wasn't sure whether to file this as a feature request or a bug report, because the functionality is missing rather than incorrect.

Component(s)

Python
