Describe the bug, including details regarding any error messages, version, and platform.
Our code primarily uses polars, but external tools use pandas. When those tools import parquet files containing categorical columns with unsigned integer index types (uint16 and uint32), we get the error:
ArrowTypeError: Converting unsigned dictionary indices to pandas not yet supported, index type: uint32
Simple repro below.
import polars as pl
import pyarrow as pa

n = 100
cat_values = [f"cat_{i}" for i in range(n)]
df = pl.DataFrame({
    "cat": cat_values,
    "val": list(range(n)),
})
arrow_table = df.to_arrow()

# Re-encode the "cat" column as a dictionary with unsigned (uint16) indices.
dict_type = pa.dictionary(index_type=pa.uint16(), value_type=pa.string())
arrow_table = arrow_table.set_column(
    arrow_table.schema.get_field_index("cat"),
    "cat",
    arrow_table.column("cat").cast(dict_type),
)
print("Arrow schema:", arrow_table.schema)

# pyarrow -> pandas: raises ArrowTypeError for unsigned dictionary indices.
try:
    pdf = arrow_table.to_pandas()
    print("Loaded into pandas successfully.")
except Exception as e:
    print("Failed to load into pandas:")
    print(e)

# pyarrow -> polars: the same table loads without error.
try:
    pol_df = pl.from_arrow(arrow_table)
    print("Loaded into Polars successfully.")
except Exception as e:
    print("Failed to load into Polars:")
    print(e)
Finally, I wasn't sure whether to file this as a feature request or a bug, since the functionality is missing rather than incorrect.
Component(s)
Python