Describe the bug
As the title says, the ParquetRecordBatchReader
cannot recognize the duration type in parquet files written by pandas or polars.
To Reproduce
First, prepare a parquet file:
import polars as pl
from datetime import timedelta

df = pl.DataFrame({
    "a": [timedelta(days=1) for _ in range(100)]
})
df.write_parquet("./test.parquet")
Then, read it back in Rust with arrow-rs:
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::errors::Result;

fn main() -> Result<()> {
    // Open the parquet file written above.
    let path = "./test.parquet";
    let file = File::open(path).unwrap();
    let parquet_reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;
    let mut batches = Vec::new();
    for batch in parquet_reader {
        batches.push(batch?);
    }
    println!("{:#?}", batches[0].schema());
    Ok(())
}
Finally, we get the following schema:
Schema {
    fields: [
        Field {
            name: "a",
            data_type: Int64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
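Until the reader maps the type correctly, the raw Int64 values can be reinterpreted on the consumer side. A minimal Python sketch, assuming the microsecond unit that polars reports (duration[μs]); the value 86_400_000_000 is one day expressed in microseconds:

```python
from datetime import timedelta

# One day stored as duration[us] comes back as this raw Int64 value.
raw_us = 86_400_000_000

# Reinterpret the integer as a duration, assuming microsecond resolution.
restored = timedelta(microseconds=raw_us)
assert restored == timedelta(days=1)
```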
Expected behavior
The reader should report a duration type for column "a", as polars and pandas do when reading the same file.
polars result:
shape: (100, 1)
┌──────────────┐
│ a │
│ --- │
│ duration[μs] │
╞══════════════╡
│ 1d │
│ 1d │
│ 1d │
│ 1d │
│ 1d │
│ … │
│ 1d │
│ 1d │
│ 1d │
│ 1d │
│ 1d │
└──────────────┘
pandas result:
a
0 1 days
1 1 days
2 1 days
3 1 days
4 1 days
.. ...
95 1 days
96 1 days
97 1 days
98 1 days
99 1 days
[100 rows x 1 columns]
Additional context