Reading nested parquet files results in index out of bounds
#1383
Comments
I am also getting a similar error. I have a nested JSON Arrow record batch and am converting it from Arrow to Parquet files. Somehow I am not able to query the nested JSON in the resulting Parquet file. |
There is another fact: DataFusion has its own parquet reader - it does NOT use the Arrow-RS/Parquet native implementation. I have no idea why it is so. |
I think this can be fixed with a quick and dirty workaround when we iterate through
We are using the parquet crate you linked right now. |
@houqp I can try to take a swing at this issue. |
I've been looking at the source code, and in the majority of places I see, statistics are taken into account only for the top-level columns. Parquet, on the other hand, has statistics for all columns, regardless of the nesting level. I do understand the "quick and dirty workaround", and regarding it I have the following questions:
|
This appears to now work correctly, I suspect it was fixed by apache/arrow-rs#1588 |
Describe the bug
Reading nested parquet files results in an index out of bounds error, as seen below:
To Reproduce
./data folder
index out of bounds panic
Expected behavior
To properly read the parquet file.
Additional context
After debugging the issue a bit, I found that the error happens in the fetch_statistics function. To be more precise, the schema.fields().len() construct (datasource/file_format/parquet.rs#L261) returns only the top-level fields, while row_group_meta.columns() (datasource/file_format/parquet.rs#L276-L277) returns all leaves.
In the context of the given parquet file, there are 8 top-level fields and about 262 leaves.
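To illustrate the mismatch described above, here is a minimal, self-contained sketch. It does not use the Arrow or DataFusion types: `Field`, `leaf_count`, and `accumulate_null_counts` are hypothetical names invented for this example. It shows how a buffer sized by the number of top-level fields is too short when indexed by leaf column position, and how a bounds-checked access avoids the panic. The guard only prevents the crash; it does not produce correct statistics for nested columns, because a leaf index is not a top-level field index.

```rust
// Hypothetical, simplified stand-in for a nested schema.
// This is NOT the Arrow or DataFusion schema type.
enum Field {
    Leaf,
    Struct(Vec<Field>),
}

/// Counts leaf columns, which is what Parquet row-group metadata enumerates.
fn leaf_count(fields: &[Field]) -> usize {
    fields
        .iter()
        .map(|f| match f {
            Field::Leaf => 1,
            Field::Struct(children) => leaf_count(children),
        })
        .sum()
}

/// Adds per-leaf null counts into a buffer sized by top-level fields.
/// Leaf indices outside the buffer are skipped instead of panicking.
/// Returns how many leaves were skipped.
fn accumulate_null_counts(null_counts: &mut [usize], per_leaf: &[usize]) -> usize {
    let mut skipped = 0;
    for (i, n) in per_leaf.iter().enumerate() {
        match null_counts.get_mut(i) {
            Some(slot) => *slot += *n,
            // Plain indexing (`null_counts[i]`) would panic here
            // with "index out of bounds".
            None => skipped += 1,
        }
    }
    skipped
}

fn main() {
    // Two top-level fields: one flat column and one struct with three leaves.
    let schema = vec![
        Field::Leaf,
        Field::Struct(vec![Field::Leaf, Field::Leaf, Field::Leaf]),
    ];
    let top_level = schema.len();
    let leaves = leaf_count(&schema);
    println!("top-level fields: {}, leaves: {}", top_level, leaves);

    let mut null_counts = vec![0usize; top_level];
    let per_leaf = vec![1usize; leaves];
    let skipped = accumulate_null_counts(&mut null_counts, &per_leaf);
    println!("skipped {} leaf columns", skipped);
}
```

With the file from this report, the same shape of mismatch would be 8 buffer slots against about 262 leaf columns.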
DataFusion is 6.0
Rust is 1.58.0-nightly (65c55bf93 2021-11-23)
Cargo is 1.58.0-nightly (e1fb17631 2021-11-22)