Reading nested parquet files results in index out of bounds #1383

Closed
andrei-ionescu opened this issue Nov 29, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@andrei-ionescu

andrei-ionescu commented Nov 29, 2021

Describe the bug

Reading nested parquet files results in an index out of bounds error, as seen below:

thread 'main' panicked at 'index out of bounds: the len is 8 but the index is 8', /Users/xxxx/.cargo/registry/
    src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13

To Reproduce

  1. Download attached zipped parquet file and unzip it: nested_schema_1row.parquet.zip
  2. Place it in a ./data folder
  3. Execute the following code:
let mut ctx = ExecutionContext::new(); 
let df = ctx.read_parquet("./data/nested_schema_1row.parquet").await?;
df.show().await
  4. The result is an index out of bounds panic:
thread 'main' panicked at 'index out of bounds: the len is 8 but the index is 8', /Users/xxxx/.cargo/registry/
    src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
   2: core::panicking::panic_bounds_check
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:75:5
   3: <usize as core::slice::index::SliceIndex<[T]>>::index_mut
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:190:14
   4: core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:26:9
   5: <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2540:9
   6: datafusion::datasource::file_format::parquet::fetch_metadata
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
   7: <datafusion::datasource::file_format::parquet::ParquetFormat as datafusion::datasource::file_format::FileFormat>::infer_schema::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:96:27
   8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/future.rs:119:9
  10: datafusion::datasource::listing::table::ListingOptions::infer_schema::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/listing/table.rs:99:27
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  12: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet_with_name::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:287:31
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  14: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:255:9
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  16: datafusion::execution::context::ExecutionContext::read_parquet::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/execution/context.rs:403:13
  17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  18: read_parquet::main::{{closure}}
             at ./src/main.rs:79:14
  19: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  20: tokio::park::thread::CachedParkThread::block_on::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:54
  21: tokio::coop::with_budget::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:106:9
  22: std::thread::local::LocalKey<T>::try_with
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:399:16
  23: std::thread::local::LocalKey<T>::with
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:375:9
  24: tokio::coop::with_budget
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:99:5
  25: tokio::coop::budget
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:76:5
  26: tokio::park::thread::CachedParkThread::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:31
  27: tokio::runtime::enter::Enter::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/enter.rs:151:13
  28: tokio::runtime::thread_pool::ThreadPool::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/thread_pool/mod.rs:77:9
  29: tokio::runtime::Runtime::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/mod.rs:463:43
  30: read_parquet::main
             at ./src/main.rs:80:5
  31: core::ops::function::FnOnce::call_once
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/ops/function.rs:227:5

Expected behavior

The parquet file should be read without a panic.

Additional context

After debugging the issue a bit, I found that the error happens in the fetch_statistics function. More precisely, schema.fields().len() (datasource/file_format/parquet.rs#L261) returns only the top-level fields, while row_group_meta.columns() (datasource/file_format/parquet.rs#L276-L277) returns all leaves. A buffer sized by the former and indexed by the latter would overflow as soon as there are more leaves than top-level fields, which is consistent with the "len is 8 but the index is 8" panic.

For the given parquet file, there are 8 top-level fields and about 262 leaves.
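
The mismatch described above can be illustrated with a minimal, dependency-free sketch. The `Field` enum and all three functions are hypothetical stand-ins invented for this illustration; they are not the Arrow, parquet, or DataFusion types, and the real code may differ in detail. The sketch only shows why a buffer sized by the number of top-level fields cannot be indexed by leaf position once the schema is nested.

```rust
// Hypothetical, simplified stand-in for a nested schema (not the Arrow types).
enum Field {
    Leaf,
    Struct(Vec<Field>),
}

/// Number of top-level fields, analogous to `schema.fields().len()`.
fn top_level_count(schema: &[Field]) -> usize {
    schema.len()
}

/// Number of leaf columns, analogous to the length of `row_group_meta.columns()`.
fn leaf_count(schema: &[Field]) -> usize {
    schema
        .iter()
        .map(|f| match f {
            Field::Leaf => 1,
            Field::Struct(children) => leaf_count(children),
        })
        .sum()
}

/// Adds one null count per leaf into a buffer sized per top-level field.
/// `get_mut` is used so the size mismatch is reported as an error
/// instead of panicking the way direct indexing would.
fn accumulate(schema: &[Field], leaf_null_counts: &[usize]) -> Result<Vec<usize>, String> {
    let len = top_level_count(schema);
    let mut totals = vec![0usize; len];
    for (i, cnt) in leaf_null_counts.iter().enumerate() {
        match totals.get_mut(i) {
            Some(slot) => *slot += *cnt,
            None => return Err(format!("index {} out of bounds for len {}", i, len)),
        }
    }
    Ok(totals)
}
```

With a flat schema the two counts agree and `accumulate` succeeds; with any struct that has more than one leaf, `leaf_count` exceeds `top_level_count` and `accumulate` returns the error instead of a result.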

DataFusion is 6.0
Rust is 1.58.0-nightly (65c55bf93 2021-11-23)
Cargo is 1.58.0-nightly (e1fb17631 2021-11-22)

andrei-ionescu added the "bug" (Something isn't working) label on Nov 29, 2021
andrei-ionescu changed the title from "Reading wide and nested parquet files results in index out of bounds" to "Reading nested parquet files results in index out of bounds" on Nov 29, 2021
@AdheipSingh

I am also getting a similar error. I have an Arrow record batch built from nested JSON, and I am converting it from Arrow to parquet files.

Somehow I am not able to query the nested JSON in the parquet file.

@andrei-ionescu
Author

andrei-ionescu commented Nov 30, 2021

There is another fact: DataFusion has its own parquet reader - it does NOT use the Arrow-RS/Parquet native implementation. I have no idea why it is so.

@houqp
Member

houqp commented Dec 2, 2021

I think this can be fixed with a quick and dirty workaround when we iterate through row_group_meta.columns() and only count top level fields by looking at repetition level.

andrei-ionescu wrote: "There is another fact: DataFusion has its own parquet reader - it does NOT use the Arrow-RS/Parquet native implementation. I have no idea why it is so."

We are using the parquet crate you linked right now.
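
As a rough illustration of the workaround suggested above (keeping only top-level fields while iterating the leaf columns), here is a dependency-free sketch. It is not the DataFusion fix. `LeafColumn` and `top_level_null_counts` are hypothetical names, and instead of inspecting repetition levels the sketch uses a simpler stand-in: it assumes each leaf carries a dotted column path and treats a leaf as top-level when its path contains no dot. Field names that themselves contain a dot would be misclassified under that assumption.

```rust
/// Hypothetical leaf-column descriptor: a dotted column path plus a null count.
struct LeafColumn<'a> {
    path: &'a str, // e.g. "address.city" for a nested leaf, "id" for a top-level one
    null_count: usize,
}

/// Returns one entry per top-level field.
/// Fields that are themselves leaves get their null count;
/// nested fields stay `None` because their leaves are skipped.
fn top_level_null_counts(top_level: &[&str], leaves: &[LeafColumn]) -> Vec<Option<usize>> {
    let mut totals: Vec<Option<usize>> = vec![None; top_level.len()];
    for leaf in leaves {
        // Assumption: a dotted path means the leaf lives inside a nested field.
        if leaf.path.contains('.') {
            continue;
        }
        // Look the field up by name instead of trusting the leaf's position.
        if let Some(i) = top_level.iter().position(|name| *name == leaf.path) {
            totals[i] = Some(totals[i].unwrap_or(0) + leaf.null_count);
        }
    }
    totals
}
```

Because slots are found by name rather than by leaf position, the buffer is never indexed past its length, and repeated entries for the same column (for example, one per row group) simply add up.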

@andrei-ionescu
Author

@houqp I can try to take a swing at this issue.

@andrei-ionescu
Author

andrei-ionescu commented Dec 2, 2021

I've been looking at the source code and it seems that the statistics are taken into account only for the top-level columns. In the majority of places I see schema.fields(), where schema is the Arrow schema and fields() returns only top-level fields.

Parquet, on the other hand, has statistics for all columns, regardless of the nesting level.

I do understand the "quick and dirty workaround", and regarding it I have the following questions:

  1. Does the DataFusion query engine use the column stats?
  2. What about writing parquet files? Shouldn't those files be written with stats for all columns?

@tustvold
Contributor

This appears to work correctly now. I suspect it was fixed by apache/arrow-rs#1588.

4 participants