Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParquetRecordBatchStreamBuilder::new() panics instead of erroring out when opening a corrupted file #5315

Closed
mmaitre314 opened this issue Jan 20, 2024 · 2 comments · Fixed by #5362
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted parquet Changes to the parquet crate

Comments

@mmaitre314
Copy link
Contributor

When opening a corrupted Parquet file where one of the row groups is missing a column, ParquetRecordBatchStreamBuilder::new() panics instead of returning an error:

 A panic occurred at (...)\parquet-50.0.0\src\file\metadata.rs:352: assertion `left == right` failed
  left: 2
 right: 1

This is due to the RowGroupMetaData::from_thrift() validating column counts using the assert_eq!() macro:

assert_eq!(schema_descr.num_columns(), rg.columns.len());

Could the check be replaced by an if statement returning an error? If that sounds OK, I can prepare a PR for review.

As workaround, it is possible to unwind the panic and handle it almost like a regular error. Unwinding async code is not straightforward though and it would be simpler to allow consistent error handling.

    let async_panic_handler = AssertUnwindSafe(async { ParquetRecordBatchStreamBuilder::new(blob_reader).await }).catch_unwind().await;
    if async_panic_handler.is_err() {
        info!("Skipping Blob (invalid metadata)");
        return Ok(())
    }
    let parquet_builder = async_panic_handler.unwrap()?;
@mmaitre314 mmaitre314 added the bug label Jan 20, 2024
@tustvold
Copy link
Contributor

tustvold commented Jan 20, 2024

Changing to an error seems straightforward enough, but I will point out the actual thrift decoding logic can also panic.

FWIW if using tokio, it already intercepts and handles panics for you.

#3279

@tustvold tustvold added good first issue Good for newcomers enhancement Any new improvement worthy of a entry in the changelog help wanted and removed bug labels Jan 20, 2024
@tustvold tustvold added the parquet Changes to the parquet crate label Mar 1, 2024
@tustvold
Copy link
Contributor

tustvold commented Mar 1, 2024

label_issue.py automatically added labels {'parquet'} from #5362

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted parquet Changes to the parquet crate
Projects
None yet
2 participants