-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix regression with Incorrect results when reading parquet files with different schemas and statistics #8533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
de3fc46 to
df9d082
Compare
df9d082 to
5633b43
Compare
| ParquetRecordBatchStreamBuilder::new_with_options(reader, options) | ||
| .await?; | ||
|
|
||
| let file_schema = builder.schema().clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought giving builder.schema() a name made the code clearer.
| &predicate, | ||
| builder.schema().as_ref(), | ||
| table_schema.as_ref(), | ||
| &file_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive by cleanup -- no functioanl change
| parquet_schema, | ||
| row_group_metadata: metadata, | ||
| arrow_schema: predicate.schema().as_ref(), | ||
| arrow_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the actual fix -- to use the file schema rather than the table schema to lookup columns
| } | ||
|
|
||
| #[test] | ||
| fn row_group_pruning_predicate_simple_expr() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I updated the signature of prune_row_groups_by_statistics I also had to update the tests appropriately
|
|
||
| # Should see all 7 rows that have 'a=foo' | ||
| query TIR rowsort | ||
| select * from parquet_table where a = 'foo'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query only returns 3 rows (not all 7) without this fix. You can see the differences here: 5633b43
| ########## | ||
|
|
||
|
|
||
| statement ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really loving how easy it is to write end to end tests with files after all the work @devinjdangelo did for parallel partitioned writes ❤️
| let metrics = parquet_file_metrics(); | ||
| assert_eq!( | ||
| prune_row_groups_by_statistics( | ||
| &schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like these are existing tests? Should we add one test of prune_row_groups_by_statistics which schema (file schema) is different to predicate schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good idea -- I will do so
The reason I think an end to end test in .slt is also required is that the bug is related to passing the wrong schema into prune_row_groups_by_statistics -- so I could write a test in terms of prune_row_groups_by_statistics that passes (passes in the right schema) but actual answers would still be wrong because the callsite of prune_row_groups_by_statistics would pass the wrong one
| foo NULL NULL | ||
| foo NULL NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just for other reviewers to easily review the results)
File1:
foo 1 NULL
foo 2 NULL
foo 3 NULL
File2:
NULL 10 NULL
File3:
foo NULL NULL
foo NULL NULL
File4:
foo 100 10.5
foo 200 12.6
bzz 300 13.7
… different schemas and statistics (#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
Which issue does this PR close?
Fixes #8532
Rationale for this change
#8294 introduced a regression which will cause parquet files to be incorrectly pruned in some cases, leading to incorrect query results (some rows are filtered out incorrectly)
Note this regression was not released yet
What changes are included in this PR?
Are these changes tested?
Yes, new end to end .slt coverage is added
Are there any user-facing changes?
Bug fix (for an unreleased bug)