PARQUET-2161: Fix row index generation in combination with range filtering #978
The row indexes introduced in PARQUET-2117 are not computed correctly when (1) a range or offset metadata filter is applied, and (2) the first row group is eliminated by the filter. For example, if a file has two row groups with 10 rows each and we read only the second row group, we produce row indexes 0, 1, 2, ..., 9 instead of the expected 10, 11, ..., 19. This happens because the functions `filterFileMetaDataByStart` and `filterFileMetaDataByMidpoint` modify their input `FileMetaData`. To return the correct result, `generateRowGroupOffsets` has to be computed before these filters are applied.
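To make the ordering issue concrete, here is a minimal, self-contained sketch of the two-row-group example. It is not the parquet-mr code: the `RowGroup` class, the `filterByStart` helper, the byte positions, and the `Map<Long, Long>` return type are all assumptions made for illustration, and the sketch borrows the name `generateRowGroupOffsets` only to mirror the description. Unlike the real filters, `filterByStart` here returns a new list rather than mutating its input, so both orderings can be compared side by side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RowIndexOffsetSketch {

    /** Hypothetical stand-in for row group metadata: file position and row count. */
    static final class RowGroup {
        final long startingPos;
        final long rowCount;

        RowGroup(long startingPos, long rowCount) {
            this.startingPos = startingPos;
            this.rowCount = rowCount;
        }
    }

    /** Maps each row group's starting position to the file-wide index of its first row. */
    static Map<Long, Long> generateRowGroupOffsets(List<RowGroup> rowGroups) {
        Map<Long, Long> offsets = new HashMap<>();
        long rowsSoFar = 0;
        for (RowGroup rg : rowGroups) {
            offsets.put(rg.startingPos, rowsSoFar);
            rowsSoFar += rg.rowCount;
        }
        return offsets;
    }

    /** Hypothetical range filter: keeps row groups that start in [start, end). */
    static List<RowGroup> filterByStart(List<RowGroup> rowGroups, long start, long end) {
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup rg : rowGroups) {
            if (rg.startingPos >= start && rg.startingPos < end) {
                kept.add(rg);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two row groups with 10 rows each; the byte positions are made up.
        List<RowGroup> all = new ArrayList<>();
        all.add(new RowGroup(4L, 10L));
        all.add(new RowGroup(1000L, 10L));

        // Offsets computed over the full, unfiltered metadata.
        Map<Long, Long> offsetsBeforeFilter = generateRowGroupOffsets(all);

        // The filter eliminates the first row group.
        List<RowGroup> filtered = filterByStart(all, 1000L, 2000L);

        // Offsets computed over the already-filtered metadata (the buggy ordering).
        Map<Long, Long> offsetsAfterFilter = generateRowGroupOffsets(filtered);

        System.out.println("first row index, offsets computed before filtering: "
            + offsetsBeforeFilter.get(1000L));
        System.out.println("first row index, offsets computed after filtering: "
            + offsetsAfterFilter.get(1000L));
    }
}
```

Computing the offsets first gives the second row group a starting row index of 10, while computing them after the filter gives 0, which matches the incorrect 0, 1, 2, ..., 9 sequence described above.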
cc @shangxinli This is a small follow-up bug fix for #945.
cc @ggershinsky
Yep, I remember reviewing that PR. @prakharjain09, can you also have a look at this fix?
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
This looks correct to me. The same logic also exists in the Iceberg row position reader. See: apache/iceberg#1254 (comment).
Thanks @chenjunjiedada.
@ggershinsky Thanks for the review. I tweaked the error assertion message to better match the rest of the codebase.
Thanks a lot for fixing this issue @ala. The changes look good to me.
Thanks @ala
@ggershinsky Do you know when the next release that includes the fix might happen? We are looking to unblock https://issues.apache.org/jira/browse/SPARK-39634 in Apache Spark.
cc @shangxinli
@ggershinsky @shangxinli Hi! I just wanted to ask whether the 1.12.4 release might be happening soon (it seems that in previous years there was usually a release around September or October). We could really use the fix in Spark. Also: do I need to cherry-pick this fix, or would the next release be cut from
@ala Thanks for pinging me! At this moment, I don't have an ETA yet.
Make sure you have checked all steps below.
Jira
Tests
TestParquetReader suite.
Commits
Documentation