Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2161: Fix row index generation in combination with range filtering #978

Merged
merged 2 commits into from
Jun 29, 2022

Conversation

ala
Copy link
Contributor

@ala ala commented Jun 20, 2022

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • Extends TestParquetReader suite.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

The row indexes introduced in PARQUET-2117 are not computed correctly
when:
(1) range or offset metadata filter is applied, and
(2) the first row group was eliminated by the filter

For example, if a file has two row groups with 10 rows each, and we
attempt to only read the 2nd row group, we are going to produce row
indexes 0, 1, 2, ..., 9 instead of expected 10, 11, ..., 19.

This happens because functions `filterFileMetaDataByStart`
and `filterFileMetaDataByMidpoint` modify their input `FileMetaData`.
To return correct result, `generateRowGroupOffsets` has to be computed
before these filters are applied.
@ala
Copy link
Contributor Author

ala commented Jun 23, 2022

cc @shangxinli This is a small follow-up bug fix for #945

@ala
Copy link
Contributor Author

ala commented Jun 24, 2022

cc @ggershinsky

@ggershinsky
Copy link
Contributor

Yep, I remember reviewing that PR. @prakharjain09 , can you also have a look at this fix?

@chenjunjiedada
Copy link
Contributor

This looks correct to me. The logic also exists in the iceberg row position reader. See: apache/iceberg#1254 (comment).

@ggershinsky
Copy link
Contributor

Thanks @chenjunjiedada .
@ala , please handle the message comment, and I'll merge this PR.

@ala
Copy link
Contributor Author

ala commented Jun 29, 2022

@ggershinsky Thanks for the review. I tweaked the error assertion message to better match the rest of the codebase.

Copy link
Contributor

@prakharjain09 prakharjain09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for fixing this issue @ala . Changes looks good to me.

@ggershinsky
Copy link
Contributor

Thanks @ala

@ggershinsky ggershinsky merged commit 5290bd5 into apache:master Jun 29, 2022
@ala ala deleted the row-idx-fix branch July 20, 2022 09:39
@ala
Copy link
Contributor Author

ala commented Jul 20, 2022

@ggershinsky Do you know when the next release that will include the fix might happen? We are looking to unblock https://issues.apache.org/jira/browse/SPARK-39634 in Apache Spark.

@ggershinsky
Copy link
Contributor

cc @shangxinli

@ala
Copy link
Contributor Author

ala commented Oct 17, 2022

@ggershinsky @shangxinli Hi! I just wanted to ask if 1.12.4 release might be happening soon (it seems in the previous years there usually was a release around September-October time)? We could really use the fix in Spark. Also: do I need to cherry-pick this fix, or would the next release be cut from master?

@shangxinli
Copy link
Contributor

@ala Thanks for pinging me! At this moment, I don't have ETA yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants