PARQUET-2161: Fix row index generation in combination with range filtering #978

ala · 2022-06-20T11:11:36Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2161
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
- Extends TestParquetReader suite.

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and classes in the PR contain Javadoc that explains what they do

The row indexes introduced in PARQUET-2117 are not computed correctly when (1) a range or offset metadata filter is applied, and (2) the first row group is eliminated by the filter. For example, if a file has two row groups with 10 rows each and we attempt to read only the 2nd row group, we produce row indexes 0, 1, 2, ..., 9 instead of the expected 10, 11, ..., 19. This happens because the functions `filterFileMetaDataByStart` and `filterFileMetaDataByMidpoint` modify their input `FileMetaData`. To return the correct result, `generateRowGroupOffsets` has to be called before these filters are applied.
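The ordering problem can be illustrated with a small, self-contained sketch. This is not the parquet-mr code: `RowGroup`, `firstRowIndexes`, and `dropLeadingGroups` are hypothetical stand-ins, and the filter here simply removes leading row groups in place to mimic a filter that mutates its input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class RowIndexOffsetSketch {

    /** Hypothetical stand-in for a row group; only the row count matters here. */
    static final class RowGroup {
        final long rowCount;

        RowGroup(long rowCount) {
            this.rowCount = rowCount;
        }
    }

    /** Maps each row group to the file-level index of its first row (running sum of row counts). */
    static Map<RowGroup, Long> firstRowIndexes(List<RowGroup> groups) {
        Map<RowGroup, Long> result = new IdentityHashMap<>();
        long rowsSoFar = 0;
        for (RowGroup group : groups) {
            result.put(group, rowsSoFar);
            rowsSoFar += group.rowCount;
        }
        return result;
    }

    /** Hypothetical filter: drops the first {@code skip} row groups, mutating its input. */
    static void dropLeadingGroups(List<RowGroup> groups, int skip) {
        groups.subList(0, skip).clear();
    }

    /** Problematic order: filter first, then compute offsets from what is left. */
    static long firstRowIndexFilterFirst(List<RowGroup> file, int skip) {
        List<RowGroup> groups = new ArrayList<>(file);
        dropLeadingGroups(groups, skip);
        return firstRowIndexes(groups).get(groups.get(0));
    }

    /** Corrected order: compute offsets over all row groups, then filter. */
    static long firstRowIndexOffsetsFirst(List<RowGroup> file, int skip) {
        List<RowGroup> groups = new ArrayList<>(file);
        Map<RowGroup, Long> offsets = firstRowIndexes(groups);
        dropLeadingGroups(groups, skip);
        return offsets.get(groups.get(0));
    }

    public static void main(String[] args) {
        // Two row groups with 10 rows each; read only the 2nd one.
        List<RowGroup> file = Arrays.asList(new RowGroup(10), new RowGroup(10));
        System.out.println(firstRowIndexFilterFirst(file, 1));  // prints 0
        System.out.println(firstRowIndexOffsetsFirst(file, 1)); // prints 10
    }
}
```

With the two-row-group file from the example, the filter-first order reports 0 as the first row index of the surviving row group, while the offsets-first order reports the expected 10.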

ala · 2022-06-23T10:53:54Z

cc @shangxinli This is a small follow-up bug fix for #945

ala · 2022-06-24T15:55:18Z

cc @ggershinsky

ggershinsky · 2022-06-28T12:50:59Z

Yep, I remember reviewing that PR. @prakharjain09 , can you also have a look at this fix?

parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java

chenjunjiedada · 2022-06-29T13:06:38Z

This looks correct to me. The logic also exists in the iceberg row position reader. See: apache/iceberg#1254 (comment).

ggershinsky · 2022-06-29T13:12:29Z

Thanks @chenjunjiedada .
@ala , please handle the message comment, and I'll merge this PR.

ala · 2022-06-29T15:48:20Z

@ggershinsky Thanks for the review. I tweaked the error assertion message to better match the rest of the codebase.

prakharjain09

Thanks a lot for fixing this issue @ala . Changes look good to me.

ggershinsky · 2022-06-29T19:21:31Z

Thanks @ala

ala · 2022-07-20T09:40:55Z

@ggershinsky Do you know when the next release that will include the fix might happen? We are looking to unblock https://issues.apache.org/jira/browse/SPARK-39634 in Apache Spark.

ggershinsky · 2022-07-26T06:57:27Z

cc @shangxinli

ala · 2022-10-17T15:34:28Z

@ggershinsky @shangxinli Hi! I just wanted to ask if 1.12.4 release might be happening soon (it seems in the previous years there usually was a release around September-October time)? We could really use the fix in Spark. Also: do I need to cherry-pick this fix, or would the next release be cut from master?

shangxinli · 2022-10-17T17:08:24Z

@ala Thanks for pinging me! At this moment, I don't have an ETA yet.

ala force-pushed the row-idx-fix branch from 373ebd8 to b52ae80 on June 20, 2022 11:12

chenjunjiedada reviewed Jun 29, 2022

parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java (outdated)

chenjunjiedada approved these changes Jun 29, 2022

Adjust assert message

a00d1d6

prakharjain09 approved these changes Jun 29, 2022

ggershinsky merged commit 5290bd5 into apache:master Jun 29, 2022

ala deleted the row-idx-fix branch July 20, 2022 09:39

chenjunjiedada mentioned this pull request Oct 26, 2022

Parquet: Remove the row position since parquet row group has it natively apache/iceberg#6056

Closed
