Parquet: Remove the row position since parquet row group has it natively #6056

Closed
flyrain wants to merge 2 commits


flyrain
Contributor

@flyrain flyrain commented Oct 25, 2022

Since apache/parquet-java@c7bff51 shipped in Parquet 1.12.3, the Parquet row group provides the row index offset natively, so we no longer need to calculate it in Iceberg.

cc @chenjunjiedada @rdblue @wypoon @aokolnychyi

@flyrain
Contributor Author

flyrain commented Oct 26, 2022

Looks like the RowIndexOffset won't be set correctly if we read from the middle of a Parquet file. It will start from 0 no matter where we start to read.

@chenjunjiedada
Collaborator

Looks like the RowIndexOffset won't be set correctly if we read from the middle of a Parquet file. It will start from 0 no matter where we start to read.

@flyrain Do you mean Parquet doesn't give the right offset? I remember there is a fix (apache/parquet-java#978) for this, but I'm not sure whether it is already in 1.12.3.

@flyrain
Contributor Author

flyrain commented Oct 26, 2022

Yes, that's exactly why our test testReadRowNumbersWithSplits failed. apache/parquet-java#978 is not in 1.12.3 (the latest Parquet release). We may have to wait until it is released.

For example, if a file has two row groups with 10 rows each and we read only the 2nd row group, we produce row indexes 0, 1, 2, ..., 9 instead of the expected 10, 11, ..., 19.
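To make the example above concrete, here is a minimal, self-contained Java sketch. It is not Iceberg or Parquet code: `RowPositionSketch` and its methods are hypothetical names used only for illustration. It assumes the start position of a row group is the sum of the row counts of all row groups before it in the file, and it mimics the reported behavior by running the same computation over only the row groups left after split filtering.

```java
import java.util.Arrays;

// Hypothetical illustration only; these names are not Iceberg or Parquet APIs.
final class RowPositionSketch {
  private RowPositionSketch() {}

  /** Start position of each row group: the number of rows in all row groups before it. */
  static long[] rowGroupStartPositions(long[] rowCounts) {
    long[] starts = new long[rowCounts.length];
    long next = 0;
    for (int i = 0; i < rowCounts.length; i++) {
      starts[i] = next;
      next += rowCounts[i];
    }
    return starts;
  }

  /** File-level positions of the rows in a single row group. */
  static long[] positionsInRowGroup(long start, long rowCount) {
    long[] positions = new long[(int) rowCount];
    for (int i = 0; i < positions.length; i++) {
      positions[i] = start + i;
    }
    return positions;
  }

  public static void main(String[] args) {
    long[] rowCounts = {10, 10};

    // Computed over every row group in the file: the 2nd group starts at 10.
    long[] starts = rowGroupStartPositions(rowCounts);
    System.out.println(Arrays.toString(positionsInRowGroup(starts[1], rowCounts[1])));

    // Computed over only the row group selected by the split: the count restarts at 0.
    long[] filtered = {rowCounts[1]};
    long[] restarted = rowGroupStartPositions(filtered);
    System.out.println(Arrays.toString(positionsInRowGroup(restarted[0], filtered[0])));
  }
}
```

Running `main` prints positions 10 through 19 for the first case and 0 through 9 for the second, which is the mismatch described above.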

https://issues.apache.org/jira/browse/PARQUET-2161

@wypoon
Contributor

wypoon commented Oct 26, 2022

So this change needs to wait until there is a parquet-mr release with a fix for PARQUET-2161, right?
Just so I understand, for our usage, the bug manifests when Parquet.ReadBuilder#split is called (and thus when Parquet.ReadBuilder#build is called, the VectorizedParquetReader or ParquetReader is constructed with a ParquetReadOptions that has a ParquetMetadataConverter.RangeMetadataFilter), right? Why doesn't it affect the test I added in TestSparkReaderDeletes?

@flyrain
Contributor Author

flyrain commented Oct 26, 2022

So this change needs to wait until there is a parquet-mr release with a fix for PARQUET-2161, right?

Yes.

Why doesn't it affect the test I added in TestSparkReaderDeletes?

I assume there is only one reader initialized in the test you added, which can set the right row offset since it reads from the beginning of the parquet file, instead of from the middle.

@wypoon
Contributor

wypoon commented Oct 26, 2022

Why doesn't it affect the test I added in TestSparkReaderDeletes?

I assume there is only one reader initialized in the test you added, which can set the right row offset since it reads from the beginning of the parquet file, instead of from the middle.

Ah, so in a real-world situation we could have a Parquet file with multiple row groups that is split among multiple scan tasks and has deletes to be applied. In that case the read will be incorrect, since some splits will be reading the file from the middle.
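A rough sketch of why that matters for deletes, assuming position deletes are matched against a row's position within the whole file. The class and method names here are hypothetical, not Iceberg APIs. The file is imagined as two row groups of three rows each (positions 0 to 5), with a delete recorded for position 4, and the second scan task reads only the second row group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not how Iceberg actually applies deletes.
final class PositionDeleteSketch {
  private PositionDeleteSketch() {}

  /** Keeps the rows whose file-level position is not in the delete set. */
  static List<String> applyDeletes(
      List<String> rows, long firstRowPosition, Set<Long> deletedPositions) {
    List<String> kept = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      if (!deletedPositions.contains(firstRowPosition + i)) {
        kept.add(rows.get(i));
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Rows of the 2nd row group; their true file positions are 3, 4, 5.
    List<String> secondSplit = Arrays.asList("b0", "b1", "b2");
    Set<Long> deleted = Collections.singleton(4L);

    // Correct offset: position 4 maps to "b1", which is dropped.
    System.out.println(applyDeletes(secondSplit, 3L, deleted));

    // Offset restarted at 0: nothing matches position 4, so "b1" is kept.
    System.out.println(applyDeletes(secondSplit, 0L, deleted));
  }
}
```

With the correct offset the output is `[b0, b2]`; with the offset restarted at 0 it is `[b0, b1, b2]`, so the deleted row reappears in the result.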

@rdblue
Contributor

rdblue commented Nov 6, 2022

Do we trust this value from Parquet?

@flyrain
Contributor Author

flyrain commented Nov 7, 2022

Do we trust this value from Parquet?

The approach Parquet uses is similar to what @chenjunjiedada implemented in the Iceberg repo. As long as it is reliable (no bugs), I don't see a reason not to trust it. Using it means we don't have to calculate the value again, since the Parquet library already does, and it also makes the Iceberg code base a bit cleaner.

@flyrain
Contributor Author

flyrain commented May 17, 2023

We can reconsider this after PR #7301 is merged.

@ricardopereira33

Hi @flyrain @wypoon !

Are there any updates regarding this issue? We have a case where we write with Trino and then run data file compaction (or even just read a file), and Spark cannot read the file because the offsets are not within the range. Example:
[Image: Screenshot 2023-11-06 at 09 45 59]

Code section: https://github.com/apache/parquet-mr/blob/0a066d8a5c71386e56dee7bd7a21170b27e4283a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1287

@Fokko
Contributor

Fokko commented Nov 10, 2023

@flyrain @chenjunjiedada do we want to revisit this since we're on Parquet 1.13.1?

@flyrain
Contributor Author

flyrain commented Nov 10, 2023

@Fokko, yes, we can revisit this with the new Parquet release. It has the change Iceberg requires, so it should be safe and clean. However, I'm not sure the issue mentioned by @ricardopereira33 is related. Hi @ricardopereira33, can you elaborate a bit?

@ricardopereira33

Hi @flyrain ! I added all the details in this Slack thread. Can you see it?

@flyrain
Contributor Author

flyrain commented Nov 13, 2023

Thanks for sharing @ricardopereira33. Looks like the issue is due to a bug in Parquet 1.12.3, not directly related to this PR. It'd be nice to have this PR in though.


This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 23, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 11, 2024

6 participants