Parquet: Remove the row position since parquet row group has it natively #6056
Looks like the |
@flyrain Do you mean Parquet doesn't give the right offset? I remember there is a fix (apache/parquet-java#978) for this, but I'm not sure whether it is already in 1.12.3. |
Yes, that's exactly why our test
|
So this change needs to wait until there is a parquet-mr release with a fix for PARQUET-2161, right? |
Yes.
I assume there is only one reader initialized in the test you added, which can set the right row offset since it reads from the beginning of the parquet file, instead of from the middle. |
Ah, in a real-world situation, we could have a Parquet file with multiple row groups that is split among multiple scan tasks and has deletes to apply. In that case the read would be incorrect, since some splits read the file from the middle rather than from the beginning. |
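To make the split scenario concrete, here is a minimal sketch of how file-level row positions can be derived from per-row-group row counts. It is not the Iceberg or Parquet implementation; the class name `RowGroupOffsets`, the method `startPositions`, and the row counts are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only: not Iceberg or Parquet code.
class RowGroupOffsets {
  // Returns, for each row group, the file-level position of its first row,
  // computed as the running total of the row counts of all earlier row groups.
  static long[] startPositions(long[] rowCounts) {
    long[] starts = new long[rowCounts.length];
    long next = 0;
    for (int i = 0; i < rowCounts.length; i++) {
      starts[i] = next;
      next += rowCounts[i];
    }
    return starts;
  }

  public static void main(String[] args) {
    // Assumed example: a file with three row groups of 100, 50, and 25 rows.
    long[] starts = startPositions(new long[] {100, 50, 25});
    System.out.println(Arrays.toString(starts)); // prints [0, 100, 150]
    // A scan task whose split covers only the third row group must number its
    // rows from 150. If it started from 0, position deletes would be applied
    // to the wrong rows.
  }
}
```

A reader that starts at the beginning of the file gets these positions right even when it simply counts rows as it goes, which is why a single-reader test can pass while a mid-file split fails.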
Do we trust this value from Parquet? |
The approach Parquet uses is similar to what @chenjunjiedada implemented in the Iceberg repo. As long as it is reliable (no bugs), I don't see a reason not to trust it. By using it, we don't have to calculate the value again since the Parquet library already does, and it also makes the Iceberg code base a bit cleaner. |
We can reconsider this after PR #7301 is merged. |
@flyrain @chenjunjiedada do we want to revisit this since we're on Parquet 1.13.1? |
@Fokko, yes, we can revisit this with the new Parquet release. It has the change Iceberg requires, so it should be safe and clean. That said, I'm not sure the issue mentioned by @ricardopereira33 is related. Hi @ricardopereira33, can you elaborate a bit? |
Hi @flyrain ! I added all the details in this Slack thread. Can you see it? |
Thanks for sharing, @ricardopereira33. It looks like the issue is due to a bug in Parquet 1.12.3 and is not directly related to this PR. It'd be nice to have this PR in, though. |
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Since apache/parquet-java@c7bff51 shipped in Parquet 1.12.3, the Parquet row group provides the row index offset natively, so we no longer need to calculate it in Iceberg.
cc @chenjunjiedada @rdblue @wypoon @aokolnychyi