Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix when reading struct-type data without an id in iceberg-parquet #11378

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

joyCurry30
Copy link

Fix issue: #11214
Fix when reading struct-type data without an id in iceberg-parquet, it returns null values.

@joyCurry30
Copy link
Author

@nastra Can you help review this PR? Thanks a lot.

@nastra
Copy link
Contributor

nastra commented Oct 24, 2024

@joyCurry30 I'll take a look early next week but this should at least have a test that reproduces the issue you're trying to fix

@joyCurry30
Copy link
Author

@joyCurry30 I'll take a look early next week but this should at least have a test that reproduces the issue you're trying to fix

Ok, I will add a test for this issue.

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 25, 2024
@nastra nastra added not-stale and removed stale labels Nov 25, 2024
@nastra
Copy link
Contributor

nastra commented Nov 25, 2024

@joyCurry30 please ping me once you added the test and I'll take a look

@joyCurry30 joyCurry30 force-pushed the Icebert-11214 branch 2 times, most recently from 73776cc to 3bc6636 Compare November 25, 2024 12:08
@joyCurry30
Copy link
Author

@nastra Sorry for replying so late. I've added the related test cases. Please review the code for me. If you have any suggestions, I would be happy to make the changes.

@nastra nastra requested a review from Fokko November 25, 2024 14:26
@nastra
Copy link
Contributor

nastra commented Dec 2, 2024

@Fokko could you review this one please?

@@ -49,7 +49,7 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
private final long targetRowGroupSize;
private final Map<String, String> metadata;
private final ParquetProperties props;
private final CodecFactory.BytesCompressor compressor;
private final CodecFactory.BytesInputCompressor compressor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be an unrelated change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will remove it.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the issue, this will enable reading files that have the field-id missing for the StructType. However, Field-IDs allow for safe evolution of an Iceberg table. With this patch, you will be able to read the table, but once you rename the column-name of the struct, it will break and brick the table again.

if (field.getId() != null) {
return field.getId().intValue();
}
return expected.field(field.getName()).fieldId();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially falling back to name-based schema resolution.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field-ids are null, how can I find the expected field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also left a comment on the issue itself #11214 (comment). I don't think this is a good solution since it hides the problem, as mentioned before, if you rename a struct field, it will break the table. Instead, I would leverage name mapping as suggested on the issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments, I will reconsider and try to re-address the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time, thank you for the PR, but I'm afraid that it will cause issues downstream

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants