Skip to content

Avro: Support partition values using a constants map #896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 9, 2020

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Apr 6, 2020

This updates Avro readers to support passing a map from field IDs to constants. This is used to inject constant values from partition data.

Unlike similar changes for Parquet (see #585), this doesn't replace the reader for a field because the readers still need to be called to update the decoder state by consuming a value from the stream. Instead, this adds a phase after reading a struct when the constants are set. This happens whether or not the data file contained a given field.

This is needed for partition values in InputFormats, #843, and is the Avro part of fixing #575.

@rdblue rdblue requested a review from rdsr April 6, 2020 20:50
@@ -41,7 +41,7 @@
}
} else {
byte[] bytes = new byte[buffer.remaining()];
buffer.get(bytes);
buffer.asReadOnlyBuffer().get(bytes);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick check on this method to make sure the incoming buffer is not modified and found that it is for off-heap buffers. This fixes the problem, but isn't really related. If anyone prefers, I can move this to a separate PR.

@rdblue
Copy link
Contributor Author

rdblue commented Apr 6, 2020

Opened #897 to track this for ORC.

Copy link
Contributor

@rdsr rdsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor comments

}

private static Map<Integer, ?> constantsMap(PartitionSpec spec, StructLike partitionData) {
// use java.util.HashMap because partition data may contain null values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true only for identity partitioning or in general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general. All partition functions are required to return null when applied to null.

}
}
protected Object get(InternalRow struct, int pos) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return the obj at pos here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous version of this code didn't implement reuse, probably because it needs the Spark type to correctly fetch the value. Rather than fix that in this PR, we can implement reuse in another one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

@rdblue
Copy link
Contributor Author

rdblue commented Apr 7, 2020

@rdsr, tests are passing. Could you have another look?

Copy link
Contributor

@rdsr rdsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 . LGTM. Will keep it open for a day for others to review

@rdblue
Copy link
Contributor Author

rdblue commented Apr 8, 2020

Thanks for reviewing!

@rdblue rdblue merged commit b47e977 into apache:master Apr 9, 2020
rdblue added a commit that referenced this pull request Apr 11, 2020
This is a follow-up to #896, which added the same constant map support for Avro.

Fixes #575 for Parquet and replaces #585. Andrei did a lot of the work for this in #585.

Co-authored-by: Andrei Ionescu <webdev.andrei@gmail.com>
Fokko pushed a commit to Fokko/iceberg that referenced this pull request Apr 21, 2020
This is a follow-up to apache#896, which added the same constant map support for Avro.

Fixes apache#575 for Parquet and replaces apache#585. Andrei did a lot of the work for this in apache#585.

Co-authored-by: Andrei Ionescu <webdev.andrei@gmail.com>
sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 9, 2023
* Move Boson iceberg code to iceberg repo

* fix

* rename

* addres comments

* remove wrong change

* address comments

* fix import order

* address comments

* fix comment

* change to private static class
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants