-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Avro: Support partition values using a constants map #896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -41,7 +41,7 @@ | |||
} | |||
} else { | |||
byte[] bytes = new byte[buffer.remaining()]; | |||
buffer.get(bytes); | |||
buffer.asReadOnlyBuffer().get(bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick check on this method to make sure the incoming buffer is not modified and found that it is for off-heap buffers. This fixes the problem, but isn't really related. If anyone prefers, I can move this to a separate PR.
Opened #897 to track this for ORC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor comments
} | ||
|
||
private static Map<Integer, ?> constantsMap(PartitionSpec spec, StructLike partitionData) { | ||
// use java.util.HashMap because partition data may contain null values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true only for identity partitioning or in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general. All partition functions are required to return null
when applied to null
.
data/src/main/java/org/apache/iceberg/data/avro/DataReader.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
protected Object get(InternalRow struct, int pos) { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return the obj at pos here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous version of this code didn't implement reuse, probably because it needs the Spark type to correctly fetch the value. Rather than fix that in this PR, we can implement reuse in another one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
@rdsr, tests are passing. Could you have another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 . LGTM. Will keep it open for a day for others to review
Thanks for reviewing! |
This is a follow-up to apache#896, which added the same constant map support for Avro. Fixes apache#575 for Parquet and replaces apache#585. Andrei did a lot of the work for this in apache#585. Co-authored-by: Andrei Ionescu <webdev.andrei@gmail.com>
* Move Boson iceberg code to iceberg repo * fix * rename * addres comments * remove wrong change * address comments * fix import order * address comments * fix comment * change to private static class
This updates Avro readers to support passing a map from field IDs to constants. This is used to inject constant values from partition data.
Unlike similar changes for Parquet (see #585), this doesn't replace the reader for a field because the readers still need to be called to update the decoder state by consuming a value from the stream. Instead, this adds a phase after reading a struct when the constants are set. This happens whether or not the data file contained a given field.
This is needed for partition values in InputFormats, #843, and is the Avro part of fixing #575.