-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Ignore UnknownType in General Parquet Writer #12177
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some preliminary comments 👍
// According to | ||
// https://github.com/apache/parquet-java/blob/7f77908338192105a5adbfc420a7281d919e8596/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L992-L996 | ||
// This maps to UNKNOWN | ||
private static final LogicalTypeAnnotation UNKNOWN = LogicalTypeAnnotation.intervalType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reusing the interval
doesn't feel right to me 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree—it doesn’t feel quite right. The main issue is that we need a corresponding Parquet type to represent unknown
, allowing us to construct the dummy writer and safely ignore it. Initially, I thought this wouldn't be a problem since Parquet defines an UNKNOWN
logical type (docs). But unfortunately, it doesn’t exist in LogicalTypeAnnotation : (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also up to interpretation, maybe it is okay to just write zero columns 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the attempt I did I use different schema for parquetSchema
and writerFactory
/ The third parameter is for ignoreUnknown = true
this.parquetSchema = ParquetSchemaUtil.convert(schema, "table", true);
this.model =
(ParquetValueWriter<T>) createWriterFunc.apply(ParquetSchemaUtil.convert(schema, "table"));
This way the created parquet will not have the "unknown" column in the schema. Will think more about these tomorrow 🤔
private static class UnknownWriter extends PrimitiveWriter<Object> { | ||
private UnknownWriter(ColumnDescriptor desc) { | ||
super(desc); | ||
} | ||
|
||
@Override | ||
public void write(int repetitionLevel, Object value) {} | ||
|
||
@Override | ||
public List<TripleWriter<?>> columns() { | ||
return ImmutableList.of(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding the actual writers, I think it would be better to see if we can just skip over the column when creating a writer in BaseParquetWriter.createWriter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that my mind gets stuck in the following dilemma:. Suppose the schema is (int, string, unknown, string); the record received by the writer is (1, "test1", null, "test2"). We must skip the null at index 2 when writing to Parquet.
iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueWriters.java
Lines 644 to 650 in 9981025
@Override | |
public void write(int repetitionLevel, S value) { | |
for (int i = 0; i < writers.length; i += 1) { | |
Object fieldValue = get(value, i); | |
writers[i].write(repetitionLevel, fieldValue); | |
} | |
} |
That leaves us with two options:
-
Use a dummy writer (
UnknownWriter
here) for the unknown column so that nothing is written for it, while keeping the one-to-one correspondence between the record’s fields and the writer array. -
Change the
StructWriter
/RecordWriter
to skip over the unknown column’s value. This would require maintaining additional mapping information between the record's full schema and the list of writers, which would involve a more extensive refactoring.
Given the potential impact of refactoring the StructWriter, I chose option 1—the dummy writer—as it maintains the field ordering without introducing significant changes. Do I miss anything? If so, I'd greatly appreciate any suggestions. Thanks in advance!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've played around with the code today, and I see the problem that you're encountering. Unfortunally, for the createWriterFunc
we have the Parquet MessageType
as an input, otherwise, we could skip over the Unkown fields pretty easily.
I think 1 could work, but we have to make sure that field doesn't leak into the Parquet schema (which is not allowed according to the spec).
parquet/src/test/java/org/apache/iceberg/parquet/TestParquetDataWriter.java
Outdated
Show resolved
Hide resolved
@@ -71,6 +89,10 @@ public GroupType struct(StructType struct, Type.Repetition repetition, int id, S | |||
Types.GroupBuilder<GroupType> builder = Types.buildGroup(repetition); | |||
|
|||
for (NestedField field : struct.fields()) { | |||
if (ignoreUnknownFields | |||
&& field.type().typeId() == org.apache.iceberg.types.Type.TypeID.UNKNOWN) { | |||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My interpretation of unknown
type: https://iceberg.apache.org/spec/#semi-structured-types
Default / null column type used when a more specific type is not known
—is that unknown
should only be used as a column type, not as an inner type within a list or map. In fact, we can’t even "ignore" the element field of a list in Parquet.
Not sure if we should update the spec to make this more explicit, since lists can currently take "any" type.
527dbcd
to
f369d42
Compare
Fixes: #12028
Not ready for review
An attempt to create an UnknownType to ignore corresponding record field for UnknownType