Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Ignore UnknownType in General Parquet Writer #12177

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Feb 5, 2025

Fixes: #12028

Not ready for review

An attempt to create an UnknownType to ignore corresponding record field for UnknownType

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some preliminary comments 👍

Comment on lines +59 to +62
// According to
// https://github.com/apache/parquet-java/blob/7f77908338192105a5adbfc420a7281d919e8596/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L992-L996
// This maps to UNKNOWN
private static final LogicalTypeAnnotation UNKNOWN = LogicalTypeAnnotation.intervalType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing the interval doesn't feel right to me 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree—it doesn’t feel quite right. The main issue is that we need a corresponding Parquet type to represent unknown, allowing us to construct the dummy writer and safely ignore it. Initially, I thought this wouldn't be a problem since Parquet defines an UNKNOWN logical type (docs). But unfortunately, it doesn’t exist in LogicalTypeAnnotation : (

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this means that it will end up in the Parquet schema, and that's not how it is defined in the spec:

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also up to interpretation, maybe it is okay to just write zero columns 🤔

Copy link
Contributor Author

@HonahX HonahX Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in the attempt I did I use different schema for parquetSchema and writerFactory

/ The third parameter is for ignoreUnknown = true
this.parquetSchema = ParquetSchemaUtil.convert(schema, "table", true);
this.model =
        (ParquetValueWriter<T>) createWriterFunc.apply(ParquetSchemaUtil.convert(schema, "table"));

This way the created parquet will not have the "unknown" column in the schema. Will think more about these tomorrow 🤔

Comment on lines +393 to +405
private static class UnknownWriter extends PrimitiveWriter<Object> {
private UnknownWriter(ColumnDescriptor desc) {
super(desc);
}

@Override
public void write(int repetitionLevel, Object value) {}

@Override
public List<TripleWriter<?>> columns() {
return ImmutableList.of();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding the actual writers, I think it would be better to see if we can just skip over the column when creating a writer in BaseParquetWriter.createWriter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that my mind gets stuck in the following dilemma:. Suppose the schema is (int, string, unknown, string); the record received by the writer is (1, "test1", null, "test2"). We must skip the null at index 2 when writing to Parquet.

@Override
public void write(int repetitionLevel, S value) {
for (int i = 0; i < writers.length; i += 1) {
Object fieldValue = get(value, i);
writers[i].write(repetitionLevel, fieldValue);
}
}

That leaves us with two options:

  1. Use a dummy writer (UnknownWriter here) for the unknown column so that nothing is written for it, while keeping the one-to-one correspondence between the record’s fields and the writer array.

  2. Change the StructWriter/RecordWriter to skip over the unknown column’s value. This would require maintaining additional mapping information between the record's full schema and the list of writers, which would involve a more extensive refactoring.

Given the potential impact of refactoring the StructWriter, I chose option 1—the dummy writer—as it maintains the field ordering without introducing significant changes. Do I miss anything? If so, I'd greatly appreciate any suggestions. Thanks in advance!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've played around with the code today, and I see the problem that you're encountering. Unfortunally, for the createWriterFunc we have the Parquet MessageType as an input, otherwise, we could skip over the Unkown fields pretty easily.

I think 1 could work, but we have to make sure that field doesn't leak into the Parquet schema (which is not allowed according to the spec).

@@ -71,6 +89,10 @@ public GroupType struct(StructType struct, Type.Repetition repetition, int id, S
Types.GroupBuilder<GroupType> builder = Types.buildGroup(repetition);

for (NestedField field : struct.fields()) {
if (ignoreUnknownFields
&& field.type().typeId() == org.apache.iceberg.types.Type.TypeID.UNKNOWN) {
continue;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My interpretation of unknown type: https://iceberg.apache.org/spec/#semi-structured-types

Default / null column type used when a more specific type is not known

—is that unknown should only be used as a column type, not as an inner type within a list or map. In fact, we can’t even "ignore" the element field of a list in Parquet.

Not sure if we should update the spec to make this more explicit, since lists can currently take "any" type.

@HonahX HonahX force-pushed the honahx_ignore_unknown branch from 527dbcd to f369d42 Compare March 10, 2025 01:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skip Writing Fields with UnknownType
3 participants