[WIP] Ignore UnknownType in General Parquet Writer #12177

HonahX · 2025-02-05T07:58:37Z

Fixes: #12028

Not ready for review

An attempt to create an UnknownType to ignore corresponding record field for UnknownType

Fokko

Just some preliminary comments 👍

Fokko · 2025-02-05T09:47:32Z

parquet/src/main/java/org/apache/iceberg/parquet/TypeToMessageType.java

+  // According to
+  // https://github.com/apache/parquet-java/blob/7f77908338192105a5adbfc420a7281d919e8596/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L992-L996
+  // This maps to UNKNOWN
+  private static final LogicalTypeAnnotation UNKNOWN = LogicalTypeAnnotation.intervalType();


Reusing the interval doesn't feel right to me 🤔

I agree—it doesn’t feel quite right. The main issue is that we need a corresponding Parquet type to represent unknown, allowing us to construct the dummy writer and safely ignore it. Initially, I thought this wouldn't be a problem since Parquet defines an UNKNOWN logical type (docs). But unfortunately, it doesn’t exist in LogicalTypeAnnotation : (

But this means that it will end up in the Parquet schema, and that's not how it is defined in the spec:

It is also up to interpretation, maybe it is okay to just write zero columns 🤔

So in the attempt I did I use different schema for parquetSchema and writerFactory

/ The third parameter is for ignoreUnknown = true this.parquetSchema = ParquetSchemaUtil.convert(schema, "table", true); this.model = (ParquetValueWriter<T>) createWriterFunc.apply(ParquetSchemaUtil.convert(schema, "table"));

This way the created parquet will not have the "unknown" column in the schema. Will think more about these tomorrow 🤔

Fokko · 2025-02-05T12:24:33Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueWriters.java

+  private static class UnknownWriter extends PrimitiveWriter<Object> {
+    private UnknownWriter(ColumnDescriptor desc) {
+      super(desc);
+    }
+
+    @Override
+    public void write(int repetitionLevel, Object value) {}
+
+    @Override
+    public List<TripleWriter<?>> columns() {
+      return ImmutableList.of();
+    }
+  }


Instead of adding the actual writers, I think it would be better to see if we can just skip over the column when creating a writer in BaseParquetWriter.createWriter

I feel that my mind gets stuck in the following dilemma:. Suppose the schema is (int, string, unknown, string); the record received by the writer is (1, "test1", null, "test2"). We must skip the null at index 2 when writing to Parquet.

iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueWriters.java

Lines 644 to 650 in 9981025

@Override

public void write(int repetitionLevel, S value) {

for (int i = 0; i < writers.length; i += 1) {

Object fieldValue = get(value, i);

writers[i].write(repetitionLevel, fieldValue);

}

}

That leaves us with two options:

Use a dummy writer (UnknownWriter here) for the unknown column so that nothing is written for it, while keeping the one-to-one correspondence between the record’s fields and the writer array.

Change the StructWriter/RecordWriter to skip over the unknown column’s value. This would require maintaining additional mapping information between the record's full schema and the list of writers, which would involve a more extensive refactoring.

Given the potential impact of refactoring the StructWriter, I chose option 1—the dummy writer—as it maintains the field ordering without introducing significant changes. Do I miss anything? If so, I'd greatly appreciate any suggestions. Thanks in advance!

I've played around with the code today, and I see the problem that you're encountering. Unfortunally, for the createWriterFunc we have the Parquet MessageType as an input, otherwise, we could skip over the Unkown fields pretty easily.

I think 1 could work, but we have to make sure that field doesn't leak into the Parquet schema (which is not allowed according to the spec).

parquet/src/test/java/org/apache/iceberg/parquet/TestParquetDataWriter.java

HonahX · 2025-02-12T07:16:11Z

parquet/src/main/java/org/apache/iceberg/parquet/TypeToMessageType.java

@@ -71,6 +89,10 @@ public GroupType struct(StructType struct, Type.Repetition repetition, int id, S
    Types.GroupBuilder<GroupType> builder = Types.buildGroup(repetition);

    for (NestedField field : struct.fields()) {
+      if (ignoreUnknownFields
+          && field.type().typeId() == org.apache.iceberg.types.Type.TypeID.UNKNOWN) {
+        continue;


My interpretation of unknown type: https://iceberg.apache.org/spec/#semi-structured-types

Default / null column type used when a more specific type is not known

—is that unknown should only be used as a column type, not as an inner type within a list or map. In fact, we can’t even "ignore" the element field of a list in Parquet.

Not sure if we should update the spec to make this more explicit, since lists can currently take "any" type.

github-actions bot added the parquet label Feb 5, 2025

Fokko reviewed Feb 5, 2025

View reviewed changes

nastra reviewed Feb 5, 2025

View reviewed changes

parquet/src/test/java/org/apache/iceberg/parquet/TestParquetDataWriter.java Outdated Show resolved Hide resolved

HonahX commented Feb 12, 2025

View reviewed changes

HonahX added 5 commits March 9, 2025 19:52

initial draft

0f14e2e

ignore Unknown fields in when creating parquetSchema

d46db94

Fix tests in TestParquetDataWriter

1b6e0a9

Fix style

15c0eb6

Fix test

f369d42

HonahX force-pushed the honahx_ignore_unknown branch from 527dbcd to f369d42 Compare March 10, 2025 01:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Ignore UnknownType in General Parquet Writer #12177

[WIP] Ignore UnknownType in General Parquet Writer #12177

HonahX commented Feb 5, 2025

Fokko left a comment

Fokko Feb 5, 2025

HonahX Feb 12, 2025

Fokko Feb 13, 2025

Fokko Feb 13, 2025

HonahX Feb 14, 2025 •

edited

Loading

Fokko Feb 5, 2025

HonahX Feb 6, 2025

Fokko Feb 7, 2025

HonahX Feb 12, 2025

	@Override
	public void write(int repetitionLevel, S value) {
	for (int i = 0; i < writers.length; i += 1) {
	Object fieldValue = get(value, i);
	writers[i].write(repetitionLevel, fieldValue);
	}
	}

[WIP] Ignore UnknownType in General Parquet Writer #12177

Are you sure you want to change the base?

[WIP] Ignore UnknownType in General Parquet Writer #12177

Conversation

HonahX commented Feb 5, 2025

Fokko left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HonahX Feb 14, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HonahX Feb 14, 2025 •

edited

Loading