
Conversation

@jonvex
Contributor

@jonvex jonvex commented Oct 26, 2025

Describe the issue this Pull Request addresses

PR #9743 added more schema evolution functionality and schema processing. However, it used the InternalSchema system for various operations such as fixing null ordering, reordering columns, and adding columns. At the time, InternalSchema only had a single Timestamp type, which was assumed to be micros when converting back to Avro. Therefore, if the schema provider had any timestamp-millis columns, the processed schema would end up labeling those columns as timestamp-micros.

PR #13711, which updated column stats with better support for logical types, fixed those schema issues, along with additional issues in the handling and conversion of timestamps during ingestion.

This PR adds functionality to the Spark and Hive readers and writers to automatically repair affected tables.
After switching to the 1.1 binary, the affected columns undergo evolution from timestamp-micros to timestamp-millis. Normally this would be a lossy evolution that is not supported, but here it is safe because the data is actually still timestamp-millis; it is just mislabeled as micros in the Parquet and table schemas.
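
For illustration, a minimal sketch (not code from this PR) of the two Avro logical types involved; in both cases the underlying type is a plain long, which is why relabeling a mislabeled timestamp-micros column back to timestamp-millis does not touch any stored values:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampLabelsSketch {
  public static void main(String[] args) {
    // What the affected data actually is: epoch millis stored in a long.
    Schema millis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    // What the affected table/parquet schemas claim: epoch micros stored in a long.
    Schema micros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

    // Only the logical-type annotation differs; the physical type is LONG in both cases.
    System.out.println(millis); // {"type":"long","logicalType":"timestamp-millis"}
    System.out.println(micros); // {"type":"long","logicalType":"timestamp-micros"}
  }
}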

Summary and Changelog

When reading from a Hudi table with the Spark or Hive reader, if the table schema has a column as timestamp-millis but the data (file) schema has it as timestamp-micros, we assume the column is affected and read it as a millis value instead of a micros value. This correction is also applied to all readers used by the default write paths, so as a table is rewritten its Parquet files become correct. A table's latest snapshot can be fixed immediately by writing one commit with the 1.1 binary and then clustering the entire table.
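
The correction rule can be sketched as follows (hypothetical helper, not the actual AvroSchemaRepair/SchemaRepair implementation): when the table schema declares a long field as timestamp-millis but the file schema declares it as timestamp-micros, the table schema wins and the value is read as millis.

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

final class TimestampRepairSketch {
  // Returns the schema to use when decoding a primitive field from the data file.
  // If the table schema says timestamp-millis but the file schema says timestamp-micros,
  // the file is assumed to be mislabeled and the table schema wins.
  static Schema chooseFieldSchema(Schema tableFieldSchema, Schema fileFieldSchema) {
    boolean tableIsMillis = tableFieldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis;
    boolean fileIsMicros = fileFieldSchema.getLogicalType() instanceof LogicalTypes.TimestampMicros;
    if (tableIsMillis && fileIsMicros && fileFieldSchema.getType() == Schema.Type.LONG) {
      return tableFieldSchema; // read the long as millis; no value conversion needed
    }
    return fileFieldSchema;
  }
}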

Impact

Repair affected tables

Risk Level

High. Extensive testing was done and functional tests were added.

Documentation Update

#14100

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Oct 26, 2025
@jonvex jonvex mentioned this pull request Oct 26, 2025
3 tasks
@yihua yihua added this to the release-1.1.0 milestone Oct 27, 2025
Comment on lines +1211 to +1213
if (oldSchema.getLogicalType() == null || newSchema.getLogicalType() == null) {
  return oldValue;
} else if (oldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis) {
Contributor

Is this change needed?

Contributor Author

Yeah, Aditya was getting ingest failures from this.
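
For context on why the null check matters, a hedged sketch of timestamp handling during record rewriting (illustrative names; it only considers the millis/micros pair and is not this PR's exact logic): with logical types present on both sides the long value is converted between units, and with a logical type missing on either side the value passes through unchanged.

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

final class TimestampValueRewriteSketch {
  // Rewrites a long timestamp value from the old schema's unit to the new schema's unit.
  static long rewriteTimestamp(long oldValue, Schema oldSchema, Schema newSchema) {
    if (oldSchema.getLogicalType() == null || newSchema.getLogicalType() == null) {
      return oldValue; // no logical-type info on one side: pass the value through unchanged
    }
    boolean oldIsMillis = oldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis;
    boolean newIsMillis = newSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis;
    if (oldIsMillis && !newIsMillis) {
      return oldValue * 1000L;  // millis -> micros
    }
    if (!oldIsMillis && newIsMillis) {
      return oldValue / 1000L;  // micros -> millis
    }
    return oldValue;            // same unit on both sides
  }
}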

@jonvex jonvex changed the title Timestamp logical final fix(ingest): Fix logical type handling in deltastreamer and repair affected tables Oct 29, 2025
@jonvex jonvex changed the title fix(ingest): Fix logical type handling in deltastreamer and repair affected tables fix(ingest): Repair affected logical timestamp milli tables Oct 29, 2025
// sure that in case the file-schema is not equal to read-schema we'd still
// be able to read that file (in case projection is a proper one)
Configuration hadoopConf = storage.getConf().unwrapCopyAs(Configuration.class);
Schema repairedFileSchema = AvroSchemaRepair.repairLogicalTypes(getSchema(), schema);
Contributor

Let's do this only if the table schema contains a logical ts-millis.

Contributor

The AvroSchemaRepair.repairLogicalTypes will return the original schema if there is no ts-millis.

Contributor

We'll follow up with additional minor perf optimization in a separate PR.
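
To illustrate the no-op behavior being discussed, a minimal sketch (hypothetical, not the actual AvroSchemaRepair code): if the table schema contains no timestamp-millis fields, the original file schema is returned unchanged, so the call is cheap for unaffected tables.

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

final class RepairNoOpSketch {
  static Schema repairLogicalTypes(Schema fileSchema, Schema tableSchema) {
    if (!containsTimestampMillis(tableSchema)) {
      return fileSchema; // fast path: nothing to repair, return the original schema
    }
    // ... otherwise walk both schemas and relabel mislabeled micros fields as millis ...
    return fileSchema;
  }

  static boolean containsTimestampMillis(Schema schema) {
    switch (schema.getType()) {
      case RECORD:
        return schema.getFields().stream().anyMatch(f -> containsTimestampMillis(f.schema()));
      case UNION:
        return schema.getTypes().stream().anyMatch(RepairNoOpSketch::containsTimestampMillis);
      case ARRAY:
        return containsTimestampMillis(schema.getElementType());
      case MAP:
        return containsTimestampMillis(schema.getValueType());
      default:
        return schema.getLogicalType() instanceof LogicalTypes.TimestampMillis;
    }
  }
}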

adjustConfToReadWithFileProduceMode(legacyMode, configuration);
ReadContext readContext = super.init(configuration, keyValueMetaData, fileSchema);
MessageType requestedSchema = readContext.getRequestedSchema();
MessageType requestedSchema = SchemaRepair.repairLogicalTypes(readContext.getRequestedSchema(), tableSchema);
Contributor

same comment as above.

Contributor

The AvroSchemaRepair.repairLogicalTypes will return the original schema if there is no ts-millis.
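
The same fast-path argument applies at the Parquet level, where the call above operates on a MessageType. A hedged sketch (hypothetical helper, not the PR's SchemaRepair class) of detecting timestamp-millis annotations before attempting any repair:

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.Type;

final class ParquetMillisCheckSketch {
  // Returns true if any primitive field in the group carries a TIMESTAMP(MILLIS) annotation.
  static boolean containsTimestampMillis(GroupType group) {
    for (Type field : group.getFields()) {
      if (!field.isPrimitive()) {
        if (containsTimestampMillis(field.asGroupType())) {
          return true;
        }
      } else if (field.getLogicalTypeAnnotation()
          instanceof LogicalTypeAnnotation.TimestampLogicalTypeAnnotation) {
        LogicalTypeAnnotation.TimestampLogicalTypeAnnotation ts =
            (LogicalTypeAnnotation.TimestampLogicalTypeAnnotation) field.getLogicalTypeAnnotation();
        if (ts.getUnit() == LogicalTypeAnnotation.TimeUnit.MILLIS) {
          return true;
        }
      }
    }
    return false;
  }
}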

@yihua yihua force-pushed the timestamp_logical_final branch from 15a5d8d to d23db24 Compare November 6, 2025 22:35
@yihua yihua force-pushed the timestamp_logical_final branch from 4a63d1b to 042caf3 Compare November 7, 2025 23:08
@yihua yihua mentioned this pull request Nov 7, 2025
3 tasks
@nsivabalan nsivabalan force-pushed the timestamp_logical_final branch from ffe8d53 to 4446be8 Compare November 10, 2025 22:54
checkArgument(type != HoodieRecordType.SPARK, "Not support read avro to spark record");
// TODO AvroSparkReader need
RecordIterator iterator = RecordIterator.getInstance(this, content);
RecordIterator iterator = RecordIterator.getInstance(this, content, true);
Contributor

Should this take the config to determine if the timestamp millis field should be repaired?

Contributor

Oh, I thought you already reviewed these changes. I was just assisting Vamsi and Lin on triaging test failures from CI.

Contributor

looks like a gap which needs to be fixed.

Contributor

The logic still guarantees correctness, just that it can incur additional schema comparison which is a minor overhead. We'll address this perf optimization in a separate PR.

@nsivabalan nsivabalan marked this pull request as ready for review November 10, 2025 22:56

if (recordNeedsRewriteForExtendedAvroTypePromotion(writerSchema, readerSchema)) {
this.reader = new GenericDatumReader<>(writerSchema, writerSchema);
Schema repairedWriterSchema = AvroSchemaRepair.repairLogicalTypes(writerSchema, readerSchema);
Contributor

Should we also check the flag enableLogicalTimestampFieldRepair?

Contributor

We're prioritizing correctness in this PR. The schema repair overhead is minimal. We'll take the minor perf optimization separately.
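
To make the correctness stake concrete, a small illustrative example (hypothetical value, not from the PR) of what happens if the stored long is decoded at the wrong precision:

import java.time.Instant;

final class MislabelEffectSketch {
  public static void main(String[] args) {
    long storedValue = 1_730_000_000_000L; // written as epoch millis (late 2024)

    Instant readAsMillis = Instant.ofEpochMilli(storedValue);
    Instant readAsMicros = Instant.ofEpochSecond(storedValue / 1_000_000L,
        (storedValue % 1_000_000L) * 1_000L);

    System.out.println(readAsMillis); // correct: a timestamp in 2024
    System.out.println(readAsMicros); // wrong: interpreted as micros, lands in January 1970
  }
}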

// the default iterator mode is engine-specific record mode
private IteratorMode iteratorMode = IteratorMode.ENGINE_RECORD;
protected final HoodieConfig hoodieReaderConfig;
private boolean enableLogicalTimestampFieldRepair = true;
Contributor

@danny0405 danny0405 Nov 11, 2025

Can we disable it by default to avoid schema parse overhead for other engines?

Contributor

Same as the other comment that we're prioritizing correctness in this PR.
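
A minimal sketch of the gating being discussed (illustrative; the flag name comes from the snippet above, everything else is hypothetical), showing how a reader flag could skip the repair entirely for engines that never produced mislabeled files:

import java.util.function.BinaryOperator;
import org.apache.avro.Schema;

final class GatedRepairSketch {
  private final boolean enableLogicalTimestampFieldRepair;
  private final BinaryOperator<Schema> repairFn; // e.g. AvroSchemaRepair::repairLogicalTypes

  GatedRepairSketch(boolean enableLogicalTimestampFieldRepair, BinaryOperator<Schema> repairFn) {
    this.enableLogicalTimestampFieldRepair = enableLogicalTimestampFieldRepair;
    this.repairFn = repairFn;
  }

  Schema maybeRepair(Schema fileSchema, Schema tableSchema) {
    // When the flag is off, skip the extra schema walk entirely; when on, delegate to the repair routine.
    return enableLogicalTimestampFieldRepair ? repairFn.apply(fileSchema, tableSchema) : fileSchema;
  }
}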

@nsivabalan
Contributor

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua yihua merged commit fd79a16 into apache:master Nov 11, 2025
70 checks passed
yihua added a commit to yihua/hudi that referenced this pull request Nov 11, 2025
…4161)

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Vamsi <vamsi@onehouse.ai>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Lin Liu <linliu.code@gmail.com>
val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)

spark.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", supportVectorizedRead.toString)
Member

@TheR1sing3un TheR1sing3un Dec 4, 2025

A friendly reminder:
If we modify this configuration in the Spark sessionState conf inside the Hudi logic, it may disrupt the read logic of other datasources.
For example, suppose this configuration is initially set to true and a Spark SQL query reads both a Hudi table and another datasource table such as a Hive table. The desired behavior is that whether Hudi performs vectorized reading is controlled by the Hudi logic itself, while Hive directly performs vectorized reading.
However, if we change this configuration here, it may cause Hive to not perform vectorized reading.

cc @jonvex @yihua

Member

Same problem in BaseFileOnlyRelation.scala: #9129 #10134

Contributor

@danny0405 danny0405 Dec 4, 2025

You mean the override of the conf in the Spark session would affect the behavior of federated queries with multiple data sources in one Spark workload, like a Hive data source together with Hudi?

@yihua it looks like a regression though.

Member

yes
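
One way to avoid the cross-datasource side effect being flagged here (a hedged sketch only, not what the PR does) is to capture and restore the session-level value around the Hudi-specific read path:

import org.apache.spark.sql.SparkSession;

final class VectorizedReaderConfSketch {
  static void withVectorizedReaderSetting(SparkSession spark, boolean enable, Runnable readAction) {
    String key = "spark.sql.parquet.enableVectorizedReader";
    String previous = spark.conf().get(key, "true"); // remember the session-level value
    try {
      spark.conf().set(key, Boolean.toString(enable)); // apply the Hudi-specific setting
      readAction.run();
    } finally {
      spark.conf().set(key, previous); // restore, so other datasources keep their behavior
    }
  }
}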

linliu-code added a commit to linliu-code/hudi that referenced this pull request Dec 15, 2025
…4161)

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Vamsi <vamsi@onehouse.ai>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Lin Liu <linliu.code@gmail.com>

Labels

release-1.1.0 size:XL PR with lines of changes > 1000


8 participants