
Conversation

@voonhous (Member) commented Oct 24, 2025

This PR adds Hudi CDC sink support to Flink CDC.

As of now the following features are supported:

  1. Simple bucket index
  2. Non-partitioned tables
  3. MOR tables
  4. Compaction plan generation (Compaction execution will require a separate process as of now)

Future improvements will bring support for other native Hudi features gradually/iteratively, as we are trying to keep this PR small and manageable for review.

@voonhous (Member Author)

@danny0405 @cshuo FYI

@voonhous (Member Author)

Changes here will require Hudi 1.1.0 to be released first.

@cshuo left a comment

@voonhous thanks for the PR. Can you also describe the scope of the PR for the Hudi CDC sink, e.g., which index types and table service (compaction) modes are supported?

*/
private void processFlushForTableFunction(
        EventBucketStreamWriteFunction tableFunction, Event flushEvent) {
    try {

No need to use reflection now? Call tableFunction.flushRemaining(false) directly.

@voonhous (Member Author)

Done

}

// Extract record key from event data using cached field getters
String recordKey = extractRecordKeyFromEvent(dataChangeEvent);

The record key can be obtained from HoodieFlinkInternalRow directly by calling HoodieFlinkInternalRow#getRecordKey(), so extractRecordKeyFromEvent is unnecessary and primaryKeyFieldGetters can be removed.

@voonhous (Member Author)

Done


/** Base infrastructures for streaming writer function to handle Events. */
public abstract class EventStreamWriteFunction extends AbstractStreamWriteFunction<Event>
        implements EventProcessorFunction {
@cshuo (Oct 29, 2025)

we should make minimal changes to StreamWriteFunction and BucketStreamWriteFunction; the generic type should be kept as HoodieFlinkInternalRow. We can confine operations on Event to MultiTableEventStreamWriteFunction, and StreamWriteFunction only needs to provide the following operations (a rough sketch follows this list):

  • processData(HoodieFlinkInternalRow): DataChangeEvent can be converted to HoodieFlinkInternalRow in MultiTableEventStreamWriteFunction.
  • flushRemaining(): called when a flush event is received.
  • updateSchema()?: called when a schema change event is received and the inner schema or related fields, like index fields, need updating.
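A rough sketch of that shape; the method names follow this comment and are assumptions, not a final API:

import org.apache.flink.cdc.common.schema.Schema;
import org.apache.hudi.client.model.HoodieFlinkInternalRow; // package per Hudi's Flink module

/** Sketch only: the per-table operations StreamWriteFunction would expose. */
public interface PerTableWriteOperations {

    /** Buffers one record already converted from a DataChangeEvent. */
    void processData(HoodieFlinkInternalRow record);

    /** Flushes buffered records; called when a flush event is received. */
    void flushRemaining(boolean endInput);

    /** Applies a schema change, e.g. refreshing the inner schema and index fields. */
    void updateSchema(Schema newSchema);
}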

@cshuo (Oct 29, 2025)

Seems there is no need to implement EventProcessorFunction; processSchemaChange and processFlush of EventStreamWriteFunction will actually never be called.

@voonhous (Member Author)

Okay, I tried this, and I remember what the problem was:

Event-to-HoodieFlinkInternalRow conversion in MultiTableEventStreamWriteFunction:

  1. The HoodieFlinkInternalRow constructor requires fileId and instantTime upfront
  2. These values come from defineRecordLocation(), which needs the bucket number
  3. A HoodieFlinkInternalRow therefore cannot be created before calling defineRecordLocation()


fileId and instantTime are not required to construct HoodieFlinkInternalRow, these two fields are later set in defineRecordLocation().
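A hedged sketch of that flow; the constructor and setter shapes here are assumptions inferred from this thread, so verify them against the Hudi 1.1.0 API:

import org.apache.flink.table.data.RowData;
import org.apache.hudi.client.model.HoodieFlinkInternalRow; // package assumed

/** Sketch only; constructor/setter shapes are assumptions from this discussion. */
final class RecordLocationSketch {

    // Assumed: a constructor without fileId/instantTime exists.
    static HoodieFlinkInternalRow toInternalRow(
            String recordKey, String partitionPath, String operationType, RowData rowData) {
        return new HoodieFlinkInternalRow(recordKey, partitionPath, operationType, rowData);
    }

    // Assumed: the location is attached later, inside defineRecordLocation(),
    // once the bucket is known.
    static void defineRecordLocation(
            HoodieFlinkInternalRow row, String fileId, String instantTime) {
        row.setFileId(fileId);
        row.setInstantTime(instantTime);
    }
}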

@voonhous (Member Author)

> Seems there is no need to implement EventProcessorFunction; processSchemaChange and processFlush of EventStreamWriteFunction will actually never be called.

Caused by: java.lang.RuntimeException: Failed to process schema event for table: hudi_inventory_bptbsn.products
	at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processSchemaChange(MultiTableEventStreamWriteFunction.java:296)
	at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processElement(MultiTableEventStreamWriteFunction.java:167)
	at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processElement(MultiTableEventStreamWriteFunction.java:72)
	at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
	at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
	at org.apache.flink.cdc.connectors.hudi.sink.bucket.FlushEventAlignmentOperator.processElement(FlushEventAlignmentOperator.java:94)
	at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:238)
	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:157)
	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:114)
	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:638)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:973)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:917)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:970)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:949)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:763)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsupportedOperationException: #processSchemaChange should not be called
	at org.apache.flink.cdc.connectors.hudi.sink.function.EventBucketStreamWriteFunction.processSchemaChange(EventBucketStreamWriteFunction.java:158)
	at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processSchemaChange(MultiTableEventStreamWriteFunction.java:293)
	... 24 more

It is being invoked.

@voonhous (Member Author)

Done

* <p>Assumes that CreateTableEvent will always arrive before DataChangeEvent for each table,
* following the standard CDC pipeline startup sequence.
*/
public class HudiRecordEventSerializer implements HudiRecordSerializer<Event> {
@cshuo (Oct 29, 2025)

Seems HudiRecordEventSerializer is designed to handle serialization for multiple tables. As with the comments on EventStreamWriteFunction, could HudiRecordEventSerializer be a field of MultiTableStreamWriteOperatorCoordinator, serializing data change events to HoodieFlinkInternalRow, which are then dispatched to the corresponding table write functions?

@voonhous (Member Author)

Done

// - Data events go to their specific bucket's task
DataStream<BucketWrapper> partitionedStream =
        bucketAssignedStream.partitionCustom(
                (key, numPartitions) -> key % numPartitions,

Maybe we should also consider the data skew problem, since there are records from multiple tables & partitions. You can refer to BucketIndexUtil#getPartitionIndexFunc.

@voonhous (Member Author)

Done

DataChangeEvent dataChangeEvent, Schema schema) {
    List<String> partitionKeys = schema.partitionKeys();
    if (partitionKeys == null || partitionKeys.isEmpty()) {
        return "default";

should be "" here?

@voonhous (Member Author)

Yep, good catch, fixed.


}

/**
 * Calculate bucket from HoodieFlinkInternalRow using the record key. The record key is already

Are we going to support bucketing by hoodie.bucket.index.hash.field?

@voonhous (Member Author)

Not yet; I was planning on standardising everything to use record keys first. Since there is an orthogonal discussion on config, I wanted to leave this for a separate exercise.

String instantTime) {

// Extract record key from primary key fields
String recordKey = extractRecordKeyFromDataChangeEvent(dataChangeEvent, schema);

can we use RowDataKeyGen to get record key and partition path directly?
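For reference, a minimal sketch of that suggestion; the RowDataKeyGen factory and method names are taken from Hudi's Flink module, but double-check the package and signatures for your Hudi version:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;
import org.apache.hudi.sink.transform.RowDataKeyGen; // verify package for your Hudi version

/** Sketch: derive key and partition via Hudi's key generator instead of by hand. */
final class KeyGenSketch {

    static String[] recordKeyAndPartition(Configuration conf, RowType rowType, RowData row) {
        // RowDataKeyGen reads the configured key/partition fields from conf,
        // replacing hand-rolled extraction from the change event.
        RowDataKeyGen keyGen = RowDataKeyGen.instance(conf, rowType);
        return new String[] {keyGen.getRecordKey(row), keyGen.getPartitionPath(row)};
    }
}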

@voonhous (Member Author)

Done

@voonhous force-pushed the hudi-connector-rework-push-to-origin branch from 9f52239 to 6254c7f (November 5, 2025 06:53)
@cshuo left a comment

Thanks @voonhous for updating, overall LGTM now. Left some minor comments.


public void endInput() {
    super.endInput();
    flushRemaining(true);

flushRemaining(true); is not needed here?

* <p>Implements {@link EventProcessorFunction#processDataChange(DataChangeEvent)}.
*/
@Override
public void processDataChange(DataChangeEvent event) throws Exception {

This method seems useless. Add the processDataChange method at line 356 as an API in EventProcessorFunction?

});
// Ensure tableFunction is initialized
getOrCreateTableFunction(tableId);
} else if (event instanceof SchemaChangeEvent) {

event is always a SchemaChangeEvent

}
}

public static void createHudiTablePath(Configuration config, TableId tableId)

tableId is not used.

throw new RuntimeException(
        "Failed during first-time initialization for table: " + tId,
        e);
}

initializedTables.put(tableId, true); should be put here instead of inside getOrCreateTableFunction?

@voonhous (Member Author)

initializedTables.computeIfAbsent(
        tableId,
        tId -> {
            return true; // this value gets inserted, i.e. initializedTables.put(tableId, true)
        });

computeIfAbsent inserts true here, so there is no need to call put again.


I have fixed the relevant calls and lifecycle management of this map.

}

private MultiTableWriteOperator(
        Configuration config,

config is not used.

    };
    break;
case TINYINT:
    fieldGetter = row -> row.getBoolean(fieldPos);

getByte(fieldPos)?
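i.e. a sketch of the suggested change:

case TINYINT:
    // read the byte directly instead of going through getBoolean
    fieldGetter = row -> row.getByte(fieldPos);
    break;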

@voonhous (Member Author)

Same underlying implementation with more validation; let's keep it.

int bucketNumber = BucketIdentifier.getBucketId(recordKey, tableIndexKeyFields, numBuckets);

// Use partition function to map bucket to task index for balanced distribution
int taskIndex = partitionIndexFunc.apply(numBuckets, partition, bucketNumber);

partitionIndexFunc from the hudi repo is designed for a single table; here records may come from different tables, so maybe we can use tableId + "_" + partition instead of partition here?
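The fix would look roughly like the following; using tableId.identifier() as the qualifier is illustrative:

// Sketch: qualify the partition with the table id so Hudi's single-table
// partition-index function also balances records across tables.
String tableAwarePartition = tableId.identifier() + "_" + partition;
int taskIndex = partitionIndexFunc.apply(numBuckets, tableAwarePartition, bucketNumber);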

@voonhous (Member Author)

Makes sense, good catch!

.validateExcept(PREFIX_TABLE_PROPERTIES, PREFIX_CATALOG_PROPERTIES);

FactoryHelper.DefaultContext factoryContext = (FactoryHelper.DefaultContext) context;
Configuration config = factoryContext.getFactoryConfiguration();

Can we add some validity checks here, like an index type check?

@voonhous (Member Author)

Added a validation to ensure that the index type is BUCKET, as that is the only index we are supporting.
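A minimal sketch of such a guard; the wiring around it (where the index type string comes from) is illustrative:

/** Sketch: fail fast in the factory when the configured index type is not BUCKET. */
static void validateIndexType(String indexType) {
    if (!"BUCKET".equalsIgnoreCase(indexType)) {
        throw new IllegalArgumentException(
                "Hudi CDC sink currently supports only the BUCKET index type, but got: "
                        + indexType);
    }
}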


<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-flink1.20.x</artifactId>

use ${flink.major.version}?
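i.e. something like the following; the hudi.version property name is an assumption:

<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-flink${flink.major.version}.x</artifactId>
    <version>${hudi.version}</version> <!-- property name assumed -->
</dependency>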


public MultiTableStreamWriteOperatorCoordinator(Configuration conf, Context context) {
    super(conf, context);
    conf.setString("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");

what's this config used for?

@voonhous (Member Author)

Oh, I was trying to fix some URI issue, but the root cause of the problem was actually the classloader; will remove.

// Ensure the table's filesystem structure exists before creating a client.
StreamerUtil.initTableIfNotExists(tableConfig);
HoodieFlinkWriteClient<?> writeClient =
        FlinkWriteClients.createWriteClient(tableConfig);

FlinkWriteClients.createWriteClient(tableConfig) is used for the driver; the embedded timeline service is hard-coded to true. Is that as expected?

@voonhous (Member Author, Nov 6, 2025)

My chain of thought is that it doesn't really matter. I have already created a coordinator write client which is running an embedded timeline server that all the TMs will connect to.

For the other tableConfigs, I wanted to set the embedded timeline service to false so that they do not start an embedded timeline server. FWIU, these write clients will mainly be used for making commits to the timeline and do not need any coordination, so they can just use a MEMORY-based timeline server/filesystem view, where they will refresh the file system view before commit, which might be safer.
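A sketch of that intent, assuming Hudi's standard hoodie.embed.timeline.server option is the switch in question:

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.config.HoodieWriteConfig;

/** Sketch: per-table write clients skip starting their own embedded timeline server. */
final class TimelineServerSketch {

    static void disableEmbeddedTimelineServer(Configuration tableConfig) {
        // Writers rely on the coordinator's embedded timeline server instead.
        tableConfig.setBoolean(HoodieWriteConfig.EMBED_TIMELINE_SERVER.key(), false);
    }
}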

// The baseConfig points to the dummy coordinator path.
// A .hoodie directory is required for the timeline server to start.
StreamerUtil.initTableIfNotExists(this.baseConfig);
this.timelineServerClient = FlinkWriteClients.createWriteClient(this.baseConfig);

The write client for each table on the writer task will load the view storage conf to get the config for the remote timeline service, and the view storage conf is located in table_base_path/.hoodie/.aux/view_storage_conf. Not sure whether the view storage conf file for each table is properly created.

@voonhous (Member Author)

This is the "coordinator" writeclient, it will start an embedded timeline server and all the other tables will use this "coordintor"'s timeline server for FileSystemView requests.


I mean the write clients in the writers get the ip/port conf of the timeline server through FileSystemViewStorageConfig, so each table should save its view storage properties properly in the coordinator.
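Concretely, each table's config would carry remote view-storage properties along these lines; the keys are Hudi's standard FileSystemViewStorageConfig keys, and the host/port would come from the coordinator's embedded server:

import org.apache.flink.configuration.Configuration;

/** Sketch: point a table's filesystem view at the coordinator's timeline server. */
final class ViewStorageSketch {

    static void pointToCoordinatorTimeline(Configuration tableConfig, String host, int port) {
        // REMOTE_FIRST falls back to a local view if the remote server is unreachable.
        tableConfig.setString("hoodie.filesystem.view.type", "REMOTE_FIRST");
        tableConfig.setString("hoodie.filesystem.view.remote.host", host);
        tableConfig.setInteger("hoodie.filesystem.view.remote.port", port);
    }
}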

@voonhous (Member Author)

Addressed the comments and manually verified that the view_storage_conf files are both pointing to the same timeline instance.


@cshuo left a comment

@voonhous Thanks for updating, LGTM. Left some minor comments. cc @danny0405, please also take a look.

LOG.error("Update table for CDC failed.", e);
throw e;
}


unnecessary change.

recordsInSnapshotPhase =
recordsInSnapshotPhase.stream().sorted().collect(Collectors.toList());
validateSinkResult(warehouse, database, "products", recordsInSnapshotPhase);
Thread.sleep(3600000L);

unnecessary change.


if (tableId == null) {
    LOG.warn("No tableId found for path: {}. Cannot process event.", tablePath);
    return;
@cshuo (Nov 7, 2025)

This is an unexpected case; should we also fail the job here?
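i.e., replacing the warn-and-return above with something along these lines:

if (tableId == null) {
    // Sketch: surface the inconsistency instead of silently dropping the event.
    throw new IllegalStateException("No tableId found for path: " + tablePath);
}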

JobStatus jobStatus = message.getJobState();
if (!expectedStatus.isTerminalState() && jobStatus.isTerminalState()) {
    try {
        Thread.sleep(50000);

is this necessary?

@voonhous (Member Author)

Nope, it was for debugging.

"taskmanager.numberOfTaskSlots: 10",
"parallelism.default: 4",
"execution.checkpointing.interval: 300",
"execution.checkpointing.interval: 30s",

is this necessary?

@voonhous (Member Author)

Nope, it was for debugging.

@voonhous force-pushed the hudi-connector-rework-push-to-origin branch 2 times, most recently from ad7a2e7 to 317439f (November 7, 2025 10:08)
@voonhous force-pushed the hudi-connector-rework-push-to-origin branch from 317439f to 13a2921 (November 7, 2025 10:10)
@voonhous force-pushed the hudi-connector-rework-push-to-origin branch 7 times, most recently from 249fa2b to ffce04e (November 21, 2025 08:10)
@voonhous force-pushed the hudi-connector-rework-push-to-origin branch from ffce04e to db61821 (November 21, 2025 08:36)
@voonhous (Member Author)

@lvyanquan @yuxiqian Bumped the version, but I am not sure why the CI is failing; the E2E tests are passing locally.

Can you please advise and help to review? Thank you!

@lvyanquan (Contributor)

We've re-triggered the CI tests, and a checkstyle issue was reported. You can fix it and trigger a new test run once it's resolved.

@cshuo commented Nov 24, 2025

@lvyanquan style issue is fixed, thanks for helping.

