Flink: Fix range distribution npe when value is null #11662
Conversation
Thanks for the contributions. Should we better handle the null value instead of skipping it? Could you also add a UT to cover these changes?
@ConeyLiu Yes, I have added a UT to cover the relevant changes.
Seems like a hacky solution to me. We will lose all of the advantages of the range distribution mode for the affected records. Could we just serialize the
@pvary Yes, you are right. We changed our approach to handle null values instead of filtering them out. Before serializing each field we now write a flag, where true indicates that the field is null. During deserialization we read the flag first, which allows us to correctly identify nulls. The related code has been updated; please take a look to see if it is feasible.
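The flag-then-payload encoding described in this comment can be sketched with plain java.io streams (the helper names below are made up for illustration; the actual PR operates on Flink's SortKey fields and DataOutputView):

```java
import java.io.*;

// Sketch of null-flag field encoding: write a boolean flag before every
// field so that null values survive the round trip instead of throwing an NPE.
public class NullFlagFieldCodec {

  // Write a possibly-null long field: flag first, then the payload if present.
  public static void writeNullableLong(DataOutputStream out, Long value) {
    try {
      out.writeBoolean(value == null); // true means "field is null"
      if (value != null) {
        out.writeLong(value);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Read the flag first; only read the payload when the flag says non-null.
  public static Long readNullableLong(DataInputStream in) {
    try {
      if (in.readBoolean()) {
        return null;
      }
      return in.readLong();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The cost is one extra byte per field, which is the overhead discussed later in this thread.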
@Guosmilesmile: Thanks for the changes. Left some comments, but started the tests to see if this change causes any other test failures. Please remove the 1.19 and 1.18 changes for now; we usually cherry-pick the changes to the other versions after the original PR for the main version has been merged. This is better for the reviewer (smaller number of files) and better for the contributor (if there is a change request during the review, they don't have to keep the different versions in sync). It would be nice to have an end-to-end test for null values too.
@Guosmilesmile: Please apply spotless to remove the formatting failures.
@pvary Thank you very much for your guidance. I have made the changes according to your suggestions. Please take a look and see if they are feasible.
Thanks for the changes @Guosmilesmile!
@pvary I have added test cases for DataStatisticsOperator and DataStatisticsCoordinator, specifically for the null scenario. I verified the processElement and event-handling paths, simulating both data being processed by the operator and receiving statistics that contain null values. Could you please take a look and see if this is feasible? Thank you!
@pvary I have also added a unit test for the scenario of building from the table to the sink with data containing null values. Could you please review it as well? Thank you very much!
dataStatisticsCoordinator.start();
tasksReady(dataStatisticsCoordinator);
SortKey NullSortKey = Fixtures.SORT_KEY.copy();
nit: the variable name should start with a lowercase letter
@pvary I have changed it, please check, thank you.
@pvary I'm sorry, there was a small mistake just now. Please try again. Thank you very much.
@pvary I'm sorry, I just re-uploaded the missing parts. Please help me trigger the tests again. I apologize for the inconvenience.
@@ -124,6 +124,14 @@ public void serialize(SortKey record, DataOutputView target) throws IOException
for (int i = 0; i < size; ++i) {
  int fieldId = transformedFields[i].fieldId();
  Type.TypeID typeId = transformedFields[i].type().typeId();
  Object value = record.get(i, Object.class);
This will add a boolean-flag overhead for every field, but I can't think of any better alternative, so this is fine to me. However, ideally we need to figure out how to evolve it in a compatible way, probably by leveraging TypeSerializerSnapshot.
@stevenzwu I have looked at the current implementation in Flink; its serialization of null values is based on NullableSerializer. The underlying approach is also to add a boolean flag at the beginning of the field to distinguish nulls, which is quite similar to the approach in this PR. Can we use the current method to fix this issue for now, or is there a better way to handle it?
class NullableSerializer<T> extends TypeSerializer<T> {

  @Override
  public void serialize(T record, DataOutputView target) throws IOException {
    if (record == null) {
      target.writeBoolean(true);
      target.write(padding);
    } else {
      target.writeBoolean(false);
      originalSerializer.serialize(record, target);
    }
  }

  @Override
  public T deserialize(DataInputView source) throws IOException {
    boolean isNull = deserializeNull(source);
    return isNull ? null : originalSerializer.deserialize(source);
  }
}
@stevenzwu: What do you think about a serializer version, something like what SimpleVersionedSerializer uses? Maybe don't store the version for every key, but inherit the version from the statistics object?
Nulls are not allowed everywhere in Flink. Primitive types generally don't support null. In other places, Flink uses boolean flags to encode nulls, e.g. in PojoSerializer: https://github.com/apache/flink/blob/b9c92371dba07b6bfe4368d7b6d7f7c575b4c603/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L392
KeyBy operations are not allowed on null values.
+1 to Steven's suggestion to version the serializer. We already have SortKeySerializerSnapshot. We just need to add a version check for the new boolean serializer because this is an incompatible change which requires serializer migration.
@pvary Because a new flag has been added, the testSerializationSize test case needed to be updated to reflect the corresponding size (+1 byte for the flag). It has been adjusted. I have run the tests locally, and all unit tests pass. I would appreciate it if you could start the tests again.
The approach here generally looks good to me once we address the remaining comments.
@mxm Thank you very much for your suggestions. I need to add a version check in SortKeySerializerSnapshot. If the state is restored from an old version, it will directly return TypeSerializerSchemaCompatibility.incompatible(). Additionally, if the readVersion in readSnapshot is an old version, it will throw an exception indicating that the version is not supported. Is this approach feasible?
@mxm @stevenzwu @pvary Thank you all for your suggestions. I have submitted a version that mainly modifies the SortKeySerializerSnapshot and implements version detection. If the version does not match, it will return the corresponding message to avoid parsing exceptions and data errors when restoring from an old state into a new state. I would appreciate it if you could take a look and see if this modification is feasible.
@Guosmilesmile Ideally, we want to return compatibleAfterMigration() and make sure the old and new serializer can be instantiated.
@mxm I'm sorry, I'm a bit confused and would like to ask for your advice. Which part do I need to modify? Please provide detailed advice. Thank you very much.
Hey @Guosmilesmile! Please see the suggestions inline. Let me know if you have questions.
- if (readVersion == 1) {
-   readV1(in);
- } else {
-   throw new IllegalArgumentException("Unknown read version: " + readVersion);
+ switch (readVersion) {
+   case 1:
+     throw new UnsupportedOperationException(
+         String.format(
+             "No longer supported version [%s]. Please upgrade first . ", readVersion));
+   case 2:
+     readV2(in);
+     break;
Reading the snapshot isn't any different for V1/V2. The readV1() and readV2() methods are identical, so we can just keep the current one. However, we need to save the serializer version as an int field which we can read in restoreSerializer() to pass it into the serializer, just like the schema and the sort order. We can then serialize/deserialize depending on the version used.
For example, in the serializer we would do:
if (version == 1) {
  // Don't read/write null indicator
} else {
  // Read/write null indicator
}
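That version gate can be sketched as a toy, self-contained serializer (hypothetical class using plain java.io instead of Flink's DataInputView/DataOutputView): only the v2 format reads and writes the null indicator, while v1 keeps the legacy layout.

```java
import java.io.*;

// Toy sketch of a version-gated serializer: the version chosen at restore time
// decides whether the null flag exists in the byte stream.
public class VersionedNullableLongSerializer {
  private final int version; // 1 = legacy format (no null flag), 2 = null-aware

  public VersionedNullableLongSerializer(int version) {
    this.version = version;
  }

  public void serialize(Long value, DataOutputStream out) {
    try {
      if (version >= 2) {
        out.writeBoolean(value == null); // v2 only: null indicator first
        if (value == null) {
          return;
        }
      }
      out.writeLong(value); // a v1 serializer would NPE here on null, the original bug
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public Long deserialize(DataInputStream in) {
    try {
      if (version >= 2 && in.readBoolean()) {
        return null;
      }
      return in.readLong();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```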
- if (oldSerializerSnapshot.getCurrentVersion() != this.getCurrentVersion()) {
-   return TypeSerializerSchemaCompatibility.incompatible();
- }
+ if (oldSerializerSnapshot.getCurrentVersion() == 1 && this.getCurrentVersion() == 2) {
+   return TypeSerializerSchemaCompatibility.compatibleAfterMigration();
+ }
Hi @mxm! Thank you very much for your suggestions. Following your advice, I added a version number to SortKeySerializer through restoreSerializer, implementing different serialization methods for v1 and v2. Based on the behavior of the test job, the TaskManager achieves state compatibility, but the JobManager still fails to recover.

My approach is to save the version number of the SortKeySerializer in the checkpointData. This allows the CompletedStatisticsSerializer to be linked with the version of SortKeySerializer. CheckpointData is now checkpointId + StatisticsType + SortKey; we can change it to checkpointId + SortKeySerializerVersion + StatisticsType + SortKey. Because the SortKey serialization method changed, it is not possible to append the version number at the end of the checkpointData, as the last position is occupied by the SortKey; the version number can only be added at the beginning. I will modify the serialize and deserialize methods of CompletedStatisticsSerializer to provide both v1 and v2. During restore, we first attempt to parse with v2; if that fails, we try v1 to retrieve the version number of the state. After restoring the state with that version of SortKeySerializer, we switch SortKeySerializer to the latest version. Running the test job confirms this resolves the issue. The related code has been submitted, and unit tests have been added. Could you please take a look at whether this solution is feasible? Thank you very much! Here is the error log:
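The try-v2-then-v1 restore described above can be sketched with plain java.io (the layout and method names are illustrative, not the PR's actual code; a UTF string stands in for the real statistics payload):

```java
import java.io.*;

// Sketch of best-effort versioned restore: attempt the new v2 checkpoint
// layout first, and retry with the old v1 layout when parsing fails.
public class BestEffortRestore {

  // v1 layout: checkpointId (long) + payload
  public static byte[] writeV1(long checkpointId, String payload) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeLong(checkpointId);
      out.writeUTF(payload);
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // v2 layout: checkpointId (long) + serializer version (int) + payload
  public static byte[] writeV2(long checkpointId, String payload) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeLong(checkpointId);
      out.writeInt(2); // version sits right after the checkpoint id
      out.writeUTF(payload);
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static String deserializeV2(byte[] bytes) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
    in.readLong(); // checkpointId
    int version = in.readInt();
    if (version != 2) {
      throw new IOException("Unexpected serializer version: " + version);
    }
    return in.readUTF();
  }

  static String deserializeV1(byte[] bytes) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
    in.readLong(); // checkpointId
    return in.readUTF();
  }

  // Best-effort restore: v2 first, fall back to v1 on any parse failure.
  public static String restore(byte[] bytes) {
    try {
      return deserializeV2(bytes);
    } catch (IOException v2Failure) {
      try {
        return deserializeV1(bytes);
      } catch (IOException v1Failure) {
        throw new UncheckedIOException("Fail to deserialize aggregated statistics", v1Failure);
      }
    }
  }
}
```

As noted later in the thread, a v1 byte stream can occasionally parse "successfully" under the v2 layout with garbage values, which is why the final PR adds a content validation step on top of this retry.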
…rsion and null sort key, and add UT for the change (force-pushed from cd3d93d to a49b87b)
Thanks for the update @Guosmilesmile! Unfortunately, we don't have a way to encode the serializer version for all serializers, so a best-effort approach to retry with a different serializer version on restore failure may be the best we can do.
  this.version = 1;
  break;
case 2:
  readV1(in);
Can we remove the version suffix in the method name? It handles both versions.
@@ -93,15 +126,35 @@ public void serialize(CompletedStatistics record, DataOutputView target) throws

@Override
public CompletedStatistics deserialize(DataInputView source) throws IOException {
  long checkpointId = source.readLong();
  changeSortKeySerializerVersion(source.readInt());
AFAIK We were not writing the serializer version before this change. So not sure this works when restoring the serializer to read old data.
try {
  DataInputDeserializer input = new DataInputDeserializer(bytes);
  return statisticsSerializer.deserializeV1(input);
} catch (IOException ioException) {
  throw new UncheckedIOException("Fail to deserialize aggregated statistics", ioException);
}
I see, we're retrying here in case the restore fails.
@mxm Thank you very much for your suggestions. I have made the necessary modifications. I encountered an occasional scenario where deserializing v1 checkpointData with the v2 path succeeded but produced wrong values, causing another exception downstream, so I added validation of the contents of CompletedStatistics. I appreciate you taking the time out of your busy schedule to review it again. I am very grateful for your guidance.
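That extra validation step can be sketched as a simple plausibility predicate (hypothetical fields; the real CompletedStatistics content checks in the PR differ):

```java
// Hypothetical sketch: after a speculative v2 parse "succeeds", sanity-check the
// decoded header fields before trusting them; fall back to the v1 path otherwise.
public class StatisticsSanityCheck {

  // statisticsTypeOrdinal stands in for the StatisticsType enum ordinal.
  public static boolean looksValid(long checkpointId, int statisticsTypeOrdinal, int keyCount) {
    return checkpointId > 0
        && statisticsTypeOrdinal >= 0
        && statisticsTypeOrdinal < 2 // only two statistics types assumed here
        && keyCount >= 0;
  }
}
```

A v1 stream misread as v2 typically yields an out-of-range type ordinal or a negative count, so a check like this catches the "parsed but wrong" case before it propagates.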
@mxm @pvary @stevenzwu Thank you all very much for your guidance. I would appreciate it if you could take another look and let me know what needs to be done next to keep things moving forward.
…rsion,and add UT for the change (force-pushed from c412954 to fb226bb)
I think this looks good, but I'll defer to @stevenzwu and @pvary for the final approval.
@mxm Thank you very much for taking the time to review my code. The relevant comments have been added. @pvary @stevenzwu I would appreciate it if both of you could take a look for the final approval.
@stevenzwu: any last minute comments?
Thanks @Guosmilesmile for the PR, and @mxm and @stevenzwu for the review! @Guosmilesmile, please prepare the backport commits to the other Flink versions. Thanks, Peter
Thank you all for your reviews @mxm @stevenzwu @pvary. @pvary The related backport has been submitted, see #11745. Thank you!
…to Flink 1.18 and 1.19 (#11745)
When configuring the distribution mode to RANGE, if the partition field in the data contains null values, it will cause the SortKey serialization to fail, resulting in the job continuously restarting.
If we set the partition as bucket(5, name), and some data has a null value for name, then the DataStatisticsOperator will throw an error when serializing the statistics and sending them to the coordinator, in SortKeySerializer.serialize.
Error log:
We think that if a field is used as a partition field and a transformation function is applied to it, this scenario will inevitably occur whenever the field contains null values. We believe that during statistics collection we can ignore rows with null values in the relevant partition field. If such rows are few, they should not affect the job; if they are many, falling back to the default round-robin strategy for them would also be appropriate.
This PR mainly traverses the contents of the partition field and skips statistics collection for any parts that contain null values.
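The originally proposed workaround in this description (later replaced during review by null-aware serialization) can be sketched as a null-skip guard in front of the frequency map (names here are illustrative, not Iceberg's actual API):

```java
import java.util.*;

// Sketch of the "skip null keys" approach: sort keys with any null component
// are excluded from statistics collection so they never reach the serializer.
public class NullSkippingStats {
  private final Map<String, Long> keyFrequency = new HashMap<>();

  // Record a sort-key value unless any component is null.
  public void add(Object... keyFields) {
    for (Object f : keyFields) {
      if (f == null) {
        return; // skip: serializing this key would fail with an NPE
      }
    }
    keyFrequency.merge(Arrays.toString(keyFields), 1L, Long::sum);
  }

  public long count(Object... keyFields) {
    return keyFrequency.getOrDefault(Arrays.toString(keyFields), 0L);
  }
}
```

The trade-off, raised by the reviewers above, is that records with null keys lose the benefits of range distribution, which is why the merged fix serializes nulls instead.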