Skip to content

Conversation

@linliu-code
Copy link
Collaborator

Change Logs

As title.

Impact

Simplify existing code path.

Risk level (write none, low medium or high below)

Medium.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Jul 9, 2025
@linliu-code linliu-code force-pushed the rfc_91_3 branch 2 times, most recently from ae85c2b to c5464f7 Compare July 11, 2025 22:15
@linliu-code linliu-code changed the title [HUDI-8401][RFC-97] Support KeepValues submerge mode [HUDI-8401][RFC-97] Support KeepValues partial merge mode Jul 11, 2025
@linliu-code linliu-code marked this pull request as ready for review July 11, 2025 22:55
@danny0405 danny0405 changed the title [HUDI-8401][RFC-97] Support KeepValues partial merge mode [HUDI-8401] Support KeepValues partial merge mode Jul 16, 2025
this.readerContext = readerContext;
this.partialUpdateMode = partialUpdateMode;
this.mergeProperties = parseMergeProperties(props);
if (partialUpdateMode == PartialUpdateMode.KEEP_VALUES) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the PartialUpdateMode.KEEP_VALUES mode is kind of orthoganal to existing partial merging logic, is it possible we add another two partial update BufferedRecordMergers and also a new #partialMerge method here so make the logic more clear.

* @param mergedSchema The merged schema for the merged record.
* @return whether the Avro schema is partial compared to the merged schema.
*/
public static boolean isPartial(Schema schema, Schema mergedSchema) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be very costly, we need to ensure both of these two schemas are cached by AvroSchemaCache.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides cached by AvroSchemaCache, maybe we can also add some short-circuit logic, such as checking by comparing fields number firstly?

Copy link
Collaborator

@cshuo cshuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#13498 is landed, this pr can be rebased on master.

boolean enablePartialMerging) {
// Note that: When either newRecord or oldRecord is a delete record,
// skip partial update since delete records do not have meaningful columns.
if (partialUpdateMode == PartialUpdateMode.NONE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partialUpdateMode can not be PartialUpdateMode.NONE now, maybe we can add validating check for update mode in the constructor, and remove the check here.

* @param mergedSchema The merged schema for the merged record.
* @return whether the Avro schema is partial compared to the merged schema.
*/
public static boolean isPartial(Schema schema, Schema mergedSchema) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides cached by AvroSchemaCache, maybe we can also add some short-circuit logic, such as checking by comparing fields number firstly?

this.partialUpdateMode = partialUpdateMode;
this.mergeProperties = parseMergeProperties(props);
if (partialUpdateMode == PartialUpdateMode.KEEP_VALUES) {
partialMergingUtils = new PartialMergingUtils<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All methods in PartialMergingUtils can be static? Seems don't need to instantiate PartialMergingUtils object.

// The merged schema contains fields that only appear in either older and/or newer record.
Schema mergedSchema =
getCachedMergedSchema(oldSchema, newSchema, readerSchema);
Map<String, Integer> fieldNameToIdFromNewRecordSchema =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ids of fields are not used at all, we can use Schema.Field field = newSchema.getField(fieldName); to check existence instead, which is also a map#get(String) operation.

Schema schema2 = schemaPair.getLeft().getRight();
Schema refSchema = schemaPair.getRight();
Map<String, Integer> nameToIdMapping1 =
getCachedFieldNameToIdMapping(schema1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use schema1.getFields().stream().map(Schema.Field::name).collect(Collectors.toSet()); to get the fields name set instead? Since this operation is not at record-level, and only occurs during cache miss for MERGED_SCHEMA_CACHE.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems FIELD_NAME_TO_ID_MAPPING_CACHE can be removed then.

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants