-
Notifications
You must be signed in to change notification settings - Fork 2.5k
fix: [HUDI-8401] Unifying Partial Update modes and Merge Into Partial Update Encoding #17604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…and Table property for PartialUpdateMode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we done any performance benchmarks as I remember there's concern around the performance impact?
| * Class to assist with merging two versions of the record that may contain partial updates using | ||
| * {@link org.apache.hudi.common.table.PartialUpdateMode#KEEP_VALUES} mode. | ||
| */ | ||
| public class KeepValuesPartialMergingUtils<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Utils typically don't have any state. Let's name this something like PartialMergerWithKeepValues
| private static final Map<HoodieSchema, Map<String, Integer>> | ||
| FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); | ||
| private static final Map<Pair<Pair<HoodieSchema, HoodieSchema>, HoodieSchema>, HoodieSchema> | ||
| MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cache can just grow over the life of the application. I think we can create a single instance for each BufferedRecordMerger and then make this an instance variable
| Object[] fieldVals = new Object[fields.size()]; | ||
| int idx = 0; | ||
| List<HoodieSchemaField> mergedSchemaFields = mergedSchema.getFields(); | ||
| for (HoodieSchemaField mergedSchemaField : mergedSchemaFields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expected behavior for nested fields?
Describe the issue this Pull Request addresses
This PR unifies Partial Update Modes defined via table properties with partial update encoding used by Spark SQL
MERGE INTO, and cleans up inconsistencies in how partial updates are handled across the write path.Specifically, prior to this change:
This work was originally based on PR #13540 from Lin, with additional fixes and refinements applied.
Summary and Changelog
This PR introduces a revised and unified design for partial update handling across all record formats and write paths.
Key changes:
BufferedRecordMergerFactory.Revised PartialUpdateMode design:
Possible values for table property
hoodie.table.partial.update.mode:IGNORE_DEFAULTSFILL_UNAVAILABLEKEEP_VALUESThere is no default value for
PartialUpdateMode. For tables without partial update requirements, this property may be absent.Note on
KEEP_VALUES:MERGE INTO.BufferedRecordMergerfor merging, this mode is expected to be set accordingly.BufferedRecordMergerFactory changes:
BufferedRecordMergerFactory.create()now accepts:enablePartialEncoding(boolean): indicates whether partial (vs full) record merging is used.Option<PartialUpdateMode>: defines the merge semantics when partial encoding is enabled.If
enablePartialEncodingistrue, the providedPartialUpdateModewill be honored.Interaction between table property and Merge Into encoding:
Pending:
Impact
Cleaning up PartialUpdateMode feature and unifying MergeInto partial update encoding along with other PartialUpdateModes from table config.
Risk Level
medium
Documentation Update
The config description must be updated if new configs are added or the default value of the configs are changed.
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist