[HUDI-4699] RFC for auto record key generation #10365
Conversation
## Abstract

One of the prerequisites to create an Apache Hudi table is to configure record keys (a.k.a. primary keys). Since Hudi’s origin at Uber revolved around supporting mutable workloads at large scale, these were deemed mandatory. As we started
Instead of mentioning Uber, maybe we can talk about how, while primary keys bring a lot of benefits, they also mean that users need to model their data, which adds some amount of cognitive overhead. We can also point to Oracle, where a row id gets generated automatically irrespective of whether we define a primary key right upfront.
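For context on the prerequisite the abstract refers to, here is a minimal sketch (not part of the RFC) of how a record key must be configured explicitly today through the Spark datasource; the column names, table name, and path are assumed for illustration only.

```scala
// Illustrative sketch: writing a Hudi table with an explicitly configured record key.
// Column names ("uuid", "ts"), table name, and path are assumptions for this example.
import org.apache.spark.sql.{SaveMode, SparkSession}

object ExplicitRecordKeyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explicit-record-key")
      // Hudi's Spark writer expects Kryo serialization
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("id-1", "alice", 1L), ("id-2", "bob", 2L)).toDF("uuid", "name", "ts")

    df.write.format("hudi")
      .option("hoodie.table.name", "example_tbl")
      // The record key field is mandatory today; the RFC proposes making it optional.
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Overwrite)
      .save("/tmp/hudi/example_tbl")
  }
}
```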
- What impact (if any) will there be on existing users?
- If we are changing behavior, how will we phase out the older behavior?
- If we need special migration tools, describe them here.
- When will we remove the existing behavior?

## Test Plan

Describe in a few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?
Let's briefly mention these. If users upgrade or downgrade, will it all be handled out of the box?
Combining them in a single string key as below:

"${commit_timestamp}_${batch_row_id}"

For row-id generation we plan to use a combination of the Spark partition id and a row id (sequential id generation) to generate a unique identity value for every row within a batch (this particular component is available in Spark out-of-the-box, but could be easily implemented for any parallel execution framework like Flink, etc.)
Do we need to mention the equivalent in Flink or other engines?
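To make the quoted scheme concrete, a small sketch is shown below. It assumes Spark's built-in `monotonically_increasing_id()`, which already encodes the partition id plus a per-partition offset, can stand in for the "Spark partition id + sequential row id" combination; the commit timestamp literal is made up for illustration.

```scala
// Illustrative sketch only: derive "${commit_timestamp}_${batch_row_id}" keys for one batch.
// monotonically_increasing_id() encodes (partition id, per-partition offset), matching the
// partition-id + sequential-row-id combination described in the quoted text above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat_ws, lit, monotonically_increasing_id}

object AutoRecordKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("auto-record-key-sketch")
      .getOrCreate()
    import spark.implicits._

    val commitTimestamp = "20231218120000000" // assumed commit time, illustrative only

    val batch = Seq("alice", "bob", "carol").toDF("name")

    // Unique within the batch; prefixing with the commit time makes it unique across commits.
    val keyed = batch.withColumn(
      "auto_record_key",
      concat_ws("_", lit(commitTimestamp), monotonically_increasing_id())
    )
    keyed.show(false)
  }
}
```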
Change Logs
Adding RFC for auto record key generation.
Impact
Will unblock more use-cases where users do not need to set any record keys.
Risk level (write none, low, medium or high below)
low.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. Attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist