Duplicate Row in Same Partition using Global Bloom Index #9536
@Raghvendradubey Can you share the table properties? Is it COW or MOR? I noticed you turned on the flag 'hoodie.bloom.index.update.partition.path'. Was the partition value updated for the duplicate record you are noticing? If yes, did you apply the same behaviour when you tried to reproduce?
@Raghvendradubey Also, I noticed the metadata table is enabled and you are using 0.10.0. You may also want to upgrade the Hudi version to 0.12.3 or 0.13.1.
@ad1happy2go It's COW. The partition value was not updated, because I was trying to update a record key within the same partition, and it resulted in 2 rows with the same record key in the same partition.
@Raghvendradubey So if I understood it correctly, you got this issue when it tried to update the partition path? That may be the root cause. Did you try the same thing when you tried to reproduce with small data?
@ad1happy2go "You got this issue when it tried to update partition path" - Yes.
FWIU, this is a sporadic issue that OP is not able to reproduce anymore. It might be related to this issue: #9035. One way to determine whether it is caused by that issue is:
@voonhous I saw this issue again in another dataset, which is on the default Bloom index, and again the same issue. I verified your steps: there are two files, but the _hoodie_commit_time is different for the duplicate record. I also do not find any issue in the error log file for that specific time, and all tasks executed successfully when the duplicate records were written.
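The verification described above (duplicate rows sharing a record key and partition but carrying different `_hoodie_commit_time` and `_hoodie_file_name` values) can be sketched in plain Python. The field names are Hudi's standard metadata columns; the sample values are invented for illustration:

```python
from collections import defaultdict

# Hudi metadata columns carried by every row; the values here are made up.
dup_rows = [
    {"_hoodie_record_key": "txn_123", "_hoodie_partition_path": "billing_date=2023-08-01",
     "_hoodie_commit_time": "20230801120000", "_hoodie_file_name": "file-a.parquet"},
    {"_hoodie_record_key": "txn_123", "_hoodie_partition_path": "billing_date=2023-08-01",
     "_hoodie_commit_time": "20230802120000", "_hoodie_file_name": "file-b.parquet"},
]

def describe_duplicates(rows):
    """Group rows by (record key, partition path) and report groups with more than one row."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["_hoodie_record_key"], r["_hoodie_partition_path"])].append(r)
    report = {}
    for key, rs in groups.items():
        if len(rs) > 1:
            report[key] = {
                "commit_times": sorted({r["_hoodie_commit_time"] for r in rs}),
                "files": sorted({r["_hoodie_file_name"] for r in rs}),
            }
    return report

print(describe_duplicates(dup_rows))
```

If the two duplicates land in different files with different commit times, as reported here, the second write did not see the first copy during index lookup.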
@Raghvendradubey Did you also not see any task failures in the Spark UI, as pointed out by @voonhous?
@ad1happy2go No failed tasks. I verified all tasks for all the stages; nothing failed or was reattempted.
This is the metadata field of the duplicate record -
Can somebody help here to identify the issue?
@Raghvendradubey I worked on this but also couldn't reproduce it on my end. I am trying with a bigger dataset. It's difficult to identify, as the code is not failing and we are also not seeing any task failures/reattempts.
@Raghvendradubey This was a bug which got fixed with this PR - https://issues.apache.org/jira/browse/HUDI-6946. Please try with 0.14.1 and let us know in case you still face the issue. Thanks.
@Raghvendradubey Did you get a chance to try this one? Do you still see the issue?
Hey @Raghvendradubey: any follow-ups on this?
Hi @ad1happy2go @nsivabalan, after migrating to the new Hudi version 0.14.0 I didn't face this issue again. Thanks for your support.
Great! Thanks @Raghvendradubey. Closing this issue.
Hi Team,
I am facing an issue of duplicate record keys while upserting data into Hudi on EMR.
Hudi Jar -
hudi-spark3.1.2-bundle_2.12-0.10.1.jar
EMR Version -
emr-6.5.0
Workflow -
files on S3 -> EMR(hudi) -> Hudi Tables(S3)
Schedule - once in a day
Insert Data Size -
5 to 10 MB per batch
Hudi Configuration for Upsert -
hudi_options = {
    'hoodie.table.name': 'txn_table',
    'hoodie.datasource.write.recordkey.field': 'transaction_id',
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.table.name': 'txn_table',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'transaction_id',
    'hoodie.index.type': 'GLOBAL_BLOOM',
    'hoodie.bloom.index.update.partition.path': 'true',
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.datasource.hive_sync.database': 'dwh',
    'hoodie.datasource.hive_sync.table': 'txn_table',
    'hoodie.datasource.hive_sync.partition_fields': 'billing_date',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.metadata.enable': 'true'
}
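For context, this is a minimal sketch of how an options dict like the one above is typically passed to a Spark upsert write. The SparkSession, DataFrame, and S3 base path are assumed and not shown; only a few of the options are repeated here:

```python
# A subset of the table options shown above, repeated for a self-contained sketch.
hudi_options = {
    'hoodie.table.name': 'txn_table',
    'hoodie.datasource.write.recordkey.field': 'transaction_id',
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.index.type': 'GLOBAL_BLOOM',
}

def write_upsert(df, base_path, options):
    """Upsert a Spark DataFrame into a Hudi table at base_path.

    Requires a SparkSession with the Hudi bundle on the classpath; df and
    base_path are assumptions for this sketch.
    """
    (df.write
       .format('hudi')
       .options(**options)
       .mode('append')   # 'append' with operation=upsert is the usual pattern
       .save(base_path))
```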
Issue Occurrence -
Our job has been running in production for around a month, but this issue has been seen for the first time.
Even when I tried to reproduce the issue with the same dataset, it was not reproducible; the records updated successfully.
Issue Steps -
1 - There is a batch of data that we first insert into txn_table; it has an id that is unique throughout the partition, i.e. transaction_id (defined as the record key).
2 - The next day, on an update of the record key, a new row is created with the same record key in the same partition, with the updated value.
3 - Both duplicate rows can be read, but when I try to update, it updates only the latest row.
4 - On checking the parquet files, the duplicate record with the updated value was present in a different file in the same partition.
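The behaviour expected in steps 1-4 can be illustrated with a toy in-memory upsert keyed on transaction_id. This is a sketch only; real Hudi upserts go through an index lookup and file rewriting, and the sample rows are invented:

```python
def upsert(table, rows, key="transaction_id"):
    """Toy upsert: a row with an existing key replaces the old row, else it is inserted."""
    by_key = {r[key]: r for r in table}
    for r in rows:
        by_key[r[key]] = r   # the day-2 update must overwrite, not duplicate
    return list(by_key.values())

day1 = [{"transaction_id": "txn_123", "billing_date": "2023-08-01", "amount": 10}]
day2 = [{"transaction_id": "txn_123", "billing_date": "2023-08-01", "amount": 12}]

table = upsert([], day1)
table = upsert(table, day2)
assert len(table) == 1 and table[0]["amount"] == 12  # one row, updated value
```

The bug reported here is the opposite outcome: after the day-2 upsert, two rows with the same key remain in the same partition.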
Steps to Reproduce -
The issue is not reproducible; even when the same dataset was ingested again with the same configuration, the upsert was fine.
Please let me know if I am missing some configuration.
Thanks
Raghvendra