Duplicate Row in Same Partition using Global Bloom Index #9536
@Raghvendradubey Can you share the table properties? Is it COW or MOR? I noticed you turned on the flag 'hoodie.bloom.index.update.partition.path'. Was the partition value updated for the duplicate record you are noticing? If yes, did you apply the same behaviour when you tried to reproduce?
@Raghvendradubey Also, I noticed the metadata table is enabled and you are using 0.10.0. You may also want to upgrade the Hudi version to 0.12.3 or 0.13.1.
@ad1happy2go It's COW. The partition value was not updated, because I was trying to update a record key within the same partition, and it resulted in 2 rows with the same record key in the same partition.
@Raghvendradubey So if I understood it correctly, you got this issue when it tried to update the partition path? That may be the root cause. Did you try the same thing when you tried to reproduce with small data?
@ad1happy2go "You got this issue when it tried to update partition path" - Yes.
FWIU, this is a sporadic issue that OP is not able to reproduce anymore. It might be related to this issue: #9035. One way to determine whether it is caused by that issue is:
@voonhous I saw this issue again in another dataset, which is on the default Bloom index, and again the same issue. I verified your steps: there are two files, but the _hoodie_commit_time is different for the duplicate record. I also do not find any issue in the error log file for that specific time, and all tasks executed successfully when the duplicate records were written.
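The verification described above (duplicate rows sharing a record key and partition but carrying different `_hoodie_commit_time` and `_hoodie_file_name` values) can be sketched in plain Python. The field names are Hudi's standard metadata columns; the sample values are invented for illustration:

```python
from collections import defaultdict

# Hudi metadata columns carried by every row; the values here are made up.
dup_rows = [
    {"_hoodie_record_key": "txn_123", "_hoodie_partition_path": "billing_date=2023-08-01",
     "_hoodie_commit_time": "20230801120000", "_hoodie_file_name": "file-a.parquet"},
    {"_hoodie_record_key": "txn_123", "_hoodie_partition_path": "billing_date=2023-08-01",
     "_hoodie_commit_time": "20230802120000", "_hoodie_file_name": "file-b.parquet"},
]

def describe_duplicates(rows):
    """Group rows by (record key, partition path) and report groups with more than one row."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["_hoodie_record_key"], r["_hoodie_partition_path"])].append(r)
    report = {}
    for key, rs in groups.items():
        if len(rs) > 1:
            report[key] = {
                "commit_times": sorted({r["_hoodie_commit_time"] for r in rs}),
                "files": sorted({r["_hoodie_file_name"] for r in rs}),
            }
    return report

print(describe_duplicates(dup_rows))
```

If the two duplicates land in different files with different commit times, as reported here, the second write did not see the first copy during index lookup.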
@Raghvendradubey Did you also not see any task failures in the Spark UI, as pointed out by @voonhous?
@ad1happy2go No failed tasks. I verified all tasks for all the stages; nothing failed or was reattempted.
This is the metadata field of the duplicate record -
Can somebody help here to identify the issue?
@Raghvendradubey I worked on this but also couldn't reproduce it on my end. I am trying with a bigger dataset. It's difficult to identify, as the code is not failing and we are also not seeing any task failures/reattempts.
@Raghvendradubey This was a bug which got fixed with this PR - https://issues.apache.org/jira/browse/HUDI-6946. Please try with 0.14.1 and let us know in case you still face the issue. Thanks.
@Raghvendradubey Did you get a chance to try this one? Do you still see the issue?
Hey @Raghvendradubey: any follow-ups on this?
Hi @ad1happy2go @nsivabalan, after migrating to the new Hudi version 0.14.0 I didn't face this issue again. Thanks for your support.
Great! Thanks @Raghvendradubey. Closing this issue.
Hi Team,
I am facing an issue of duplicate record keys while upserting data into Hudi on EMR.
Hudi Jar -
hudi-spark3.1.2-bundle_2.12-0.10.1.jar
EMR Version -
emr-6.5.0
Workflow -
files on S3 -> EMR(hudi) -> Hudi Tables(S3)
Schedule - once in a day
Insert Data Size -
5 to 10 MB per batch
Hudi Configuration for Upsert -
hudi_options = {
    'hoodie.table.name': 'txn_table',
    'hoodie.datasource.write.recordkey.field': 'transaction_id',
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.table.name': 'txn_table',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'transaction_id',
    'hoodie.index.type': 'GLOBAL_BLOOM',
    'hoodie.bloom.index.update.partition.path': 'true',
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.datasource.hive_sync.database': 'dwh',
    'hoodie.datasource.hive_sync.table': 'txn_table',
    'hoodie.datasource.hive_sync.partition_fields': 'billing_date',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.metadata.enable': 'true'
}
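For context, this is a minimal sketch of how an options dict like the one above is typically passed to a Spark upsert write. The SparkSession, DataFrame, and S3 base path are assumed and not shown; only a few of the options are repeated here:

```python
# A subset of the table options shown above, repeated for a self-contained sketch.
hudi_options = {
    'hoodie.table.name': 'txn_table',
    'hoodie.datasource.write.recordkey.field': 'transaction_id',
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.index.type': 'GLOBAL_BLOOM',
}

def write_upsert(df, base_path, options):
    """Upsert a Spark DataFrame into a Hudi table at base_path.

    Requires a SparkSession with the Hudi bundle on the classpath; df and
    base_path are assumptions for this sketch.
    """
    (df.write
       .format('hudi')
       .options(**options)
       .mode('append')   # 'append' with operation=upsert is the usual pattern
       .save(base_path))
```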
Issue Occurrence -
Our job has been running in production for around a month, but this issue has been seen for the first time.
Even when I tried to reproduce the issue with the same dataset, it was not reproducible; the records updated successfully.
Issue Steps -
1 - There is a batch of data that we first insert into txn_table; it has an id that is unique throughout the partition, i.e. transaction_id (defined as the record key).
2 - The next day, on an update of the record key, a new row is created with the same record key in the same partition, with the updated value.
3 - Both duplicate rows can be read, but when I try to update, it updates only the latest row.
4 - On checking the parquet files, the duplicate record with the updated value was present in a different file in the same partition.
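The behaviour expected in steps 1-4 can be illustrated with a toy in-memory upsert keyed on transaction_id. This is a sketch only; real Hudi upserts go through an index lookup and file rewriting, and the sample rows are invented:

```python
def upsert(table, rows, key="transaction_id"):
    """Toy upsert: a row with an existing key replaces the old row, else it is inserted."""
    by_key = {r[key]: r for r in table}
    for r in rows:
        by_key[r[key]] = r   # the day-2 update must overwrite, not duplicate
    return list(by_key.values())

day1 = [{"transaction_id": "txn_123", "billing_date": "2023-08-01", "amount": 10}]
day2 = [{"transaction_id": "txn_123", "billing_date": "2023-08-01", "amount": 12}]

table = upsert([], day1)
table = upsert(table, day2)
assert len(table) == 1 and table[0]["amount"] == 12  # one row, updated value
```

The bug reported here is the opposite outcome: after the day-2 upsert, two rows with the same key remain in the same partition.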
Steps to Reproduce -
The issue is not reproducible; even when the same dataset was ingested again with the same configuration, the upsert was fine.
Please let me know if I am missing some configuration.
Thanks
Raghvendra