
Duplicate Row in Same Partition using Global Bloom Index #9536

Closed
Raghvendradubey opened this issue Aug 25, 2023 · 17 comments
Labels
data-consistency · data-duplication · priority:critical

Comments

@Raghvendradubey

Hi Team,

I am facing an issue of duplicate record keys while upserting data into Hudi on EMR.

Hudi Jar -
hudi-spark3.1.2-bundle_2.12-0.10.1.jar

EMR Version -
emr-6.5.0

Workflow -
files on S3 -> EMR(hudi) -> Hudi Tables(S3)

Schedule - once in a day

Insert Data Size -
5 to 10 MB per batch

Hudi Configuration for Upsert -

hudi_options = {
    'hoodie.table.name': "txn_table",
    'hoodie.datasource.write.recordkey.field': "transaction_id",
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.table.name': "txn_table",
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'transaction_id',
    'hoodie.index.type': "GLOBAL_BLOOM",
    'hoodie.bloom.index.update.partition.path': "true",
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.datasource.hive_sync.database': "dwh",
    'hoodie.datasource.hive_sync.table': "txn_table",
    'hoodie.datasource.hive_sync.partition_fields': "billing_date",
    'hoodie.datasource.write.hive_style_partitioning': "true",
    'hoodie.datasource.hive_sync.enable': "true",
    'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': "true",
    'hoodie.datasource.hive_sync.support_timestamp': "true",
    'hoodie.metadata.enable': "true"
}
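
For context, a minimal sketch of how the daily upsert is presumably invoked with these options; the DataFrame name and S3 paths below are placeholders, not taken from the actual job:

# Minimal sketch of the write call, assuming a PySpark job on EMR.
# `incremental_df` and the S3 paths are placeholders, not from this issue.
incremental_df = spark.read.parquet("s3://<bucket>/incoming/txn/")   # the 5-10 MB daily batch

(incremental_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")            # upserts use append mode; the operation comes from hudi_options
    .save("s3://<bucket>/warehouse/txn_table/"))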

Issue Occurrence -
Our job has been running in production for around a month, but this issue has been seen for the first time.
Even when I tried to reproduce it with the same dataset, it was not reproducible; the records updated successfully.

Issue Steps -

1 - There is a batch of data for which we first do an insert into txn_table; transaction_id (defined as the record key) is unique throughout the partition.
2 - The next day, on an update of that record key, a new row is created with the same record key in the same partition, carrying the updated value.
3 - Both duplicate rows are readable, but when I try to update again, only the latest row gets updated.
4 - On checking the parquet files, the duplicate record with the updated value was present in a different file in the same partition (a query to confirm such duplicates is sketched below this list).
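
For reference (not from the original report), a hedged sketch of a query to confirm duplicates of this kind; the table base path is a placeholder:

# Hedged sketch: list record keys that appear more than once within a single partition.
# The base path is a placeholder.
df = spark.read.format("hudi").load("s3://<bucket>/warehouse/txn_table/")
(df.groupBy("_hoodie_record_key", "_hoodie_partition_path")
   .count()
   .filter("count > 1")
   .show(truncate=False))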

Steps to Reproduce -

The issue is not reproducible; when the same dataset was ingested again with the same configuration, the upsert was fine.

Please let me know if I am missing some configuration.

Thanks
Raghvendra

@ad1happy2go
Collaborator

@Raghvendradubey Can you share the table properties? Is it COW or MOR? I noticed you turned on the flag 'hoodie.bloom.index.update.partition.path'. Did the partition value get updated for the duplicate record you are noticing? If yes, did you apply the same behaviour when you tried to reproduce it?
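
One way to check the table type is to read hoodie.properties under the table's .hoodie directory (a hedged sketch; the base path is a placeholder):

# Hedged sketch: print the table properties to confirm COW vs MOR.
# Look for hoodie.table.type=COPY_ON_WRITE or MERGE_ON_READ in the output.
props = (spark.sparkContext
         .wholeTextFiles("s3://<bucket>/warehouse/txn_table/.hoodie/hoodie.properties")
         .values()
         .first())
print(props)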

@codope added the data-consistency and data-duplication labels on Aug 25, 2023
github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support on Aug 25, 2023
@ad1happy2go
Collaborator

@Raghvendradubey Also, I noticed the metadata table is enabled and you are using 0.10.1. You may want to upgrade the Hudi version to 0.12.3 or 0.13.1.

@Raghvendradubey
Author

@ad1happy2go It's COW. The partition value was not updated, because I was trying to update a record key in the same partition, and it resulted in 2 rows with that record key in the same partition.

@ad1happy2go
Collaborator

@Raghvendradubey So, if I understood it correctly, you got this issue when it tried to update the partition path? That may be the root cause. Did you try the same thing when you tried to reproduce with the small dataset?

@Raghvendradubey
Author

@ad1happy2go "You got this issue when it tried to update partition path" - Yes,
"Did you tried the similar thing when you tried to reproduce with small data" yes when the same thing tried to reproduce the issue with same source data then it worked fine.
It's been around a month with this hudi configuration but this issue has been seen first time. rest of the day it worked fine.

@codope moved this from ⏳ Awaiting Triage to 🚧 Needs Repro in Hudi Issue Support on Aug 29, 2023
@voonhous
Member

FWIU, this is a sporadic thing that OP is not able to reproduce anymore.

Might be related to this issue: #9035

One way to determine if it is caused by this issue is:

  1. Identify the 2 parquet files that the 2 records are situated in (a query for this is sketched below this list).
  2. If it is caused by the issue linked above, the commit time should be the same (assuming a COW table).
  3. If it is this issue and you are still able to access your Spark tracking URL, you can probably look at the timing of the stages and see if there's a zombie executor/task that was not killed after reconcileAgainstMarker was called.
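
A hedged sketch of steps 1-2 above; the duplicate key value and table path are placeholders:

# Hedged sketch: pull the file name and commit time for each copy of the duplicate key,
# then compare the two _hoodie_commit_time values. The key value and path are placeholders.
dup_key = "transaction_id:<duplicate-value>"
(spark.read.format("hudi").load("s3://<bucket>/warehouse/txn_table/")
    .filter(f"_hoodie_record_key = '{dup_key}'")
    .select("_hoodie_commit_time", "_hoodie_file_name", "_hoodie_partition_path")
    .show(truncate=False))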

@Raghvendradubey
Author

@voonhous I saw this issue again in another dataset, which uses the default Bloom index, and it's the same issue. I verified your steps:

There are two files, but the _hoodie_commit_time is different for the duplicate record. Also, I do not find any issue in the error log file for that specific time, and all the tasks executed successfully when the duplicate records were written.

@ad1happy2go
Collaborator

@Raghvendradubey So you did not see any task failures in the Spark UI either, as pointed out by @voonhous?

@Raghvendradubey
Author

@ad1happy2go No failed tasks; I verified all tasks for all the stages, and nothing failed or was reattempted.

@Raghvendradubey
Author

These are the metadata fields of the duplicate record -

Record 1:
  _hoodie_commit_time:    20230905093840399
  _hoodie_commit_seqno:   20230905093840399_288_11214
  _hoodie_record_key:     nomupay_transaction_id:NP-b62e25f04e29205777612835243
  _hoodie_partition_path: processor_name=PLANET/oas_stamp=2023-08-01 19:40:00.0
  _hoodie_file_name:      6a77386e-4a50-4648-916d-568d72f349e1-0_288-60-2990_20230905093840399.parquet

Record 2:
  _hoodie_commit_time:    20230801210728594
  _hoodie_commit_seqno:   20230801210728594_0_5
  _hoodie_record_key:     nomupay_transaction_id:NP-b62e25f04e29205777612835243
  _hoodie_partition_path: processor_name=PLANET/oas_stamp=2023-08-01 19:40:00.0
  _hoodie_file_name:      55d3f136-4c9e-47c2-8797-5c7bc0d0163a-0_0-33-1641_20230801210728594.parquet

@Raghvendradubey
Author

Can somebody help here to identify the issue?

@ad1happy2go
Collaborator

@Raghvendradubey I worked on this but also couldn't reproduce it on my end. I am trying with a bigger dataset. It's difficult to identify since the code is not failing and we are also not seeing any task failures/reattempts.
Will update you soon. Thanks.

@codope added the priority:critical label on Sep 18, 2023
@ad1happy2go
Collaborator

@Raghvendradubey This was a bug which has been fixed; see https://issues.apache.org/jira/browse/HUDI-6946

Please try with 0.14.1 and let us know in case you still face the issue. Thanks.
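
For anyone following along, a hedged sketch of pinning the newer bundle when building the Spark session; the Maven coordinate is an assumption for a Spark 3.1 / Scala 2.12 build and should be verified against your EMR release:

# Hedged sketch: start the job against Hudi 0.14.1 instead of the 0.10.1 bundle jar.
# The bundle coordinate is an assumption; adjust for your Spark/Scala versions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("txn_table_upsert")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.1-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())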

@codope moved this from 🚧 Needs Repro to 👤 User Action in Hudi Issue Support on Jan 17, 2024
@ad1happy2go
Collaborator

ad1happy2go commented Jan 31, 2024

@Raghvendradubey Did you get a chance to try this? Do you still see this issue?

@nsivabalan
Contributor

Hey @Raghvendradubey: any follow-ups on this?

@Raghvendradubey
Author

Hi @ad1happy2go @nsivabalan, after migrating to the new Hudi version 0.14.0 I didn't face this issue again. Thanks for your support.

@ad1happy2go
Collaborator

Great! Thanks @Raghvendradubey. Closing this issue.
