
[fix][broker] Geo Replication lost messages or frequently fails due to Deduplication is not appropriate for Geo-Replication #23697

Merged
merged 22 commits into apache:master from poorbarcode:fix/repl_sequence_id
Feb 21, 2025

Conversation

poorbarcode
Contributor

@poorbarcode poorbarcode commented Dec 9, 2024

Motivation

Background

How does deduplication work?

  • The producer sends messages and records, in memory, the sequence IDs that are awaiting a response; this collection is {pendingMessages}.
  • The broker responds with the invalid position -1:-1 if the published sequence ID is not higher than that of the previous messages.
  • On receiving such a rejection, the producer checks whether the next sequence ID at the head of {pendingMessages} is larger than the rejected one (see the sketch below):
    • {next} > {rejected}: ignore the error and continue working.
    • {next} < {rejected}: close the channel and reconnect.
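
A minimal Java sketch of that client-side check, assuming a simplified pending queue; the class and field names below (SequenceCheckSketch, OpSendMsg, pendingMessages) are illustrative stand-ins, not the actual ProducerImpl code:

```java
// A minimal sketch of the client-side check described above.
// Illustrative only; names are simplified stand-ins, not ProducerImpl.
import java.util.ArrayDeque;
import java.util.Deque;

class SequenceCheckSketch {

    static final class OpSendMsg {
        final long sequenceId;
        OpSendMsg(long sequenceId) { this.sequenceId = sequenceId; }
    }

    private final Deque<OpSendMsg> pendingMessages = new ArrayDeque<>();

    /** Called when the broker answers a publish with the invalid position -1:-1. */
    boolean canIgnoreRejection(long rejectedSequenceId) {
        OpSendMsg next = pendingMessages.peek();
        if (next == null || next.sequenceId > rejectedSequenceId) {
            // The rejected message is treated as an old duplicate: drop the error and go on.
            return true;
        }
        // The head of the queue is not newer than the rejected sequence ID:
        // the producer closes the channel and reconnects.
        return false;
    }
}
```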

Conditions under which the issue happens

  • A replicator sends every message of a topic through a single replication producer, regardless of which original producer published it.
    • (Highlight) it mixes messages into {pendingMessages} keyed only by message.sequenceId and ignores message.original-producer-name, so the sequence IDs in {pendingMessages} may not be increasing.
  • As a result, the producer-side sequence-ID check performed after receiving a -1:-1 send response can fail.

Issue-1: lost messages

  • Publishing in the source cluster
    • Producer-1 sends 2 messages: M1(seq: 0), M2(seq: 1)
    • Producer-2 sends 2 messages: M3(seq: 0), M4(seq: 1)
  • Replicating the messages to the remote cluster
    • The replicator copies M1 and M2.
      • {pendingMessages}: [0, 1]
    • The replicator repeatedly copies M1 and M2, and then copies M3 and M4.
      • {pendingMessages}: [0, 1, 0, 1]
      • After repeatedly copying M1 and M2, the network broke.
      • Send receipts received:
        • seq 0, position 0:0
        • seq 1, position 0:1
        • seq 0, position -1:-1
        • seq 1, position -1:-1
      • {pendingMessages}: [empty]
  • After a topic unloading:
    • The replicator starts after the topic is loaded up (the backlog is 0 now).
    • The client creates a new connection.
  • Result:
    • Deduplication disabled: copied [M1, M2, M1, M2]
    • Deduplication enabled: copied [M1, M2]

You can reproduce the issue with the test testDeduplicationNotLostMessage.
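
To make the loss mechanism concrete: broker-side deduplication tracks the highest persisted sequence ID per producer name, and a replicator publishes everything under a single producer name, so the sequence IDs of different original producers collide. The following Java sketch is illustrative only; it is not the actual MessageDeduplication implementation, and the producer name used below is an assumption:

```java
// Simplified sketch of per-producer-name deduplication on the broker.
// Illustrative only; not the real MessageDeduplication class.
import java.util.HashMap;
import java.util.Map;

class BrokerDedupSketch {

    // highest sequence ID persisted for each producer name
    private final Map<String, Long> highestSequencePersisted = new HashMap<>();

    /** Returns true if the message is persisted, false if it is rejected with -1:-1. */
    boolean shouldPersist(String producerName, long sequenceId) {
        Long highest = highestSequencePersisted.get(producerName);
        if (highest != null && sequenceId <= highest) {
            return false; // treated as a duplicate
        }
        highestSequencePersisted.put(producerName, sequenceId);
        return true;
    }
}

// With a single replication-producer name (assumed here to be "repl-cluster-a"):
//   shouldPersist("repl-cluster-a", 0) -> true    (M1)
//   shouldPersist("repl-cluster-a", 1) -> true    (M2)
//   shouldPersist("repl-cluster-a", 0) -> false   (M3, from another original producer, dropped)
//   shouldPersist("repl-cluster-a", 1) -> false   (M4, dropped)
```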


Issue-2: frequent failures

  • Original-producer-2: sent msg 3:0 with sequence-id 10
  • Original-producer-1: sent msg 3:1 with sequence-id 1
  • Original-producer-1: sent msg 3:2 with sequence-id 2
  • The replicator copies the messages
    • {pendingMessages}: [10, 1, 2]
  • 3:0 is sent successfully
    • {pendingMessages}: [1, 2]
  • 3:0 is resent (a duplicated publishing)
    • The latest send-response is -1:-1 (the new position relates to the latest publishing).
    • failed-sequence-id 10 > pendingMessages[0].sequenceId 1, so the check fails.
    • The replicator producer reconnects and resets the cursor, which leads to more duplicated publishing.
    • Loop back to step 1.

No test for reproducing this issue.

Modifications

Solution: replicators use a specified sequence ID (the ledgerId:entryId of the message in the original topic) instead of the original producers' sequence IDs; see the sketch after the following list.

  • Original-producer-2: sent msg 3:0 with sequence-id 10
  • Original-producer-1: sent msg 3:1 with sequence-id 1
  • Original-producer-1: sent msg 3:2 with sequence-id 2
  • The replicator copies the messages with specified sequence IDs (3:0, 3:1, 3:2)
    • {pendingMessages}: [3:0, 3:1, 3:2]
  • 3:0 is sent successfully
    • {pendingMessages}: [3:1, 3:2]
  • 3:0 is resent (a duplicated publishing)
    • The latest send-response is -1:-1 (the new position relates to the latest publishing).
    • The error is ignored since failed-sequence-id (3:0) < pendingMessages[0].sequenceId (3:1).
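
A hedged sketch of the idea, using the public TypedMessageBuilder#sequenceId API: the sequence ID sent to the remote cluster is derived from the original ledgerId:entryId, so it is strictly increasing in the original topic's read order. The bit layout and helper names below are assumptions for illustration, not the exact encoding used by this PR:

```java
// Illustrative sketch of deriving a monotonic sequence ID from the original
// position (ledgerId:entryId). Not the PR's actual implementation.
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.TypedMessageBuilder;

final class ReplicatorSequenceIdSketch {

    // Pack the original position into a single long; assumes entryId fits into 28 bits.
    static long toSequenceId(long ledgerId, long entryId) {
        return (ledgerId << 28) | entryId;
    }

    static void replicate(Producer<byte[]> remoteProducer,
                          long ledgerId, long entryId, byte[] payload) {
        TypedMessageBuilder<byte[]> msg = remoteProducer.newMessage()
                .value(payload)
                // the remote broker now deduplicates on a value that only grows
                // as the replicator reads forward through the original topic
                .sequenceId(toSequenceId(ledgerId, entryId));
        msg.sendAsync();
    }
}
```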

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Dec 9, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 9, 2024
@poorbarcode poorbarcode changed the title [draft][fix][broker] GEO replication fails due to deduplication is not aprropriate for Geo-Replication [draft][fix][broker] Geo Replication fails due to Deduplication is not aprropriate for Geo-Replication Dec 9, 2024
@poorbarcode poorbarcode force-pushed the fix/repl_sequence_id branch 3 times, most recently from b9ae34a to 0d2e235 Compare December 17, 2024 11:56
@poorbarcode poorbarcode changed the title [draft][fix][broker] Geo Replication fails due to Deduplication is not aprropriate for Geo-Replication [fix][broker] Geo Replication fails due to Deduplication is not aprropriate for Geo-Replication Dec 17, 2024
@poorbarcode poorbarcode changed the title [fix][broker] Geo Replication fails due to Deduplication is not aprropriate for Geo-Replication [fix][broker] Geo Replication lost messages or frequently fails due to Deduplication is not aprropriate for Geo-Replication Dec 17, 2024
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

@poorbarcode poorbarcode changed the title [fix][broker] Geo Replication lost messages or frequently fails due to Deduplication is not aprropriate for Geo-Replication [fix] [broker] Geo Replication lost messages or frequently fails due to Deduplication is not aprropriate for Geo-Replication Dec 18, 2024
@codecov-commenter

codecov-commenter commented Dec 23, 2024

Codecov Report

Attention: Patch coverage is 31.75676% with 202 lines in your changes missing coverage. Please review.

Project coverage is 39.82%. Comparing base (bbc6224) to head (7c7dd34).
Report is 924 commits behind head on master.

Files with missing lines Patch % Lines
...roker/service/persistent/MessageDeduplication.java 15.17% 85 Missing and 10 partials ⚠️
...pulsar/client/impl/GeoReplicationProducerImpl.java 35.18% 52 Missing and 18 partials ⚠️
...ava/org/apache/pulsar/broker/service/Producer.java 56.25% 9 Missing and 5 partials ⚠️
.../java/org/apache/pulsar/client/impl/ClientCnx.java 33.33% 6 Missing and 2 partials ⚠️
...va/org/apache/pulsar/client/impl/ProducerImpl.java 63.63% 4 Missing ⚠️
...ava/org/apache/pulsar/common/protocol/Markers.java 0.00% 4 Missing ⚠️
...org/apache/pulsar/broker/service/TransportCnx.java 0.00% 0 Missing and 2 partials ⚠️
...er/service/persistent/GeoPersistentReplicator.java 50.00% 1 Missing and 1 partial ⚠️
...ar/broker/service/persistent/ShadowReplicator.java 0.00% 2 Missing ⚠️
...n/java/org/apache/pulsar/broker/service/Topic.java 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #23697       +/-   ##
=============================================
- Coverage     73.57%   39.82%   -33.76%     
+ Complexity    32624    13492    -19132     
=============================================
  Files          1877     1798       -79     
  Lines        139502   142848     +3346     
  Branches      15299    16730     +1431     
=============================================
- Hits         102638    56886    -45752     
- Misses        28908    78562    +49654     
+ Partials       7956     7400      -556     
Flag Coverage Δ
inttests 26.80% <30.00%> (+2.21%) ⬆️
systests 23.91% <8.78%> (-0.41%) ⬇️
unittests 36.15% <28.37%> (-36.70%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ache/pulsar/broker/service/AbstractReplicator.java 34.63% <100.00%> (-50.37%) ⬇️
...va/org/apache/pulsar/broker/service/ServerCnx.java 43.95% <ø> (-28.20%) ⬇️
...rg/apache/pulsar/client/impl/PulsarClientImpl.java 59.09% <100.00%> (-15.21%) ⬇️
...ar/client/impl/conf/ProducerConfigurationData.java 51.78% <ø> (-39.98%) ⬇️
...va/org/apache/pulsar/common/protocol/Commands.java 66.63% <100.00%> (-23.96%) ⬇️
...n/java/org/apache/pulsar/broker/service/Topic.java 8.00% <0.00%> (-28.37%) ⬇️
...org/apache/pulsar/broker/service/TransportCnx.java 0.00% <0.00%> (ø)
...er/service/persistent/GeoPersistentReplicator.java 38.05% <50.00%> (-39.97%) ⬇️
...ar/broker/service/persistent/ShadowReplicator.java 0.00% <0.00%> (-58.54%) ⬇️
...va/org/apache/pulsar/client/impl/ProducerImpl.java 68.82% <63.63%> (-14.77%) ⬇️
... and 5 more

... and 1646 files with indirect coverage changes

@codelipenghui
Contributor

@poorbarcode It's a good idea to just use the ledger ID and entry ID for message deduplication. In this case, we can also remove the deduplication state after a ledger gets fully replicated. For example (see the sketch after this list):

  • Replicating data from ledger 1
  • The last entry ID of ledger 1 is 10000
  • After the remote cluster has received replicated data with a ledger ID greater than 1
  • Then we can delete the deduplication state for ledger 1
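
A minimal sketch of that pruning idea, assuming the deduplication state is keyed by the original ledger ID; the data structure and method names are hypothetical, not the broker's actual state:

```java
// Hypothetical sketch of pruning deduplication state per original ledger,
// assuming entries are replicated in order.
import java.util.concurrent.ConcurrentSkipListMap;

class LedgerScopedDedupStateSketch {

    // original ledgerId -> highest replicated entryId seen for that ledger
    private final ConcurrentSkipListMap<Long, Long> highestEntryPerLedger = new ConcurrentSkipListMap<>();

    void recordReplicated(long ledgerId, long entryId) {
        highestEntryPerLedger.merge(ledgerId, entryId, Math::max);
        // Once data from a newer ledger arrives, every older ledger has been
        // fully replicated, so its deduplication state can be dropped.
        highestEntryPerLedger.headMap(ledgerId, false).clear();
    }
}
```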

@poorbarcode poorbarcode merged commit 4ac4f3c into apache:master Feb 21, 2025
104 of 107 checks passed
poorbarcode added a commit that referenced this pull request Feb 21, 2025
…o Deduplication is not appropriate for Geo-Replication (#23697)

(cherry picked from commit 4ac4f3c)
poorbarcode added a commit that referenced this pull request Feb 21, 2025
…o Deduplication is not appropriate for Geo-Replication (#23697)

(cherry picked from commit 4ac4f3c)
poorbarcode added a commit that referenced this pull request Feb 21, 2025
…o Deduplication is not appropriate for Geo-Replication (#23697)

(cherry picked from commit 4ac4f3c)
@BewareMyPower
Contributor

@poorbarcode @gaoran10 @Technoboy- @codelipenghui Why could this PR be cherry-picked to branch-3.0 and branch-4.0?

It changes the PulsarApi.proto by adding a new feature flag. How could 3.0 and 4.0 be declared as LTS releases?

@poorbarcode
Contributor Author

poorbarcode commented Feb 26, 2025

@BewareMyPower

@poorbarcode @gaoran10 @Technoboy- @codelipenghui Why could this PR be cherry-picked to branch-3.0 and branch-4.0?
It changes the PulsarApi.proto by adding a new feature flag. How could 3.0 and 4.0 be declared as LTS releases?

  • This PR fixes a bug, and 3.0.x & 4.0.x also need the fix.
  • The change to the feature flag is used for compatibility.

nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 27, 2025
…o Deduplication is not appropriate for Geo-Replication (apache#23697)

(cherry picked from commit 4ac4f3c)
(cherry picked from commit 307b5c9)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 27, 2025
…o Deduplication is not appropriate for Geo-Replication (apache#23697)

(cherry picked from commit 4ac4f3c)
(cherry picked from commit 26a211c)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2025
…o Deduplication is not appropriate for Geo-Replication (apache#23697)

(cherry picked from commit 4ac4f3c)
(cherry picked from commit 307b5c9)
mukesh-ctds added a commit to datastax/pulsar that referenced this pull request Feb 28, 2025
…ls due to Deduplication is not appropriate for Geo-Replication (apache#23697)"

This reverts commit 2607a49.
mukesh-ctds added a commit to datastax/pulsar that referenced this pull request Feb 28, 2025
…ntly fails due to Deduplication is not appropriate for Geo-Replication (apache#23697)""

This reverts commit a8a1c4a.
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2025
…o Deduplication is not appropriate for Geo-Replication (apache#23697)

(cherry picked from commit 4ac4f3c)
(cherry picked from commit 26a211c)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 2, 2025
…o Deduplication is not appropriate for Geo-Replication (apache#23697)

(cherry picked from commit 4ac4f3c)
(cherry picked from commit 307b5c9)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 2, 2025
…o Deduplication is not appropriate for Geo-Replication (apache#23697)

(cherry picked from commit 4ac4f3c)
(cherry picked from commit 307b5c9)