
Bugfix: replication messages dropped during host shutdown #6143

Conversation

@davidporter-id-au (Contributor) commented Jun 24, 2024

What changed?

Internal details: CDNC-9597

A user reported a problem during a failover in which a workflow was silently dropped during replication at a continue-as-new event, with no corresponding DLQ message. We tracked the likely (expected) cause to a shard movement at that time, which triggers several unpleasant edge conditions through the interaction of the following (sketched in simplified form after this list):

  • A silent dropping of errors due to the shard closing
  • The in-memory offsets being advanced incorrectly during shard-closed events due to some missing control-flow handling
  • The GetReplication API call actually being a write call and tracking offsets (perhaps somewhat surprisingly, I really don't like this API).
  • The GetReplication call then being made with these invalid parameters during a shard-closing event.
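
To make the interaction concrete, here is a minimal sketch of the failure mode; the type and method names (replicationBatch, taskProcessor, processBatch) are illustrative stand-ins rather than the actual Cadence code:

```go
package replication

import "log"

// replicationBatch stands in for the batch returned by GetReplicationMessages.
type replicationBatch struct {
	lastRetrievedMessageID int64
	tasks                  []string
}

// taskProcessor is a stripped-down stand-in for the per-shard processor.
type taskProcessor struct {
	done                   chan struct{}
	lastProcessedMessageID int64
}

// processBatch shows the bug pattern: when the shard is closing, the batch is
// silently discarded, yet the in-memory offset is still advanced, so the
// dropped messages are neither retried nor routed to the DLQ.
func (p *taskProcessor) processBatch(batch *replicationBatch) {
	select {
	case <-p.done:
		// Shard closing: the batch (and any error) is swallowed here.
	default:
		for _, t := range batch.tasks {
			log.Println("applying replication task:", t)
		}
	}
	// The offset advances unconditionally; the next fetch reports this as the
	// last processed ID and effectively acks the messages that were dropped.
	p.lastProcessedMessageID = batch.lastRetrievedMessageID
}
```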

How did you test it?

Tested locally and with unit tests. I was able to reproduce the sequence of events mostly with unit tests.

@davidporter-id-au marked this pull request as ready for review June 24, 2024 02:27
@davidporter-id-au changed the title from Bugfix/replication bug host shutdown to Bugfix: replication messages dropped during host shutdown Jun 24, 2024
-		return
-	case p.requestChan <- &request{
-	}
+	p.requestChan <- &request{
Contributor

let's still leave the select {} block here with p.done check because we don't want this line to be blocking in case of shut down.
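
For reference, a minimal sketch of the non-blocking pattern being requested; the field and method names here are illustrative, not the exact ones in the processor:

```go
package replication

// request stands in for the fetch request handed to the task fetcher.
type request struct{}

type taskProcessor struct {
	done        chan struct{}
	requestChan chan<- *request
}

// sendFetchRequest races the send against p.done so shutdown can never leave
// this goroutine blocked on a requestChan that is no longer being drained.
func (p *taskProcessor) sendFetchRequest(req *request) bool {
	select {
	case <-p.done:
		// Shutting down: don't enqueue a new fetch request, and don't block.
		return false
	case p.requestChan <- req:
		return true
	}
}
```

Checking a shutdown flag before building the request avoids most cases, but the select is what guarantees the send itself can never hang past shutdown.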

Contributor Author

ah, fair, I didn't consider this, let me fix

service/history/replication/task_processor_test.go (review comment outdated; resolved)
@@ -194,19 +194,18 @@ Loop:
 	respChan := make(chan *types.ReplicationMessages, 1)
 	// TODO: when we support prefetching, LastRetrievedMessageID can be different than LastProcessedMessageID

-	select {
-	case <-p.done:
+	if p.isShuttingDown() {
@taylanisikdemir (Contributor) commented Jun 24, 2024

Avoiding populating requestChan if we're already shut down is an improvement, but it won't cover all the known problematic edge cases we discussed offline.

task_fetcher.go is the one making the remote calls periodically for each shard. Not adding a request to requestChan is unlikely to help on its own, because there might already be a pending request for the closing shard.

I'd recommend updating the task fetcher's fetchAndDistributeTasks to skip the shards that are no longer owned by the current host. The task processor could notify the task fetcher on shutdown to deregister a previously sent request.
Otherwise this change only helps when the task processor for a given shard is being closed and hasn't yet sent a request to requestChan.
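
A rough, hypothetical sketch of that suggestion, assuming a Deregister hook and a pendingRequests map that don't exist today (fetchAndDistributeTasks is the real fetcher loop named above, shown only in skeletal form):

```go
package replication

import "sync"

// request stands in for the per-shard fetch request sent to the fetcher.
type request struct{}

// taskFetcher is a skeletal stand-in for the fetcher in task_fetcher.go.
type taskFetcher struct {
	mu              sync.Mutex
	pendingRequests map[int32]*request // keyed by shard ID
}

// Deregister is the hypothetical hook a task processor would call on shutdown
// so the fetcher stops fetching (and acking) for a shard this host no longer owns.
func (f *taskFetcher) Deregister(shardID int32) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.pendingRequests, shardID)
}

// fetchAndDistributeTasks would then build the remote GetReplicationMessages
// call only from the shards that are still registered.
func (f *taskFetcher) fetchAndDistributeTasks() {
	f.mu.Lock()
	shardIDs := make([]int32, 0, len(f.pendingRequests))
	for shardID := range f.pendingRequests {
		shardIDs = append(shardIDs, shardID)
	}
	f.mu.Unlock()

	// ... issue the remote call for shardIDs and distribute the responses ...
	_ = shardIDs
}
```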

Contributor Author

yeah, that's a much better idea 🤦 I didn't consider the race there

Contributor Author

Urgh, this is not going to be easy: we have no in-memory representation of shard ownership, only these shard-ownership-lost errors bubbling up from the DB.

There's no big in-memory map of shard IDs with atomics keeping track of what's present, and even if there were, it'd still be subject to races.

At first glance, I think that to truly fix this we need to change this API to have a proper ack/update rather than doing it on read (it's obviously more complex than that, because the naive impl would explode the number of RPC calls).

If it's OK, let me land this change as-is and we can explore redesigning this replication in a bit more detail.
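
For context on the "proper ack/update" idea, here is a purely hypothetical interface shape that separates reads from acknowledgements; none of this exists in Cadence, and a real design would still need to batch acks to avoid the RPC explosion mentioned above:

```go
package replication

// replicationTask is a placeholder for the replication payload type.
type replicationTask struct{}

// replicationSource is a hypothetical split of today's read-that-also-acks
// API: Read returns messages without recording progress, and Ack explicitly
// records how far a shard has applied them.
type replicationSource interface {
	Read(shardID int32, lastMessageID int64) ([]*replicationTask, error)
	Ack(shardID int32, messageID int64) error
}
```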

Contributor

It's fine to merge this PR once the other comments are addressed, i.e. the shutdown blocking issue.

Let's discuss the true fix as a follow up. Please make sure you capture the context in these comments in the follow up ticket so we don't forget :)

@coveralls commented Jun 27, 2024

Pull Request Test Coverage Report for Build 01905832-5907-4782-b9bc-f369f8e4ddf4

Details

  • 14 of 15 (93.33%) changed or added relevant lines in 1 file are covered.
  • 40 unchanged lines in 14 files lost coverage.
  • Overall coverage remained the same at 71.434%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
service/history/replication/task_processor.go | 14 | 15 | 93.33%

Files with Coverage Reduction | New Missed Lines | %
common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05%
service/history/replication/task_processor.go | 1 | 84.34%
service/history/task/transfer_standby_task_executor.go | 2 | 86.23%
common/mapq/types/policy_collection.go | 2 | 93.06%
common/cache/lru.go | 2 | 93.01%
service/frontend/api/handler.go | 2 | 75.62%
common/membership/hashring.go | 2 | 84.69%
service/matching/tasklist/matcher.go | 2 | 90.91%
common/persistence/historyManager.go | 2 | 66.67%
service/history/handler/handler.go | 3 | 96.23%
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.0%
Covered Lines: 104689
Relevant Lines: 146554

💛 - Coveralls

@davidporter-id-au merged commit c1793e1 into uber:master Jun 28, 2024
19 checks passed
@davidporter-id-au deleted the bugfix/replication-bug-host-shutdown branch June 28, 2024 16:41