
Bugfix: replication messages dropped during host shutdown #6143

Conversation

@davidporter-id-au (Contributor) commented Jun 24, 2024

What changed?

Internal details: CDNC-9597

A user reported a problem during a failover in which a workflow was silently dropped during replication at a continue-as-new event, with no corresponding DLQ message. We tracked the likely (expected) cause to a shard movement at that time, which triggers several unpleasant edge conditions through the interaction of the following (sketched in simplified form after this list):

  • A silent dropping of errors due to the shard closing
  • The in-memory offsets being advanced incorrectly during shard-closed events due to some missing control-flow handling
  • The GetReplication API call actually being a write call and tracking offsets (perhaps somewhat surprisingly, I really don't like this API).
  • The GetReplication call then being made with these invalid parameters during a shard-closing event.
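
To make the interaction concrete, here is a minimal sketch of the failure mode; the type and method names (replicationBatch, taskProcessor, processBatch) are illustrative stand-ins rather than the actual Cadence code:

```go
package replication

import "log"

// replicationBatch stands in for the batch returned by GetReplicationMessages.
type replicationBatch struct {
	lastRetrievedMessageID int64
	tasks                  []string
}

// taskProcessor is a stripped-down stand-in for the per-shard processor.
type taskProcessor struct {
	done                   chan struct{}
	lastProcessedMessageID int64
}

// processBatch shows the bug pattern: when the shard is closing, the batch is
// silently discarded, yet the in-memory offset is still advanced, so the
// dropped messages are neither retried nor routed to the DLQ.
func (p *taskProcessor) processBatch(batch *replicationBatch) {
	select {
	case <-p.done:
		// Shard closing: the batch (and any error) is swallowed here.
	default:
		for _, t := range batch.tasks {
			log.Println("applying replication task:", t)
		}
	}
	// The offset advances unconditionally; the next fetch reports this as the
	// last processed ID and effectively acks the messages that were dropped.
	p.lastProcessedMessageID = batch.lastRetrievedMessageID
}
```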

How did you test it?

Tested locally and with unit tests. I was able to reproduce the sequence of events mostly with unit tests.

@davidporter-id-au marked this pull request as ready for review June 24, 2024 02:27
@davidporter-id-au changed the title from Bugfix/replication bug host shutdown to Bugfix: replication messages dropped during host shutdown Jun 24, 2024
-		return
-	case p.requestChan <- &request{
-	}
+	p.requestChan <- &request{
Contributor

let's still leave the select {} block here with p.done check because we don't want this line to be blocking in case of shut down.
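
For reference, a minimal sketch of the non-blocking pattern being requested; the field and method names here are illustrative, not the exact ones in the processor:

```go
package replication

// request stands in for the fetch request handed to the task fetcher.
type request struct{}

type taskProcessor struct {
	done        chan struct{}
	requestChan chan<- *request
}

// sendFetchRequest races the send against p.done so shutdown can never leave
// this goroutine blocked on a requestChan that is no longer being drained.
func (p *taskProcessor) sendFetchRequest(req *request) bool {
	select {
	case <-p.done:
		// Shutting down: don't enqueue a new fetch request, and don't block.
		return false
	case p.requestChan <- req:
		return true
	}
}
```

Checking a shutdown flag before building the request avoids most cases, but the select is what guarantees the send itself can never hang past shutdown.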

Contributor Author

ah, fair, I didn't consider this, let me fix

service/history/replication/task_processor_test.go (review comment outdated; resolved)
@@ -194,19 +194,18 @@ Loop:
 	respChan := make(chan *types.ReplicationMessages, 1)
 	// TODO: when we support prefetching, LastRetrievedMessageID can be different than LastProcessedMessageID

-	select {
-	case <-p.done:
+	if p.isShuttingDown() {
@taylanisikdemir (Contributor) commented Jun 24, 2024

Avoiding populating requestChan if we're already shut down is an improvement, but it won't cover all the known problematic edge cases we discussed offline.

task_fetcher.go is the one making the remote calls periodically for each shard. Not adding a request to requestChan is unlikely to help on its own, because there might already be a pending request for the closing shard.

I'd recommend updating the task fetcher's fetchAndDistributeTasks to skip the shards that are no longer owned by the current host. The task processor could notify the task fetcher on shutdown to deregister a previously sent request.
Otherwise this change only helps when the task processor for a given shard is being closed and hasn't yet sent a request to requestChan.
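
A rough, hypothetical sketch of that suggestion, assuming a Deregister hook and a pendingRequests map that don't exist today (fetchAndDistributeTasks is the real fetcher loop named above, shown only in skeletal form):

```go
package replication

import "sync"

// request stands in for the per-shard fetch request sent to the fetcher.
type request struct{}

// taskFetcher is a skeletal stand-in for the fetcher in task_fetcher.go.
type taskFetcher struct {
	mu              sync.Mutex
	pendingRequests map[int32]*request // keyed by shard ID
}

// Deregister is the hypothetical hook a task processor would call on shutdown
// so the fetcher stops fetching (and acking) for a shard this host no longer owns.
func (f *taskFetcher) Deregister(shardID int32) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.pendingRequests, shardID)
}

// fetchAndDistributeTasks would then build the remote GetReplicationMessages
// call only from the shards that are still registered.
func (f *taskFetcher) fetchAndDistributeTasks() {
	f.mu.Lock()
	shardIDs := make([]int32, 0, len(f.pendingRequests))
	for shardID := range f.pendingRequests {
		shardIDs = append(shardIDs, shardID)
	}
	f.mu.Unlock()

	// ... issue the remote call for shardIDs and distribute the responses ...
	_ = shardIDs
}
```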

Contributor Author

yeah, that's a much better idea 🤦 I didn't consider the race there

Contributor Author

Urgh, this is not going to be easy: we have no in-memory representation of shard ownership, only these shard-ownership-lost errors bubbling up from the DB.

There's no big in-memory map of shard IDs with atomics keeping track of what's present, and even if there were, it'd still be subject to races.

At first glance, I think that to truly fix this we need to change this API to have a proper ack/update rather than doing it on read (it's obviously more complex than that, because the naive impl would explode the number of RPC calls).

If it's OK, let me land this change as-is and we can explore redesigning this replication in a bit more detail.
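
For context on the "proper ack/update" idea, here is a purely hypothetical interface shape that separates reads from acknowledgements; none of this exists in Cadence, and a real design would still need to batch acks to avoid the RPC explosion mentioned above:

```go
package replication

// replicationTask is a placeholder for the replication payload type.
type replicationTask struct{}

// replicationSource is a hypothetical split of today's read-that-also-acks
// API: Read returns messages without recording progress, and Ack explicitly
// records how far a shard has applied them.
type replicationSource interface {
	Read(shardID int32, lastMessageID int64) ([]*replicationTask, error)
	Ack(shardID int32, messageID int64) error
}
```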

Contributor

It's fine to merge this PR once the other comments are addressed, i.e. the shutdown blocking issue.

Let's discuss the true fix as a follow up. Please make sure you capture the context in these comments in the follow up ticket so we don't forget :)

@coveralls commented Jun 27, 2024

Pull Request Test Coverage Report for Build 01905832-5907-4782-b9bc-f369f8e4ddf4

Details

  • 14 of 15 (93.33%) changed or added relevant lines in 1 file are covered.
  • 40 unchanged lines in 14 files lost coverage.
  • Overall coverage remained the same at 71.434%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
service/history/replication/task_processor.go | 14 | 15 | 93.33%

Files with Coverage Reduction | New Missed Lines | %
common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05%
service/history/replication/task_processor.go | 1 | 84.34%
service/history/task/transfer_standby_task_executor.go | 2 | 86.23%
common/mapq/types/policy_collection.go | 2 | 93.06%
common/cache/lru.go | 2 | 93.01%
service/frontend/api/handler.go | 2 | 75.62%
common/membership/hashring.go | 2 | 84.69%
service/matching/tasklist/matcher.go | 2 | 90.91%
common/persistence/historyManager.go | 2 | 66.67%
service/history/handler/handler.go | 3 | 96.23%
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.0%
Covered Lines: 104689
Relevant Lines: 146554

💛 - Coveralls

@davidporter-id-au merged commit c1793e1 into uber:master Jun 28, 2024
19 checks passed
@davidporter-id-au deleted the bugfix/replication-bug-host-shutdown branch June 28, 2024 16:41