Check for resurrected activities during RecordActivityTaskStarted #4806

Merged 6 commits on May 17, 2022

@vytautas-karpavicius vytautas-karpavicius commented Apr 27, 2022

What changed?

  • Moved the resurrected activity check code out of task processing.
  • Added a resurrected activity check in historyEngine.RecordActivityTaskStarted that runs when the call arrives after the activity's scheduleToClose timeout has already passed.

Why?
Similarly to timer task processing, a resurrected activity can sometimes be observed when matching triggers RecordActivityTaskStarted. In those cases the activity gets started twice, which results in non-deterministic errors and failures to replicate to the remote cluster, followed by DLQ messages.

To mitigate that, we can reuse the logic from timer task processing and scan the earlier history to check whether the activity was already completed. If so, delete it from mutable state and return that it does not exist.
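
As a rough illustration of that scan, here is a minimal sketch. The `HistoryEvent`, `MutableState`, and `checkResurrected` names are hypothetical simplifications, not the types used in this PR, and treating failed, timed-out, and canceled events as "closed" alongside completed ones is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// EventType is a simplified stand-in for history event types (hypothetical).
type EventType int

const (
	ActivityTaskScheduled EventType = iota
	ActivityTaskStarted
	ActivityTaskCompleted
	ActivityTaskFailed
	ActivityTaskTimedOut
	ActivityTaskCanceled
)

// HistoryEvent is a trimmed-down, hypothetical history event.
type HistoryEvent struct {
	ID          int64
	Type        EventType
	ScheduledID int64 // for non-scheduled events: ID of the related scheduled event
}

// MutableState holds pending activities keyed by their scheduled event ID (hypothetical).
type MutableState struct {
	PendingActivities map[int64]struct{}
}

// ErrActivityNotFound signals that the activity does not exist (any more).
var ErrActivityNotFound = errors.New("activity task not found")

// isActivityClosed scans the history for an event that closed the given activity.
func isActivityClosed(history []HistoryEvent, scheduleID int64) bool {
	for _, e := range history {
		switch e.Type {
		case ActivityTaskCompleted, ActivityTaskFailed, ActivityTaskTimedOut, ActivityTaskCanceled:
			if e.ScheduledID == scheduleID {
				return true
			}
		}
	}
	return false
}

// checkResurrected removes a pending activity that history shows as already
// closed and reports it as not found; otherwise it leaves the state untouched.
func checkResurrected(ms *MutableState, history []HistoryEvent, scheduleID int64) error {
	if _, pending := ms.PendingActivities[scheduleID]; !pending {
		return ErrActivityNotFound
	}
	if isActivityClosed(history, scheduleID) {
		delete(ms.PendingActivities, scheduleID)
		return ErrActivityNotFound
	}
	return nil
}

func main() {
	ms := &MutableState{PendingActivities: map[int64]struct{}{5: {}, 8: {}}}
	history := []HistoryEvent{
		{ID: 5, Type: ActivityTaskScheduled},
		{ID: 6, Type: ActivityTaskStarted, ScheduledID: 5},
		{ID: 7, Type: ActivityTaskCompleted, ScheduledID: 5},
		{ID: 8, Type: ActivityTaskScheduled},
	}
	fmt.Println("activity 5:", checkResurrected(ms, history, 5)) // resurrected
	fmt.Println("activity 8:", checkResurrected(ms, history, 8)) // still pending
}
```

In this sketch the resurrected entry is dropped from the pending map as a side effect, so a later call for the same ID also reports it as not found.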

As the history scan may be expensive, only do it if the timing of RecordActivityTaskStarted is suspicious, that is, if the start is attempted after the activity should already have completed or timed out.
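
The gating idea can be sketched as follows. This is an illustration under assumptions: `activityInfo`, `offset`, `isSuspicious`, and `checkIfSuspicious` are hypothetical names, and the `scan` callback stands in for the real history scan.

```go
package main

import (
	"fmt"
	"time"
)

// activityInfo is a hypothetical subset of the pending-activity record.
type activityInfo struct {
	ScheduledTime          time.Time
	ScheduleToCloseTimeout int32 // seconds
}

// offset returns base shifted by the given number of seconds.
func offset(base time.Time, seconds int) time.Time {
	return base.Add(time.Duration(seconds) * time.Second)
}

// isSuspicious reports whether now is strictly after the activity's
// schedule-to-close deadline.
func isSuspicious(now time.Time, ai activityInfo) bool {
	deadline := offset(ai.ScheduledTime, int(ai.ScheduleToCloseTimeout))
	return now.After(deadline)
}

// checkIfSuspicious runs the potentially expensive scan only when the timing
// is suspicious. scan is a hypothetical callback that returns true when
// history shows the activity was already closed.
func checkIfSuspicious(now time.Time, ai activityInfo, scan func() bool) (resurrected, scanned bool) {
	if !isSuspicious(now, ai) {
		return false, false // normal path: no history scan at all
	}
	return scan(), true
}

func main() {
	ai := activityInfo{
		ScheduledTime:          time.Date(2022, 4, 27, 12, 0, 0, 0, time.UTC),
		ScheduleToCloseTimeout: 60,
	}
	alreadyClosed := func() bool { return true }

	// Start attempted within the timeout: the scan is skipped.
	fmt.Println(checkIfSuspicious(offset(ai.ScheduledTime, 30), ai, alreadyClosed))
	// Start attempted after the timeout: the scan runs.
	fmt.Println(checkIfSuspicious(offset(ai.ScheduledTime, 61), ai, alreadyClosed))
}
```

A false positive in `isSuspicious` only costs one extra scan, since the decision to drop the activity still depends on what the scan finds.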

How did you test it?
Added a unit test for the history engine that simulates the resurrected activity condition.

Potential risks

Release notes

Documentation Changes

coveralls commented Apr 27, 2022

Pull Request Test Coverage Report for Build 0180d1f5-182c-4cdd-835e-03a89a32cbdb

  • 27 of 127 (21.26%) changed or added relevant lines in 3 files are covered.
  • 80 unchanged lines in 13 files lost coverage.
  • Overall coverage decreased (-0.08%) to 56.867%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| service/history/historyEngine.go | 25 | 30 | 83.33% |
| service/history/execution/integrity.go | 0 | 95 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| common/persistence/executionManager.go | 2 | 77.82% |
| common/persistence/statsComputer.go | 2 | 96.43% |
| common/util.go | 2 | 51.17% |
| service/history/queue/timer_queue_processor.go | 2 | 58.37% |
| service/history/queue/transfer_queue_processor.go | 2 | 56.86% |
| service/history/task/transfer_active_task_executor.go | 2 | 71.93% |
| service/matching/matcher.go | 2 | 91.46% |
| common/task/fifoTaskScheduler.go | 3 | 84.54% |
| service/history/task/fetcher.go | 3 | 86.67% |
| service/history/shard/context.go | 9 | 64.98% |

Totals
Change from base Build 6f1aeae3-f27e-4a1c-8abc-efdd71843d0e: -0.08%
Covered Lines: 83873
Relevant Lines: 147489

💛 - Coveralls

@vytautas-karpavicius vytautas-karpavicius requested a review from a team April 27, 2022 11:54
@vytautas-karpavicius vytautas-karpavicius marked this pull request as ready for review April 27, 2022 11:54
@davidporter-id-au davidporter-id-au left a comment

What's the release plan? Do you intend to have a 'monitor-only' mode?

if err != nil {
	return nil, err
}
event := item.(*types.HistoryEvent)

Contributor

nit: maybe check that this type assertion is ok?

Contributor Author

This code was moved as is (from service/history/task/timer_active_task_executor.go). Going to leave it for now. Ideally we should introduce generics here with Go 1.18.
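
For reference, the two options mentioned in this thread could look roughly like the sketch below. It is not the PR's code: `HistoryEvent` is a stand-in for `types.HistoryEvent`, and `asHistoryEvent`, `Iterator`, `NewIterator`, and `ErrNoMoreItems` are hypothetical names. The generic part needs Go 1.18 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// HistoryEvent is a stand-in for types.HistoryEvent (hypothetical, trimmed down).
type HistoryEvent struct{ EventID int64 }

// ErrNoMoreItems is returned when the iterator is exhausted (hypothetical).
var ErrNoMoreItems = errors.New("no more items")

// Option 1: a checked ("comma-ok") type assertion that returns an error
// instead of panicking on an unexpected item type.
func asHistoryEvent(item interface{}) (*HistoryEvent, error) {
	event, ok := item.(*HistoryEvent)
	if !ok {
		return nil, fmt.Errorf("unexpected item type %T, want *HistoryEvent", item)
	}
	return event, nil
}

// Option 2: a generic iterator, so callers need no type assertion at all.
type Iterator[T any] struct {
	items []T
	next  int
}

// NewIterator builds an iterator over the given items.
func NewIterator[T any](items ...T) *Iterator[T] {
	return &Iterator[T]{items: items}
}

// Next returns the next item, or ErrNoMoreItems once the items run out.
func (it *Iterator[T]) Next() (T, error) {
	var zero T
	if it.next >= len(it.items) {
		return zero, ErrNoMoreItems
	}
	item := it.items[it.next]
	it.next++
	return item, nil
}

func main() {
	if _, err := asHistoryEvent("not an event"); err != nil {
		fmt.Println("checked assertion:", err)
	}

	it := NewIterator(&HistoryEvent{EventID: 1}, &HistoryEvent{EventID: 2})
	for {
		event, err := it.Next()
		if errors.Is(err, ErrNoMoreItems) {
			break
		}
		fmt.Println("event", event.EventID)
	}
}
```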

// RecordActivityTaskStarted is already past scheduleToClose timeout.
// If at this point pending activity is still in mutable state it may be resurrected.
// Otherwise it would be completed or timed out already.
if isRunning && e.timeSource.Now().After(ai.ScheduledTime.Add(time.Duration(ai.ScheduleToCloseTimeout)*time.Second)) {

Contributor

What's the risk of this being a false-positive due to clock-skew or GC or something?

Contributor Author

> What's the risk of this being a false-positive due to clock-skew or GC or something?

This is just a trigger check. We don't want to run those checks often, as they are potentially expensive with large histories. Worst case: wasted resources and increased latencies.

@vytautas-karpavicius vytautas-karpavicius merged commit ee5461b into master May 17, 2022
@vytautas-karpavicius vytautas-karpavicius deleted the resurrection-check branch May 17, 2022 13:25
3 participants