-
Notifications
You must be signed in to change notification settings - Fork 800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check for resurrected activities during RecordActivityTaskStarted #4806
Conversation
Pull Request Test Coverage Report for Build 0180d1f5-182c-4cdd-835e-03a89a32cbdb
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the release plan? Do you intend to have a 'monitor-only' mode?
if err != nil { | ||
return nil, err | ||
} | ||
event := item.(*types.HistoryEvent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, maybe typecheck this is ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was moved (from service/history/task/timer_active_task_executor.go
) as is. Going to leave it for now. Ideally we should introduce generics here with go 1.18.
// RecordActivityTaskStarted is already past scheduleToClose timeout. | ||
// If at this point pending activity is still in mutable state it may be resurrected. | ||
// Otherwise it would be completed or timed out already. | ||
if isRunning && e.timeSource.Now().After(ai.ScheduledTime.Add(time.Duration(ai.ScheduleToCloseTimeout)*time.Second)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the risk of this being a false-positive due to clock-skew or GC or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the risk of this being a false-positive due to clock-skew or GC or something?
This is just a trigger check. We don't want to run those checks often as they are potentially expensive with large histories. Worst case - wasted resources, increased latencies.
What changed?
historyEngine.RecordActivityTaskStarted
if it is already past scheduleToClose timeout.Why?
Similarly to timer task processing, resurrected activity can sometimes be observed when matching triggers RecordActivityTaskStarted. In those cases activity gets started twice which result in non-deterministic errors, failures to replicate to remote cluster followed by DLQ messages.
To mitigate that we could reuse logic used in timer task processing and scan for previous history if such activity was already completed before. If so, delete it from mutable state and return that it does not exist.
As history scan may be expensive - only do it if timing of RecordActivityTaskStarted is suspicious. That is - if it is attempted to start after it should already be completed or timed out.
How did you test it?
Added additional unit test for history engine simulating resurrected activity condition.
Potential risks
Release notes
Documentation Changes