Multiple Bugfixes #794
* fix workflow timeout version check bug
* fix task event ID check bug
* fix integtest race condition
service/history/conflictResolver.go
Outdated
history, nextPageToken, err = r.getHistory(domainID, execution, common.FirstEventID, replayNextEventID,

var lastFirstEventID int64
for remainingHistorySize := replayNextEventID - common.FirstEventID; remainingHistorySize > 0; {
Why can't we rely on the nextPageToken? Are you worried about reading more events? I thought that since we are going to bound the query using nextEventID, there is no way we can read more events than nextEventID. There is a chance we read fewer events if the nextEventID falls in the middle of a batch.
- I did not change the query.
- The change here makes sure we actually apply events only up to the next event ID, for the case where the next event ID < the highest event ID in a batch of history.
Say that in a batch the first event ID is 10, the next event ID is 15, and we only want events up to 13 (exclusive).
Then this change makes it work, whereas before we were actually applying all events from 10 -> 15.
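To make the example above concrete, here is a minimal sketch of trimming one batch at an exclusive upper bound. It is not the code from this PR: `historyEvent` and `trimBatch` are hypothetical names, and the sketch assumes the events in a batch are sorted by event ID.

```go
package main

import "fmt"

// historyEvent is a hypothetical stand-in for a history event;
// only the event ID matters for this sketch.
type historyEvent struct {
	EventID int64
}

// trimBatch returns the prefix of batch whose event IDs are strictly
// below nextEventID (exclusive bound). It assumes batch is sorted by EventID.
func trimBatch(batch []historyEvent, nextEventID int64) []historyEvent {
	for i, e := range batch {
		if e.EventID >= nextEventID {
			return batch[:i]
		}
	}
	return batch
}

func main() {
	// One batch holding events 10 through 14, so the batch's own next event ID is 15.
	batch := []historyEvent{{10}, {11}, {12}, {13}, {14}}

	// Only events below 13 should be applied.
	for _, e := range trimBatch(batch, 13) {
		fmt.Println(e.EventID)
	}
}
```

With a bound of 13, only events 10, 11 and 12 are kept; without the trim, the whole batch would be applied.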
My question is why can't we rely on nextPageToken for the looping condition. I agree with you that last batch could have more events within the batch after nextEventID (although it is not possible unless we have a bug somewhere else). But then it guarantees that there won't be any more batches after that. So it is still ok to loop on the nextPageToken logic.
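One way to picture the loop shape suggested here is to drive the loop with the page token and guard against events at or past nextEventID inside the last batch. This is only a sketch under stated assumptions: `fetchPage` and `replayEventIDs` are hypothetical, the store is an in-memory slice of batches, and the "query bounded by nextEventID" is modeled as returning only batches whose first event ID is below the bound.

```go
package main

import "fmt"

// fetchPage is a hypothetical paged reader. The token is a one-byte batch
// index; an empty token means "start from the first batch". Only batches
// whose first event ID is below nextEventID are returned, and the returned
// token is empty when no further such batch exists.
func fetchPage(batches [][]int64, token []byte, nextEventID int64) ([]int64, []byte) {
	i := 0
	if len(token) > 0 {
		i = int(token[0])
	}
	if i >= len(batches) || len(batches[i]) == 0 || batches[i][0] >= nextEventID {
		return nil, nil
	}
	var next []byte
	if j := i + 1; j < len(batches) && len(batches[j]) > 0 && batches[j][0] < nextEventID {
		next = []byte{byte(j)}
	}
	return batches[i], next
}

// replayEventIDs loops until the page token is empty and skips any event
// at or beyond nextEventID inside the final batch.
func replayEventIDs(batches [][]int64, nextEventID int64) []int64 {
	var applied []int64
	var token []byte
	for {
		ids, next := fetchPage(batches, token, nextEventID)
		for _, id := range ids {
			if id >= nextEventID {
				break // the rest of this batch is past the bound
			}
			applied = append(applied, id)
		}
		token = next
		if len(token) == 0 {
			break
		}
	}
	return applied
}

func main() {
	batches := [][]int64{{1, 2, 3}, {4, 5}, {10, 11, 12, 13, 14}, {15, 16}}
	fmt.Println(replayEventIDs(batches, 13))
}
```

In this sketch the batch starting at 15 is never fetched, because the bounded query ends the token chain at the batch that straddles the bound.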

// load mutable state, if mutable state's next event ID <= task ID, will attempt to refresh
// if still mutable state's next event ID <= task ID, will return nil, nil
func loadMutableStateForTransferTask(context *workflowExecutionContext, transferTask *persistence.TransferTaskInfo, metricsClient metrics.Client, logger bark.Logger) (*mutableStateBuilder, error) {
Why only transfer task? Won't TimerTask have a similar issue?
Is there a way to generalize the check for both timer and transfer tasks? Can we instead rely on version on the task and version on the mutable state?
the main difference is here
isDecisionRetry := transferTask.TaskType == persistence.TransferTaskTypeDecisionTask &&
vs
isDecisionRetry := timerTask.TaskType == persistence.TaskTypeDecisionTimeout &&
service/history/failoverCheck.go
Outdated
msBuilder.executionInfo.DecisionAttempt > 0

if transferTask.ScheduleID >= msBuilder.GetNextEventID() && !isDecisionRetry {
	metricsClient.IncCounter(metrics.TimerQueueProcessorScope, metrics.StaleMutableStateCounter)
Probably you want to use a different metric scope here.
fixed
Overall looks good. It would be nice if you could consolidate the mutableState loading logic for timer and transfer tasks, and also account for a version check of the task's version against the mutable state's version.
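A rough sketch of what a consolidated load could look like, assuming the only per-task difference is how "is this a decision task" is derived (as noted earlier in the thread). All names here (`mutableState`, `taskView`, `isStale`, `loadForTask`) are hypothetical simplifications, not the types in service/history. The version comparison is deliberately left out, since its exact semantics are not spelled out in this conversation.

```go
package main

import "fmt"

// mutableState is a hypothetical, reduced view of the mutable state.
type mutableState struct {
	NextEventID     int64
	DecisionAttempt int64
}

// taskView is a hypothetical common view over timer and transfer tasks.
// Each caller sets IsDecisionTask from its own task type (for example a
// decision transfer task, or a decision timeout timer task).
type taskView struct {
	ScheduleID     int64
	IsDecisionTask bool
}

// isStale mirrors the check quoted in the diff: the task refers to an
// event the mutable state has not reached, and it is not a decision retry.
func isStale(ms *mutableState, t taskView) bool {
	isDecisionRetry := t.IsDecisionTask && ms.DecisionAttempt > 0
	return t.ScheduleID >= ms.NextEventID && !isDecisionRetry
}

// loadForTask loads the state, refreshes once if it looks stale, and
// returns nil, nil if it is still stale after the refresh.
func loadForTask(load, refresh func() (*mutableState, error), t taskView) (*mutableState, error) {
	ms, err := load()
	if err != nil {
		return nil, err
	}
	if !isStale(ms, t) {
		return ms, nil
	}
	ms, err = refresh()
	if err != nil {
		return nil, err
	}
	if isStale(ms, t) {
		return nil, nil
	}
	return ms, nil
}

func main() {
	cached := func() (*mutableState, error) { return &mutableState{NextEventID: 5}, nil }
	fresh := func() (*mutableState, error) { return &mutableState{NextEventID: 20}, nil }

	ms, err := loadForTask(cached, fresh, taskView{ScheduleID: 10})
	fmt.Println(ms != nil, err)
}
```

Both queue processors could then build a `taskView` from their own task type and share the same refresh-once behavior, with the metric scope passed in by the caller.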
2d8e5f2 to 898a060
solve #779 #770