Mutiple Bugfixes #794

Merged
wxing1292 merged 5 commits into master from bugfix-conflict
May 30, 2018

@wxing1292 wxing1292 commented May 30, 2018

  • fix workflow timeout version check bug
  • fix task event ID check bug
  • fix integtest race condition

Solves #779 and #770.

@wxing1292 wxing1292 requested a review from samarabbas May 30, 2018 03:42
history, nextPageToken, err = r.getHistory(domainID, execution, common.FirstEventID, replayNextEventID,

var lastFirstEventID int64
for remainingHistorySize := replayNextEventID - common.FirstEventID; remainingHistorySize > 0; {
Contributor

Why can't we rely on the nextPageToken? Are you worried about reading more events? I thought that since we bound the query using nextEventID, there is no way we can read more events than nextEventID. There is a chance we read fewer events if the nextEventID falls in the middle of a batch.

Contributor Author

@wxing1292 wxing1292 May 30, 2018

  1. I did not change the query.
  2. The change here makes sure we only apply events up to the next event ID, including the case where the next event ID is lower than the highest event ID in a batch of history.

Say that in a batch the first event ID is 10, the batch's next event ID is 15, and we only want events up to 13 (exclusive).
This change makes that work. Before, we were actually applying all events from 10 to 15.
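
To make the example above concrete, here is a minimal Go sketch of cutting a batch off at an exclusive event ID bound. It is not the code from this PR: `historyEvent` and `eventsBefore` are hypothetical names, and the sketch assumes each batch is sorted by ascending event ID.

```go
package main

import "fmt"

// historyEvent is a hypothetical stand-in for a history event;
// only the event ID matters for this illustration.
type historyEvent struct {
	EventID int64
}

// eventsBefore returns the prefix of batch whose event IDs are strictly
// below nextEventID. It assumes batch is sorted by ascending event ID.
func eventsBefore(batch []historyEvent, nextEventID int64) []historyEvent {
	for i, e := range batch {
		if e.EventID >= nextEventID {
			return batch[:i]
		}
	}
	return batch
}

func main() {
	// A batch holding events 10 through 14 (the batch's next event ID is 15).
	batch := []historyEvent{{10}, {11}, {12}, {13}, {14}}

	// Only events below 13 should be applied: prints 10, 11, 12.
	for _, e := range eventsBefore(batch, 13) {
		fmt.Println(e.EventID)
	}
}
```

With a bound of 13, events 13 and 14 stay unapplied even though they were read as part of the same batch.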

Contributor

My question is why can't we rely on nextPageToken for the looping condition. I agree with you that last batch could have more events within the batch after nextEventID (although it is not possible unless we have a bug somewhere else). But then it guarantees that there won't be any more batches after that. So it is still ok to loop on the nextPageToken logic.
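
The two positions in this thread are compatible, which a small Go sketch can illustrate: loop on the page token and still stop at the event ID bound inside the last batch. Everything here is hypothetical (`page`, `fetchFunc`, `readUpTo`, and the canned `twoPages` reader); it only assumes that an empty token marks the last page.

```go
package main

import "fmt"

// page is a hypothetical page of history: event IDs plus a continuation token.
type page struct {
	EventIDs  []int64
	NextToken []byte
}

// fetchFunc is a hypothetical paged reader. An empty NextToken in the
// returned page means there are no more pages.
type fetchFunc func(token []byte) (page, error)

// readUpTo loops on the continuation token and keeps only event IDs
// strictly below nextEventID.
func readUpTo(fetch fetchFunc, nextEventID int64) ([]int64, error) {
	var out []int64
	var token []byte
	for {
		p, err := fetch(token)
		if err != nil {
			return nil, err
		}
		for _, id := range p.EventIDs {
			if id >= nextEventID {
				// The bound falls inside this batch, so stop here.
				return out, nil
			}
			out = append(out, id)
		}
		if len(p.NextToken) == 0 {
			return out, nil
		}
		token = p.NextToken
	}
}

// twoPages is a canned in-memory reader used only for the demo.
func twoPages(token []byte) (page, error) {
	if len(token) == 0 {
		return page{EventIDs: []int64{1, 2, 3}, NextToken: []byte("p2")}, nil
	}
	return page{EventIDs: []int64{4, 5, 6}}, nil
}

func main() {
	ids, err := readUpTo(twoPages, 5)
	fmt.Println(ids, err) // prints: [1 2 3 4] <nil>
}
```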


// load mutable state, if mutable state's next event ID <= task ID, will attempt to refresh
// if still mutable state's next event ID <= task ID, will return nil, nil
func loadMutableStateForTransferTask(context *workflowExecutionContext, transferTask *persistence.TransferTaskInfo, metricsClient metrics.Client, logger bark.Logger) (*mutableStateBuilder, error) {
Contributor

Why only transfer task? Won't TimerTask have a similar issue?

Contributor

Is there a way to generalize the check for both timer and transfer tasks? Can we instead rely on version on the task and version on the mutable state?

Contributor Author

the main difference is here

isDecisionRetry := transferTask.TaskType == persistence.TransferTaskTypeDecisionTask &&

vs

isDecisionRetry := timerTask.TaskType == persistence.TaskTypeDecisionTimeout &&

msBuilder.executionInfo.DecisionAttempt > 0

if transferTask.ScheduleID >= msBuilder.GetNextEventID() && !isDecisionRetry {
metricsClient.IncCounter(metrics.TimerQueueProcessorScope, metrics.StaleMutableStateCounter)
Contributor

Probably you want to use a different metric scope here.

Contributor Author

fixed

Contributor

@samarabbas samarabbas left a comment

Overall looks good. It would be nice if you could consolidate the logic for loading mutableState for timer and transfer tasks, and also account for a version check between the task and the mutable state.
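
One possible shape for the consolidation suggested here, as a rough Go sketch rather than the project's actual code. `mutableState`, `isStale`, and `loadForTask` are hypothetical names; the staleness condition mirrors the `ScheduleID >= GetNextEventID() && !isDecisionRetry` check quoted in this review, and the caller decides whether the task is the decision task type for its queue. The version check mentioned above is not modeled.

```go
package main

import "fmt"

// mutableState is a hypothetical, minimal view of workflow mutable state.
type mutableState struct {
	NextEventID     int64
	DecisionAttempt int64
}

// isStale reports whether the state is behind the task. A decision retry
// (decision task with DecisionAttempt > 0) is never treated as stale.
func isStale(ms *mutableState, scheduleID int64, isDecisionTask bool) bool {
	isDecisionRetry := isDecisionTask && ms.DecisionAttempt > 0
	return scheduleID >= ms.NextEventID && !isDecisionRetry
}

// loadForTask is a hypothetical helper that both timer and transfer
// processing could share. load returns the cached state and refresh
// reloads it. It returns (nil, nil) if the state is still behind the
// task after one refresh.
func loadForTask(scheduleID int64, isDecisionTask bool,
	load, refresh func() (*mutableState, error)) (*mutableState, error) {

	ms, err := load()
	if err != nil {
		return nil, err
	}
	if !isStale(ms, scheduleID, isDecisionTask) {
		return ms, nil
	}

	// Cached state is behind the task: refresh once and check again.
	ms, err = refresh()
	if err != nil {
		return nil, err
	}
	if isStale(ms, scheduleID, isDecisionTask) {
		return nil, nil
	}
	return ms, nil
}

func main() {
	cached := &mutableState{NextEventID: 5}
	fresh := &mutableState{NextEventID: 10}

	ms, err := loadForTask(7, false,
		func() (*mutableState, error) { return cached, nil },
		func() (*mutableState, error) { return fresh, nil })
	fmt.Println(ms.NextEventID, err) // prints: 10 <nil>
}
```

Passing the task-type decision in as a boolean keeps the helper independent of the two task structs, which is where the timer and transfer paths differ in the snippets quoted earlier in this review.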

@wxing1292 wxing1292 merged commit edfa972 into master May 30, 2018
@wxing1292 wxing1292 deleted the bugfix-conflict branch May 30, 2018 21:40