Skip to content

Conversation

ReubenBond
Copy link
Member

No description provided.

@ReubenBond ReubenBond changed the title Fix race introduced in membership [WIP] Fix race introduced in membership Jul 1, 2019
@ReubenBond ReubenBond force-pushed the fix/membership-race branch from 8ef889c to 4437b17 Compare July 2, 2019 18:12
@ReubenBond ReubenBond force-pushed the fix/membership-race branch from 4437b17 to 6020aa0 Compare July 2, 2019 20:00
@ReubenBond ReubenBond changed the title [WIP] Fix race introduced in membership Fix race introduced in membership Jul 2, 2019
@ReubenBond
Copy link
Member Author

Ready for review

@@ -463,7 +479,7 @@ public async Task<AddressAndTag> RegisterAsync(ActivationAddress address, bool s
// on all silos other than first, we insert a retry delay and recheck owner before forwarding
if (hopCount > 0 && forwardAddress != null)
{
await Task.Delay(RETRY_DELAY);
await this.RefreshMembership();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of an arbitrary wait, refresh membership once and wait for that refresh to propagate to the directory

default:
return false;
}
return status != SiloStatus.None;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could have caused incorrect "forwarding to remote cluster" type messages beforehand. It still can (eg before a silo is Joining or after it's been cleaned up, but much less likely)


await Task.Delay(TimeSpan.FromMilliseconds(10));
} while (this.snapshot.Version < v || this.snapshot.Version < this.membershipTableManager.MembershipTableSnapshot.Version);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refresh membership (at least) once & wait for the version to propagate through

if (status.IsTerminating())
var timeoutTask = Task.Delay(GossipTimeout);
var task = await Task.WhenAny(gossipTask, timeoutTask);
if (ReferenceEquals(task, timeoutTask))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always wait up to 3 seconds for gossip to complete. May slow shutdown/startup, but also should give us generally more predictable behavior

}
else if (this.log.IsEnabled(LogLevel.Debug))
{
this.log.LogDebug("Timed out while gossiping status to other silos after {Timeout}", GossipTimeout);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but only log timeouts if we're shutting down or if debug logging is enabled

}

await pending;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want at most 1 concurrent refresh

@jason-bragg jason-bragg merged commit 5d9d082 into dotnet:master Jul 5, 2019
ReubenBond added a commit to ReubenBond/orleans that referenced this pull request Jul 9, 2019
sergeybykov pushed a commit that referenced this pull request Jul 9, 2019
@github-actions github-actions bot locked and limited conversation to collaborators Dec 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants