ringpop: update hashring immediately on ring change #3130
Merged
What changed?
When a Cadence host is added, removed, or restarted, every other host in the cluster receives a change notification via ringpop. This is the primary mechanism by which discovery / failure detection works today. On receiving such a notification, every node updates its consistent hash ring so that future requests are routed to the correct owner. It's critical that the hashring update happens as soon as possible during deployments, restarts, etc. to avoid downtime and availability drops. Currently, there is an optimization that avoids too many hashring updates within a short span of time, but it is hurting availability.
This patch fixes that by updating the ring as soon as a notification is received. In addition, a dedup map is added to the resolver to skip the ring update when (a) an event changes nothing, or (b) the host added or removed belongs to a different role. This should mitigate the problem of too many updates within a short span of time.
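The dedup check described above can be illustrated roughly as follows. This is a minimal sketch, not the code in this patch: `HostChange`, `Resolver`, `NewResolver`, `HandleEvent`, and the `Refreshes` counter are hypothetical names, and the event shape is an assumption rather than ringpop's actual notification type.

```go
package main

import "sync"

// HostChange is a hypothetical, simplified view of one entry in a
// membership change notification. It is not ringpop's real event type.
type HostChange struct {
	Addr    string // host address, e.g. "10.0.0.1:7935"
	Role    string // role (service) the host belongs to
	Removed bool   // true if the host left, false if it joined
}

// Resolver is a hypothetical per-role resolver. The members map doubles
// as the dedup map: it records which hosts the ring already reflects.
type Resolver struct {
	mu        sync.Mutex
	role      string
	members   map[string]struct{}
	Refreshes int // counts ring rebuilds, for illustration only
}

// NewResolver returns a resolver that tracks hosts for a single role.
func NewResolver(role string) *Resolver {
	return &Resolver{role: role, members: make(map[string]struct{})}
}

// HandleEvent applies a change notification immediately and reports
// whether the ring had to be rebuilt. Events that touch only other roles,
// or that repeat already-known state, leave the ring untouched.
func (r *Resolver) HandleEvent(changes []HostChange) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	changed := false
	for _, c := range changes {
		if c.Role != r.role {
			continue // (b) host belongs to a different role
		}
		_, known := r.members[c.Addr]
		switch {
		case c.Removed && known:
			delete(r.members, c.Addr)
			changed = true
		case !c.Removed && !known:
			r.members[c.Addr] = struct{}{}
			changed = true
		}
		// Otherwise: (a) nothing changes for this host.
	}
	if changed {
		r.Refreshes++ // a real resolver would rebuild the hash ring here
	}
	return changed
}
```

In this sketch, a duplicate "added" event for a known host, or any event for a host in another role, returns `false` and does not trigger a rebuild, while a genuine membership change is applied without delay.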
Why?
To reduce availability dips during deployments and host restarts.
How did you test it?
Tested on localhost as well as in a staging environment.
Potential risks
In the worst case, discovery / failure detection could break. This would mean unavailability, or hosts continuously stealing shards from each other.