-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ringhash: fix bug where ring hash can be stuck in transient failure despite having available endpoints #7364
Conversation
Depends on #7334. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #7364 +/- ##
==========================================
+ Coverage 81.58% 81.98% +0.40%
==========================================
Files 358 360 +2
Lines 27388 27528 +140
==========================================
+ Hits 22345 22570 +225
+ Misses 3828 3773 -55
+ Partials 1215 1185 -30
|
fd418d4
to
4a32e08
Compare
Fixes grpc#7363, which has a thorough problem description. The test cases added reproduce the error without the fix, but not reliably, since the ring may be constructed in a way where we don't end up getting stuck. Running the test multiple times definitely end up triggering the problem. The solution is to keep a separate list of available addresses, and when there are no picks and we trigger connection attempts, try them one at a time. RELEASE NOTES: ringhash: fix bug that could prevent the balancer to recover from transient failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment, could you please take a look?
Overall looks good to me. One small bummer is that we are adding yet another field to store the subconn(s). I was thinking hard if making AddressMap an ordered Map (by morphing Keys() and Values()) would help. But then indexing the map with numbers becomes an issue.
When we publish relnotes, it would probably help users with some more additional context. for eg: this bug could have effected you if ___ |
Thanks for the review.
I am not sure what you suggest exactly. Do you mean that we could use the existing AddressMap, and turn it into an array by sorting it by key, for example? |
yes, but we dont have to block merge on that. LGTM! Thanks @atollena for taking care of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change is LGTM. A bunch of minor comments on the tests though.
Thanks @atollena for handling all the comments and for your continued contributions. |
See #7363 for a problem description. Added
test cases reproduce the error without the fix, but not reliably, since the ring
may be constructed in a way where we don't get stuck. However running the test
multiple times definitely end up triggering the problem.
The solution is to keep a separate list of available addresses, and when there are no
picks and we trigger connection attempts, try them one at a time.
Fixes #7363.
RELEASE NOTES: