-
Notifications
You must be signed in to change notification settings - Fork 4.5k
balancergroup: Make closing terminal #8095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8095 +/- ##
==========================================
- Coverage 82.20% 82.19% -0.01%
==========================================
Files 387 387
Lines 38928 38944 +16
==========================================
+ Hits 31999 32011 +12
- Misses 5606 5608 +2
- Partials 1323 1325 +2
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that the BalancerGroup
type should handle multiple concurrent calls into it. That's the whole point of it having a couple of mutexes and there is no documentation stating that it is not safe for concurrent access (I agree that there is no documentation stating that it is safe for concurrent access either).
But if the race really is about bg.Remove
and bg.Close
calls at the same time leading to a child policy added to the timeout cache from bg.Remove
not being cleaned up by bg.Close
(which I can see the race happening), then we should instead fix the implementation in BalancerGroup
.
Also, I think there is scope to simplify the |
@easwars made |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov seems to have flagged a lot of places with missing tests. I think this was the case before this change as well. But now that we have those annotations, could you please file an issue to add tests (describing those cases), and maybe we can get some help from the TVCs?
Just FYI, the balancer group tests currently use the aggregator from the weightedtarget package. This is an artifact of history. The balancer group (when initially written) was tightly coupled with the priority balancer and the weighted target balancer. But we have separated them since, but haven't cleaned up these tests. This could also be a task for the TVCs. We might have to clearly define what needs to be done though.
@@ -305,11 +281,14 @@ func (bg *BalancerGroup) AddWithClientConn(id, balancerName string, cc balancer. | |||
// Store data in static map, and then check to see if bg is started. | |||
bg.outgoingMu.Lock() | |||
defer bg.outgoingMu.Unlock() | |||
if bg.outgoingClosed { | |||
return fmt.Errorf("BalancerGroup already closed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: There are currently only two places where the balancer group returns an error by itself. Can we add a "balancergroup" prefix to those two and this one, that way, we can also ensure that the error string is not capitalized. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the prefix and removed BalancerGroup
here.
// If caching is enabled, search in the cache. Otherwise, cache is | ||
// guaranteed to be empty, searching is unnecessary. Also, skip the cache | ||
// if caching is disabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment can actually be much simpler.
// Skip searching the cache if disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the comment.
Opened #8118 |
Fixes: #7783
Holding the mutex while making calls into balancergroup avoid concurrent calls into
balancergroup
which can result inbg.Close()
being called at the same time asbg.Remove()
leading to a child balancer being leaked. See #7783 (comment) for details about such a race.This PR makes
balancergroup
non-restartable, i.e. it can't be started again after callingClose
. To fix the race, inClose
, thebalancergroup
is marked as "closed" before releasing the mutex to clear the deletion cache. This ensures no further entries are added to the cache.Tested
Verified no failures in 100k runs.
RELEASE NOTES: