Skip to content

balancergroup: Make closing terminal #8095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 25, 2025

Conversation

arjan-bal
Copy link
Contributor

@arjan-bal arjan-bal commented Feb 17, 2025

Fixes: #7783

Holding the mutex while making calls into balancergroup avoid concurrent calls into balancergroup which can result in bg.Close() being called at the same time as bg.Remove() leading to a child balancer being leaked. See #7783 (comment) for details about such a race.

This PR makes balancergroup non-restartable, i.e. it can't be started again after calling Close. To fix the race, in Close, the balancergroup is marked as "closed" before releasing the mutex to clear the deletion cache. This ensures no further entries are added to the cache.

Tested

Verified no failures in 100k runs.

RELEASE NOTES:

  • priority: Fix race that could leak balancers and goroutines during shutdown.

@arjan-bal arjan-bal added this to the 1.71 Release milestone Feb 17, 2025
@arjan-bal arjan-bal requested a review from easwars February 17, 2025 04:02
Copy link

codecov bot commented Feb 17, 2025

Codecov Report

Attention: Patch coverage is 54.54545% with 25 lines in your changes missing coverage. Please review.

Project coverage is 82.19%. Comparing base (fabe274) to head (7e4e1bb).
Report is 16 commits behind head on master.

Files with missing lines Patch % Lines
internal/balancergroup/balancergroup.go 54.54% 18 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8095      +/-   ##
==========================================
- Coverage   82.20%   82.19%   -0.01%     
==========================================
  Files         387      387              
  Lines       38928    38944      +16     
==========================================
+ Hits        31999    32011      +12     
- Misses       5606     5608       +2     
- Partials     1323     1325       +2     
Files with missing lines Coverage Δ
balancer/rls/balancer.go 85.26% <ø> (-0.04%) ⬇️
balancer/weightedtarget/weightedtarget.go 89.47% <ø> (-2.20%) ⬇️
...internal/balancer/clustermanager/clustermanager.go 75.22% <ø> (-0.23%) ⬇️
xds/internal/balancer/priority/balancer.go 88.35% <ø> (-0.08%) ⬇️
internal/balancergroup/balancergroup.go 76.05% <54.54%> (-7.45%) ⬇️

... and 32 files with indirect coverage changes

Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that the BalancerGroup type should handle multiple concurrent calls into it. That's the whole point of it having a couple of mutexes and there is no documentation stating that it is not safe for concurrent access (I agree that there is no documentation stating that it is safe for concurrent access either).

But if the race really is about bg.Remove and bg.Close calls at the same time leading to a child policy added to the timeout cache from bg.Remove not being cleaned up by bg.Close (which I can see the race happening), then we should instead fix the implementation in BalancerGroup.

@easwars
Copy link
Contributor

easwars commented Feb 18, 2025

Also, I think there is scope to simplify the BalancerGroup type. When it was written, one of its goals was to be able to stop and re-start at any point. We don't have that requirement anymore. In fact, I see that in all places where a balancer group is created by calling balancergroup.New(), it is immediately followed by a call to bg.Start.

@easwars easwars assigned arjan-bal and unassigned easwars Feb 18, 2025
@arjan-bal arjan-bal modified the milestones: 1.71 Release, 1.72 Release Feb 19, 2025
@arjan-bal arjan-bal changed the title priority: Lock mu while calling into balancergroup balancergroup: Make closing terminal Feb 24, 2025
@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Feb 24, 2025
@arjan-bal
Copy link
Contributor Author

@easwars made balancergroup non restartable. Also reverted the fixes in priority and made balancergroup capable of handling the race.

Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codecov seems to have flagged a lot of places with missing tests. I think this was the case before this change as well. But now that we have those annotations, could you please file an issue to add tests (describing those cases), and maybe we can get some help from the TVCs?

Just FYI, the balancer group tests currently use the aggregator from the weightedtarget package. This is an artifact of history. The balancer group (when initially written) was tightly coupled with the priority balancer and the weighted target balancer. But we have separated them since, but haven't cleaned up these tests. This could also be a task for the TVCs. We might have to clearly define what needs to be done though.

@@ -305,11 +281,14 @@ func (bg *BalancerGroup) AddWithClientConn(id, balancerName string, cc balancer.
// Store data in static map, and then check to see if bg is started.
bg.outgoingMu.Lock()
defer bg.outgoingMu.Unlock()
if bg.outgoingClosed {
return fmt.Errorf("BalancerGroup already closed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: There are currently only two places where the balancer group returns an error by itself. Can we add a "balancergroup" prefix to those two and this one, that way, we can also ensure that the error string is not capitalized. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the prefix and removed BalancerGroup here.

Comment on lines 288 to 290
// If caching is enabled, search in the cache. Otherwise, cache is
// guaranteed to be empty, searching is unnecessary. Also, skip the cache
// if caching is disabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment can actually be much simpler.

// Skip searching the cache if disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the comment.

@easwars easwars assigned arjan-bal and unassigned easwars Feb 24, 2025
@arjan-bal
Copy link
Contributor Author

Codecov seems to have flagged a lot of places with missing tests. I think this was the case before this change as well. But now that we have those annotations, could you please file an issue to add tests (describing those cases), and maybe we can get some help from the TVCs?

Just FYI, the balancer group tests currently use the aggregator from the weightedtarget package. This is an artifact of history. The balancer group (when initially written) was tightly coupled with the priority balancer and the weighted target balancer. But we have separated them since, but haven't cleaned up these tests. This could also be a task for the TVCs. We might have to clearly define what needs to be done though.

Opened #8118

@arjan-bal arjan-bal merged commit aa629e0 into grpc:master Feb 25, 2025
15 checks passed
@arjan-bal arjan-bal deleted the fix-sync-in-priority branch February 25, 2025 05:59
purnesh42H pushed a commit to purnesh42H/grpc-go that referenced this pull request Feb 25, 2025
janardhanvissa pushed a commit to janardhanvissa/grpc-go that referenced this pull request Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky Test: RingHash_SwitchToLowerPriorityAndThenBack
2 participants