Skip to content

xdsclient: delay resource cache deletion to handle immediate re-subscription of same resource #8369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented May 29, 2025

Fixes: #8125

RELEASE NOTES:

  • xdsclient: fixed an edge case race causing "resource doesn not exist" when rapidly subscribing and unsubscribing the same resource.

Copy link

codecov bot commented May 29, 2025

Codecov Report

Attention: Patch coverage is 71.95122% with 23 lines in your changes missing coverage. Please review.

Project coverage is 82.27%. Comparing base (bdbe6a2) to head (7f9e915).

Files with missing lines Patch % Lines
xds/internal/clients/xdsclient/authority.go 54.28% 10 Missing and 6 partials ⚠️
xds/internal/clients/xdsclient/channel.go 79.41% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8369      +/-   ##
==========================================
- Coverage   82.43%   82.27%   -0.16%     
==========================================
  Files         413      413              
  Lines       40430    40497      +67     
==========================================
- Hits        33328    33320       -8     
- Misses       5751     5799      +48     
- Partials     1351     1378      +27     
Files with missing lines Coverage Δ
xds/internal/clients/xdsclient/ads_stream.go 83.03% <100.00%> (+0.30%) ⬆️
xds/internal/clients/xdsclient/xdsclient.go 83.19% <100.00%> (-0.93%) ⬇️
xds/internal/clients/xdsclient/channel.go 77.83% <79.41%> (+0.33%) ⬆️
xds/internal/clients/xdsclient/authority.go 79.23% <54.28%> (-1.44%) ⬇️

... and 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch 3 times, most recently from d361cc6 to e70fea2 Compare June 2, 2025 13:25
@dfawley
Copy link
Member

dfawley commented Jun 2, 2025

cc @danielzhaotongliu

@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 99d6a11 to 2e6ba1b Compare June 3, 2025 15:12
@purnesh42H purnesh42H requested review from easwars and dfawley June 3, 2025 15:14
@purnesh42H purnesh42H added Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug and removed Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Jun 3, 2025
@purnesh42H purnesh42H added this to the 1.74 Release milestone Jun 3, 2025
@dfawley dfawley removed their assignment Jun 3, 2025
@dfawley dfawley removed their request for review June 3, 2025 15:50
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 2e6ba1b to a06eb28 Compare June 4, 2025 06:57
@danielzhaotongliu
Copy link
Contributor

Thanks for the fix pull request!

I have been running some tests and experiments across multiple trials to verify this race is fixed, it should be done soon.

@danielzhaotongliu
Copy link
Contributor

Status update:

The initial end-to-end experiment (that was able to reproduce #8125) revealed a potential separate new issue but the experiment results could not confirm/deny this is fixed. Working on improving the experiment and rerun to confirm the race condition would be fixed

Comment on lines 448 to 449
// Call the event handler to remove unsubscribed cache entries.
s.eventHandler.onRequiredToRemoveUnsubscribedCacheEntries(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this happen only if the call to Send below is successful?

Copy link
Contributor Author

@purnesh42H purnesh42H Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the c++ logic here which does this right before sending the request. I think because we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.

I think this is good information that is worth capturing in a comment. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 148 to 157
// Get the resource types that this specific ADS stream was handling
// before stopping it.
xc.ads.mu.Lock()
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState))
for typ := range xc.ads.resourceTypeState {
typesHandledByStream = append(typesHandledByStream, typ)
}
xc.ads.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need all this cleanup code in close?

An xdsChannel is closed only after all authorities have released their reference to the channel, and authority releases its reference to a channel only when it has no more watches. So, shouldn't the authority's cache be empty at that point?

Copy link
Contributor Author

@purnesh42H purnesh42H Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another rare case of having unsubscriptions for which we have not yet actually sent unsubscribe messages, and now we never will, so we do a pass to delete any cache entries for which we've unsubscribed. Such a scenario can happen if unsubscribe happens while stream is broken and then xdsclient is closed. In the case when garbage collection is done via onRequest, the xdschannel.Close() part will be no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO. Although we can just move the logic to a helper in adsStreamImpl?

rdsName1 := "test-listener-resource2"
rdsName2 := "test-route-configuration-resource2"
listenerResource1 := e2e.DefaultClientListener(ldsResourceName1, rdsName1)
listenerResource2 := e2e.DefaultClientListener(ldsResourceName2, rdsName2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need the second resource? What role does it play?

Copy link
Contributor Author

@purnesh42H purnesh42H Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the second resource (ldsResource2) is needed because it allows the test to keep the stream-level flow control active (adsStreamImpl.fc.pending == true) independently of the lifecycle of ldsResource1. This ensures that when ldsResource1's last watcher is cancelled (triggering an actual unsubscription request to the server), that request gets buffered.

To generate unsubscription request for ldsResource1, its last watcher (blockingWatcherR1) must be cancelled. Now, to ensure this unsubscription request is buffered by adsStreamImpl (due to adsStreamImpl.fc.pending == true), another watcher (to a resource other than ldsResource1) from the same server response batch must still be "pending". Note that the management server sends an update containing both listenerResource1 and listenerResource2. The role of blockingWatcherR2 for ldsResourceName2 is to hold on to onDoneR2 to keep adsStreamImpl.fc.pending true. So, when cancelR1() is called:

  • blockingWatcherR1 is removed. It's the last watcher for ldsResourceName1.
  • The authority tells xdsChannel to unsubscribe ldsResourceName1.
  • xdsChannel tells adsStreamImpl to unsubscribe ldsResourceName1.
  • adsStreamImpl sees fc.pending is true (because onDoneR2 is still held) and buffers the unsubscription request for ldsResource1.
  • When ldsResourceName1 is resubscribed by a new watcher, the test verifies that new watcher gets the resource from cache because unsubscription request is not yet sent. Without the fix the new watcher would have timed out waiting for ldsResource1.

If we had only ldsResource1, the unsubscription request (from cancelR1()) might get sent out immediately after onDoneR1 is called if it hasn't been processed by the authority's serializer yet, rather than being reliably buffered for the test to verify the resource being served from cache.

@easwars easwars assigned purnesh42H and unassigned easwars Jun 16, 2025
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch 3 times, most recently from 990cb99 to 4a0f4e6 Compare June 19, 2025 19:47
@purnesh42H purnesh42H requested a review from easwars June 19, 2025 19:51
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Jun 19, 2025
a.logger.Infof("Resubscribing to resource of type %q and name %q", typ.TypeName, name)
}
xc.channel.subscribe(typ, name)
if len(state.watchers) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should invert this condition and continue if the condition is not met. That will ensure that the interesting piece of functionality is less indented. Please see: go/go-style/decisions#indent-error-flow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 148 to 157
// Get the resource types that this specific ADS stream was handling
// before stopping it.
xc.ads.mu.Lock()
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState))
for typ := range xc.ads.resourceTypeState {
typesHandledByStream = append(typesHandledByStream, typ)
}
xc.ads.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.

@easwars easwars assigned purnesh42H and unassigned easwars Jun 23, 2025
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 1506c39 to 7f9e915 Compare June 23, 2025 17:58
@purnesh42H
Copy link
Contributor Author

purnesh42H commented Jun 23, 2025

Status update:

The initial end-to-end experiment (that was able to reproduce #8125) revealed a potential separate new issue but the experiment results could not confirm/deny this is fixed. Working on improving the experiment and rerun to confirm the race condition would be fixed

@danielzhaotongliu the test in this PR confirms that the resource is served from cache until an attempt to send unsubscription request is made. Without this fix, the watcher will timeout in case of subscribing a resource who has no watchers but hasn't yet sent unsubscribe request. Its similar to the change that was made in c++. I have imported the change to a fresh cl/774823957 over latest. Could you run the tests on that?

Even if we can't repro the race or if there are other reasons/issues, we probably should handle in separate issue. We still would like to merge this fix because logically it fixes the race of unsubscribe/subscribe from the xdsclient point of view as described in #8125.

@danielzhaotongliu
Copy link
Contributor

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xdsclient: race around resource subscriptions and unsubscsriptions
5 participants