-
Notifications
You must be signed in to change notification settings - Fork 4.5k
xdsclient: delay resource cache deletion to handle immediate re-subscription of same resource #8369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
xdsclient: delay resource cache deletion to handle immediate re-subscription of same resource #8369
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8369 +/- ##
==========================================
- Coverage 82.43% 82.27% -0.16%
==========================================
Files 413 413
Lines 40430 40497 +67
==========================================
- Hits 33328 33320 -8
- Misses 5751 5799 +48
- Partials 1351 1378 +27
🚀 New features to boost your workflow:
|
d361cc6
to
e70fea2
Compare
99d6a11
to
2e6ba1b
Compare
2e6ba1b
to
a06eb28
Compare
Thanks for the fix pull request! I have been running some tests and experiments across multiple trials to verify this race is fixed, it should be done soon. |
Status update: The initial end-to-end experiment (that was able to reproduce #8125) revealed a potential separate new issue but the experiment results could not confirm/deny this is fixed. Working on improving the experiment and rerun to confirm the race condition would be fixed |
// Call the event handler to remove unsubscribed cache entries. | ||
s.eventHandler.onRequiredToRemoveUnsubscribedCacheEntries(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this happen only if the call to Send
below is successful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the c++ logic here which does this right before sending the request. I think because we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.
I think this is good information that is worth capturing in a comment. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Get the resource types that this specific ADS stream was handling | ||
// before stopping it. | ||
xc.ads.mu.Lock() | ||
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState)) | ||
for typ := range xc.ads.resourceTypeState { | ||
typesHandledByStream = append(typesHandledByStream, typ) | ||
} | ||
xc.ads.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need all this cleanup code in close
?
An xdsChannel
is closed only after all authorities have released their reference to the channel, and authority releases its reference to a channel only when it has no more watches. So, shouldn't the authority's cache be empty at that point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another rare case of having unsubscriptions for which we have not yet actually sent unsubscribe messages, and now we never will, so we do a pass to delete any cache entries for which we've unsubscribed. Such a scenario can happen if unsubscribe happens while stream is broken and then xdsclient is closed. In the case when garbage collection is done via onRequest, the xdschannel.Close() part will be no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO. Although we can just move the logic to a helper in adsStreamImpl?
rdsName1 := "test-listener-resource2" | ||
rdsName2 := "test-route-configuration-resource2" | ||
listenerResource1 := e2e.DefaultClientListener(ldsResourceName1, rdsName1) | ||
listenerResource2 := e2e.DefaultClientListener(ldsResourceName2, rdsName2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need the second resource? What role does it play?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the second resource (ldsResource2
) is needed because it allows the test to keep the stream-level flow control active (adsStreamImpl.fc.pending == true
) independently of the lifecycle of ldsResource1
. This ensures that when ldsResource1's last watcher is cancelled (triggering an actual unsubscription request to the server), that request gets buffered.
To generate unsubscription request for ldsResource1
, its last watcher (blockingWatcherR1
) must be cancelled. Now, to ensure this unsubscription request is buffered by adsStreamImpl
(due to adsStreamImpl.fc.pending == true
), another watcher (to a resource other than ldsResource1
) from the same server response batch must still be "pending". Note that the management server sends an update containing both listenerResource1
and listenerResource2
. The role of blockingWatcherR2
for ldsResourceName2
is to hold on to onDoneR2
to keep adsStreamImpl.fc.pending
true. So, when cancelR1() is called:
blockingWatcherR1
is removed. It's the last watcher forldsResourceName1
.- The
authority
tellsxdsChannel
to unsubscribeldsResourceName1
. xdsChannel
tellsadsStreamImpl
to unsubscribeldsResourceName1
.adsStreamImpl
seesfc.pending
is true (becauseonDoneR2
is still held) and buffers the unsubscription request for ldsResource1.- When
ldsResourceName1
is resubscribed by a new watcher, the test verifies that new watcher gets the resource from cache because unsubscription request is not yet sent. Without the fix the new watcher would have timed out waiting forldsResource1
.
If we had only ldsResource1
, the unsubscription request (from cancelR1()
) might get sent out immediately after onDoneR1 is called if it hasn't been processed by the authority's serializer yet, rather than being reliably buffered for the test to verify the resource being served from cache.
990cb99
to
4a0f4e6
Compare
a.logger.Infof("Resubscribing to resource of type %q and name %q", typ.TypeName, name) | ||
} | ||
xc.channel.subscribe(typ, name) | ||
if len(state.watchers) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should invert this condition and continue
if the condition is not met. That will ensure that the interesting piece of functionality is less indented. Please see: go/go-style/decisions#indent-error-flow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Get the resource types that this specific ADS stream was handling | ||
// before stopping it. | ||
xc.ads.mu.Lock() | ||
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState)) | ||
for typ := range xc.ads.resourceTypeState { | ||
typesHandledByStream = append(typesHandledByStream, typ) | ||
} | ||
xc.ads.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.
1506c39
to
7f9e915
Compare
@danielzhaotongliu the test in this PR confirms that the resource is served from cache until an attempt to send unsubscription request is made. Without this fix, the watcher will timeout in case of subscribing a resource who has no watchers but hasn't yet sent unsubscribe request. Its similar to the change that was made in c++. I have imported the change to a fresh cl/774823957 over latest. Could you run the tests on that? Even if we can't repro the race or if there are other reasons/issues, we probably should handle in separate issue. We still would like to merge this fix because logically it fixes the race of unsubscribe/subscribe from the xdsclient point of view as described in #8125. |
LGTM. |
Fixes: #8125
RELEASE NOTES: