xdsclient: fix flaky test TestServeAndCloseDoNotRace #7814

easwars · 2024-11-06T16:57:39Z

This PR addresses two issues:

A bug in the code:
- In authority.watchResource, the watch is handled by scheduling a callback on the serializer and waiting for the callback to be executed by blocking on a done channel that is closed when the callback has completed its work.
- If we failed to schedule the callback on the serializer, we were not closing the done channel. This meant that the authority.watchResource call would block forever.
- The same bug existed in authority.dumpResources as well.
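
The blocking pattern and its failure path can be sketched as follows. This is a minimal illustration, not the xDS client code: the `serializer` type, `scheduleOr`, and `watch` are hypothetical stand-ins, and the only point is that the done channel must also be closed when scheduling fails.

```go
package main

import (
	"fmt"
	"sync"
)

// serializer is a hypothetical stand-in for a callback serializer. It runs
// queued callbacks one at a time and refuses new ones once closed.
type serializer struct {
	mu     sync.Mutex
	closed bool
	queue  chan func()
}

func newSerializer() *serializer {
	s := &serializer{queue: make(chan func(), 16)}
	go func() {
		for f := range s.queue {
			f()
		}
	}()
	return s
}

// scheduleOr queues f, or runs onFailure inline if the serializer is closed.
func (s *serializer) scheduleOr(f, onFailure func()) {
	s.mu.Lock()
	if s.closed {
		s.mu.Unlock()
		onFailure()
		return
	}
	s.queue <- f
	s.mu.Unlock()
}

func (s *serializer) close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.queue)
	}
}

// watch schedules the work and then blocks on done. Closing done on the
// failure path is what keeps the caller from hanging when the serializer
// has already been closed.
func watch(s *serializer, name string) (registered bool) {
	done := make(chan struct{})
	s.scheduleOr(func() {
		defer close(done)
		registered = true // stand-in for the real watch registration
	}, func() {
		close(done) // without this, <-done below would block forever
	})
	<-done
	return registered
}

func main() {
	s := newSerializer()
	fmt.Println(watch(s, "listener-a")) // callback ran
	s.close()
	fmt.Println(watch(s, "listener-b")) // scheduling failed, call still returns
}
```

Running the failure callback inline keeps the caller's control flow simple: whichever path is taken, exactly one of the two functions closes the channel.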
A bug in the test:
- xdsclient.NewForTesting accepts a bootstrap configuration as one of its options and sets the fallback bootstrap configuration (i.e. the configuration that gets used when the bootstrap env vars are not set). As part of the cleanup function that is returned, the fallback bootstrap configuration is unset.
- In TestServeAndCloseDoNotRace, the following happens in a loop:
  - A new xDS-enabled gRPC server is created with the BootstrapContentsForTesting server option. This results in an xDS client being created (or refcount on an existing one being incremented) using xdsclient.NewForTesting with the provided bootstrap configuration.
  - A goroutine is invoked to call Serve.
  - A goroutine is invoked to call Stop. This results in the xDS client cleanup function being invoked.
  - This means that there could be a race between one iteration of the loop calling the cleanup function (and therefore erasing the fallback bootstrap configuration) and another iteration of the loop creating the server (and therefore writing the fallback bootstrap configuration).
- The fix involves not resetting the fallback bootstrap configuration from the cleanup function returned by xdsclient.NewForTesting, but instead leaving that up to the tests.
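
To make the interleaving concrete, here is a small sketch that replays one problematic ordering sequentially instead of actually racing goroutines. Everything in it is an assumption for illustration: `fallback`, `setFallback`, `unsetFallback`, `loadConfig`, and `replay` are hypothetical names, not the bootstrap package's API.

```go
package main

import (
	"errors"
	"fmt"
)

// fallback is a hypothetical package-level value standing in for the
// fallback bootstrap configuration.
var fallback []byte

func setFallback(cfg []byte) { fallback = cfg }
func unsetFallback()         { fallback = nil }

// loadConfig stands in for client creation reading the configuration.
func loadConfig() ([]byte, error) {
	if fallback == nil {
		return nil, errors.New("bootstrap configuration not set")
	}
	return fallback, nil
}

// replay walks through one bad ordering of two loop iterations, A and B.
// cleanupUnsets selects the old behavior (cleanup erases the fallback
// configuration) or the new one (cleanup leaves it alone).
func replay(cleanupUnsets bool) error {
	// Iteration B starts creating its server and writes the configuration.
	setFallback([]byte(`{"node": {"id": "test"}}`))
	// Iteration A's Stop runs its client cleanup at this moment.
	if cleanupUnsets {
		unsetFallback()
	}
	// Iteration B now reads the configuration to build its client.
	_, err := loadConfig()
	return err
}

func main() {
	fmt.Println("cleanup unsets config:", replay(true))
	fmt.Println("cleanup leaves config:", replay(false))
}
```

With the old behavior the read in iteration B fails; once the cleanup function stops touching the shared configuration, the same ordering succeeds.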

Ran the test 100K times on forge without a flake.

Fixes #7627

RELEASE NOTES: none

codecov · 2024-11-06T17:03:54Z

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.77%. Comparing base (18d218d) to head (902a545).
Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
xds/internal/xdsclient/authority.go	75.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7814      +/-   ##
==========================================
- Coverage   82.00%   81.77%   -0.23%     
==========================================
  Files         373      373              
  Lines       37735    37848     +113     
==========================================
+ Hits        30945    30951       +6     
- Misses       5512     5587      +75     
- Partials     1278     1310      +32

Files with missing lines	Coverage Δ
xds/internal/xdsclient/client_new.go	`85.54% <100.00%> (-0.35%)`	⬇️
xds/internal/xdsclient/authority.go	`75.32% <75.00%> (-7.18%)`	⬇️

... and 22 files with indirect coverage changes

xds/internal/xdsclient/authority.go

arjan-bal · 2024-11-06T19:34:00Z

xds/internal/xdsclient/client_new.go

@@ -152,8 +157,7 @@ func NewForTesting(opts OptionsForTesting) (XDSClient, func(), error) {
 	if err := bootstrap.SetFallbackBootstrapConfig(opts.Contents); err != nil {
 		return nil, nil, err
 	}
-	client, cancel, err := newRefCounted(opts.Name, opts.WatchExpiryTimeout, opts.IdleChannelExpiryTimeout, opts.StreamBackoffAfterFailure)
-	return client, func() { bootstrap.UnsetFallbackBootstrapConfigForTesting(); cancel() }, err
+	return newRefCounted(opts.Name, opts.WatchExpiryTimeout, opts.IdleChannelExpiryTimeout, opts.StreamBackoffAfterFailure)


I feel that it's better to have tests clean up their own mutations/side-effects. This would prevent issues in which tests fail only when run in a certain order. Since the BootstrapContentsForTesting ServerOption is public, adding a new param to it is also not preferable. I don't know the best way to do this 😕.

easwars · 2024-11-06

I had thought about making callers of NewForTesting ensure that the bootstrap configuration is set, either by setting one of the associated env vars or by calling SetFallbackBootstrapConfig. If we did that, we could remove the Contents field from OptionsForTesting and have the tests set up the bootstrap configuration and do the cleanup themselves. But the number of callsites for NewForTesting is enormous, so I didn't want to do that as part of this PR.

But I agree with your concern, and maybe I can do that as a follow-up PR so that we take care of this flake first, and then clean up the remaining tests.

What do you think?

zasweq · 2024-11-07

I have to agree with Arjan here. Ever since I joined the team 3.5 years ago, Doug has always stressed not setting any globals in tests that persist across test iterations and would couple tests in any way.

Any time I have done this, he has always made me rewrite the test to be hermetic and not coupled with other tests. I agree with this because of the ordering issue Arjan mentioned. I think go test guarantees that things are run serially, but I forget what the strict ordering requirements are. It's also weird to have one test's state affect another test's state.

But since this test is blocking development flow, I'm fine with merging this now and following up to address this issue.

arjan-bal · 2024-11-07

I'm fine with doing the cleanup in a later PR.

easwars · 2024-11-07

Ack. Thanks.

xdsclient: ensure that a watchResource call terminates if it is unable to schedule a callback to do the actual work

0c047ff

easwars added the Type: Bug label Nov 6, 2024

easwars added this to the 1.69 Release milestone Nov 6, 2024

arjan-bal self-requested a review November 6, 2024 17:05

unset fallback bootstrap config for another test

a4a3162

arjan-bal reviewed Nov 6, 2024

add more info to the log message

902a545

zasweq self-assigned this Nov 7, 2024

zasweq approved these changes Nov 7, 2024

zasweq assigned easwars and unassigned zasweq Nov 7, 2024

arjan-bal approved these changes Nov 7, 2024

easwars merged commit 5b40f07 into grpc:master Nov 7, 2024
15 checks passed

easwars mentioned this pull request Nov 7, 2024

xdsclient: make xdsclient.NewForTesting more hermetic #7821

Closed
