Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xdsclient: fix flaky test TestServeAndCloseDoNotRace #7814

Merged
merged 3 commits into from
Nov 7, 2024

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Nov 6, 2024

This PR addresses two issues:

  • A bug in the code:
    • In authority.watchResource, the watch is handled by scheduling a callback on the serializer and waiting for the callback to be executed by blocking on a done channel that is closed when the callback has completed its work.
    • In the event where we fail to schedule the callback on the serializer, we were not closing the done channel. This meant that the authority.watchResource call would block forever.
    • The same bug existed in autority.dumpResources as well.
  • A bug in the test:
    • xdsclient.NewForTesting accepts a bootstrap configuration as one of its options and sets the fallback bootstrap configuration (i.e. the configuration that gets used when the bootstrap env vars are not set). As part of the cleanup function that is returned, the fallback bootstrap configuration is unset.
    • In TestServeAndCloseDoNotRace, the following happens in a loop:
      • A new xDS-enabled gRPC server is created with the BootstrapContentsForTesting server option. This results in an xDS client being created (or refcount on an existing one being incremented) using xdsclient.NewForTesting with the provided bootstrap configuration.
      • A goroutine is invoked to call Serve.
      • A goroutine is invoked to call Stop. This results in the xDS client cleanup function being invoked.
      • This means that there could be a race between one iteration of the loop calling the cleanup function (and therefore erasing the fallback boostrap configuration) and one iteration of the loop creating the server (and therefore writing the fallback bootstrap configuration).
    • The fix involves not resetting the fallback bootstrap configuration from the cleanup function returned by xdsclient.NewForTesting, but instead leaving that up to the tests.

Ran on forge for 100K times without a flake.

Fixes #7627

RELEASE NOTES: none

…e to schedule a callback to do the actual work
@easwars easwars added this to the 1.69 Release milestone Nov 6, 2024
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.77%. Comparing base (18d218d) to head (902a545).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/xdsclient/authority.go 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7814      +/-   ##
==========================================
- Coverage   82.00%   81.77%   -0.23%     
==========================================
  Files         373      373              
  Lines       37735    37848     +113     
==========================================
+ Hits        30945    30951       +6     
- Misses       5512     5587      +75     
- Partials     1278     1310      +32     
Files with missing lines Coverage Δ
xds/internal/xdsclient/client_new.go 85.54% <100.00%> (-0.35%) ⬇️
xds/internal/xdsclient/authority.go 75.32% <75.00%> (-7.18%) ⬇️

... and 22 files with indirect coverage changes

@arjan-bal arjan-bal self-requested a review November 6, 2024 17:05
xds/internal/xdsclient/authority.go Outdated Show resolved Hide resolved
@@ -152,8 +157,7 @@ func NewForTesting(opts OptionsForTesting) (XDSClient, func(), error) {
if err := bootstrap.SetFallbackBootstrapConfig(opts.Contents); err != nil {
return nil, nil, err
}
client, cancel, err := newRefCounted(opts.Name, opts.WatchExpiryTimeout, opts.IdleChannelExpiryTimeout, opts.StreamBackoffAfterFailure)
return client, func() { bootstrap.UnsetFallbackBootstrapConfigForTesting(); cancel() }, err
return newRefCounted(opts.Name, opts.WatchExpiryTimeout, opts.IdleChannelExpiryTimeout, opts.StreamBackoffAfterFailure)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that it's better to have tests clean up their own mutations/side-effects. This would prevent issues in which test fail only when run in a certain order. Since the BootstrapContentsForTesting ServerOption is public, adding a new param to it is also not preferable. I don't know the best way to do this 😕.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought about making callers of NewForTesting ensure that bootstrap configuration is set either by setting one of the associated env vars or by calling SetFallbackBootstrapConfig. If we did that, we can remove the Contents field from OptionsForTesting and have the tests setup the bootstrap configuration and do the cleanup themselves. But the number of callsites for NewForTesting is enormous. So, I didn't want to do that as part of this PR.

But I agree with your concern, and maybe I can do that as a follow-up PR so that we take care of this flake first, and then cleanup the remaining tests.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to agree with Arjan here. Ever since I joined the team 3.5 years ago, Doug has always stressed to not set any globals amongst tests that persist over test iterations that would couple tests in any way.

Any time I have done this, he has always made me rewrite the test in order to be hermetic and not coupled with other tests. I agree with this because of the ordering thing Arjan mentioned, I think go test guarantees things are run serially but I forgot what the strict ordering requirements are, and also it's weird to have one test state affect another test state.

But since test is blocking development flow I'm fine merging this now and following up on it to address this issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with doing the cleanup in a later PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Thanks.

@zasweq zasweq self-assigned this Nov 7, 2024
@zasweq zasweq assigned easwars and unassigned zasweq Nov 7, 2024
@easwars easwars merged commit 5b40f07 into grpc:master Nov 7, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test: TestServeAndCloseDoNotRace
3 participants