Skip to content
This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

Fix data races in fetcher tests #1121

Closed
wants to merge 3 commits into from
Closed

Fix data races in fetcher tests #1121

wants to merge 3 commits into from

Conversation

janos
Copy link
Member

@janos janos commented Jan 15, 2019

This PR addresses #1109 issue and fixes another set of data races.

This PR introduces no changes to function signatures, but adds one if statement to Fetcher.run with insignificant performance degradation on production code.

Fetcher.run function is controlled by the provided Context, but there is no synchronization between the termination of Fetcher.run goroutine and test termination. Some of the fetcher tests change the global variable searchTimeout and Fetcher.run reads it. If Fetcher.run goroutine is not terminated before serachTimeout is set back to the original value (with defer), or another tests starts, there is a race between reading and writing serachTimeout.

The first race is described in #1109, and the second type of race is go test -race -count 1 -v github.com/ethereum/go-ethereum/swarm/network

=== RUN   TestDiscovery
--- PASS: TestDiscovery (0.02s)
=== RUN   TestFetcherSingleRequest
--- PASS: TestFetcherSingleRequest (0.10s)
=== RUN   TestFetcherCancelStopsFetcher
--- PASS: TestFetcherCancelStopsFetcher (0.21s)
=== RUN   TestFetcherCancelStopsRequest
--- PASS: TestFetcherCancelStopsRequest (0.31s)
=== RUN   TestFetcherOfferUsesSource
--- PASS: TestFetcherOfferUsesSource (0.60s)
=== RUN   TestFetcherOfferAfterRequestUsesSourceFromContext
--- PASS: TestFetcherOfferAfterRequestUsesSourceFromContext (0.21s)
=== RUN   TestFetcherRetryOnTimeout
==================
WARNING: DATA RACE
Write at 0x000005136cf8 by goroutine 102:
  github.com/ethereum/go-ethereum/swarm/network.TestFetcherRetryOnTimeout()
      /Users/janos/go/src/github.com/ethereum/go-ethereum/swarm/network/fetcher_test.go:294 +0x3f7
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:827 +0x162

Previous read at 0x000005136cf8 by goroutine 92:
  github.com/ethereum/go-ethereum/swarm/network.(*Fetcher).run()
      /Users/janos/go/src/github.com/ethereum/go-ethereum/swarm/network/fetcher.go:239 +0x4d5

Goroutine 102 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:878 +0x659
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1119 +0xa8
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:827 +0x162
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1117 +0x4ee
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1034 +0x2ee
  main.main()
      _testmain.go:106 +0x221

Goroutine 92 (finished) created at:
  github.com/ethereum/go-ethereum/swarm/network.TestFetcherCancelStopsRequest()
      /Users/janos/go/src/github.com/ethereum/go-ethereum/swarm/network/fetcher_test.go:153 +0x4b8
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:827 +0x162
==================
--- FAIL: TestFetcherRetryOnTimeout (0.51s)
    testing.go:771: race detected during execution of test

where two goroutines from two different tests are racing on the same global value.

Synchronization is achieved by providing a test hook function that waits for a channel to be closed when Fetcher.run goroutine is done. The right order of defer functions in Fetcher tests is important to avoid races on changing searchTimeout and testHookFetcherRun. Function testHookFetcherRun is added to avoid changes to FetcherFactory.New function signature which calls Fetcher.runs in a goroutine in some of the tests.

With Fetcher.run goroutine synchronization, a deadlock in mockRequester.doRequest function was detected when context is canceled and nothing is receiving on requestC channel.

All races and problems handled in this PR are not affecting production code, just tests.

All Fetcher tests should finish without races:

go test -race -count 1 -v -run '^(TestFetcherSingleRequest|TestFetcherCancelStopsFetcher|TestFetcherCancelStopsRequest|TestFetcherOfferUsesSource|TestFetcherOfferAfterRequestUsesSourceFromContext|TestFetcherRetryOnTimeout|TestFetcherFactory|TestFetcherRequestQuitRetriesRequest|TestRequestSkipPeer|TestRequestSkipPeerExpired|TestRequestSkipPeerPermanent|TestFetcherMaxHopCount|TestSetTestHookFetcherRun)$' github.com/ethereum/go-ethereum/swarm/network

@frncmx frncmx self-assigned this Jan 16, 2019
Copy link
Contributor

@frncmx frncmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I missed that multiple tests running after each other can cause an issue, too. As the previous test's goroutine did not stop yet.

I think the test hooks introduce a lot of complexity. I wonder if that necessary. Have you considered using sync.WaitGroup? Something like:

	// set searchTimeOut to low value so the test is quicker
	originalSearchTimeout := searchTimeout
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	defer func() {
		cancel()
		wg.Wait()
		searchTimeout = originalSearchTimeout
	}()
	searchTimeout = 250 * time.Millisecond

	wg.Add(1)
	go func() {
		fetcher.run(ctx, peersToSkip)
		wg.Done()
	}()

That is still a lot of code, but I guess familiarity helps to understand, i.e., wait groups and context should be familiar to people working with Go. Also, the above would not require production code change.

That being said, there would 1 case remain where it's impossible to use the above concept. Namely TestFetcherFactory. I think FetcherFactory.New must be changed as it starts a goroutine but does not provide any way for clear termination (possibility to wait for shutdown).

@janos
Copy link
Member Author

janos commented Jan 16, 2019

@frncmx Thanks for the suggestion, but you already answered why I did not use wait group approach. There has to be a change in FetcherFactory.New which would accept a synchronization argument (wait group or a channel) which would actually only be used in test, not in production. I do not think that test hook is complex, it is even used in standard library, and there is also a test for it in this pr. My first attempt was to actually pass a channel fetcher.run, but wait group is also a nice trick. Would you prefer to change FetcherFactory.New to accept the wait group and remove the test hook?

@frncmx
Copy link
Contributor

frncmx commented Jan 16, 2019

I looked into what effort would it take to change the interface of FetcherFactory.New. Well. Not simple; not cheap. I'm not sure that would worth it.

Clearly:

  • Fetcher's shutdown is not clean. The program may exit sooner after a call to cancel() than the goroutine finishes. I'm not sure, but I think that may not cause problems in this case.
  • contextdoes not provide a way to wait for cancellation as it is. There are some articles criticizing that, some going as far as Context isn't for cancellation. However, in the original/official context blogpost they use basically quit/done channels to signal back.
  • I'm on the same side with you about not liking production code changes for test, or at least really minimize them.

That being said what would you think about adding a quit channel to Fetcher?
That would require a trivial change in the NewFetcher() and on type Fetcher. Then when the context is done we can close that channel and in TestFetcherFactory we could listen to the event.

In code;

diff --git a/swarm/network/fetcher.go b/swarm/network/fetcher.go
index 6aed57e22..80524b4e4 100644
--- a/swarm/network/fetcher.go
+++ b/swarm/network/fetcher.go
@@ -215,6 +215,7 @@ func (f *Fetcher) run(ctx context.Context, peers *sync.Map) {
                        // all Fetcher context closed, can quit
                case <-ctx.Done():
                        log.Trace("terminate fetcher", "request addr", f.addr)
+                       close(f.quitC)
                        // TODO: send cancelations to all peers left over in peers map (i.e., those we requested from)
                        return
                }

@frncmx
Copy link
Contributor

frncmx commented Jan 16, 2019

mockNetFetcher actually has the same approach.
swarm/storage/netstore_test.go:37

@frncmx
Copy link
Contributor

frncmx commented Jan 16, 2019

I still don't like the hook that much even though I seen something similar in standard library once and I admit you tested the hook. I don't like the hook because the usage of the hook doesn't feel clean and simple to me in the tests.

I actually liked your first approach better. Changing a private function signature for a test (even though I know that should not be the default approach). But then I realized that did not fix all the data races.

So, please bare with me. :)

@janos
Copy link
Member Author

janos commented Jan 16, 2019

@frncmx thanks for the explanations, they are very helpful. Yes, quit channel is cleaner approach, but we will have that field in fetcher that is actually not used in production, only in tests. If it is ok to have that overhead in production, I will change the implementation, but we would still need to change FetcherFactory.New function to be able to wait in tests until internal fetcher quit channel is closed. That can be done in a few ways, by passing the channel as the argument, having a callback, or extending NetFetcher interface. What would you prefer? Or maybe you would like to take over this PR, given the amount of work you invested in it?

@janos
Copy link
Member Author

janos commented Jan 17, 2019

Rejected in favour to ethereum/go-ethereum#18469.

@janos janos closed this Jan 17, 2019
@janos janos deleted the race-fetcher branch January 17, 2019 16:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants