Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race between Agent.Run() and Agent.Stop() #1625

Merged
merged 2 commits into from
Jun 26, 2019
Merged

Fix data race between Agent.Run() and Agent.Stop() #1625

merged 2 commits into from
Jun 26, 2019

Conversation

tigrannajaryan
Copy link
Contributor

Which problem is this PR solving?

We had a data race with Agent.Stop() failing if called immediately
after Agent.Run() returns.

Short description of the changes

The root cause of the race was that Stop() called WaitGroup.Wait()
before Run() called WaitGroup.Add() for the first time which is
prohibited: https://golang.org/pkg/sync/#WaitGroup.Add

This change moves WaitGroup.Add() to earlier stage which guarantees
that WaitGroup.Wait() will be called after that.

Github issue: #1624

Testing done:

Added an automated test which was failing before the bug was fixed
and does not fail any more after the fix.

Also verified that make test passes.

Signed-off-by: Tigran Najaryan tigran@najaryan.net

@@ -22,7 +22,7 @@ import (
"github.com/uber/jaeger-lib/metrics"
"go.uber.org/zap"

"github.com/jaegertracing/jaeger/cmd/agent/app/customtransports"
customtransport "github.com/jaegertracing/jaeger/cmd/agent/app/customtransports"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is needed otherwise line 66 will not compile. I am not sure how it was compiling before. I may be missing something.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Rename the directory the the package is in or the package itself to match the directory name. Right now the package is named customtransport, which the directory is customertransports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be the right approach, otherwise it breaks a common naming practice with no apparent good reason.

However, I am reluctant to make this change. If any other external code (not in this repo) imports this package renaming it will break that external code. I'd rather let this repo's maintainers decide if that's what they want to do.

I am reverting this change since apparently it compiles fine the way it was, not sure why I thought it doesn't.

@codecov
Copy link

codecov bot commented Jun 21, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@b18386f). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1625   +/-   ##
=========================================
  Coverage          ?   98.74%           
=========================================
  Files             ?      191           
  Lines             ?     9182           
  Branches          ?        0           
=========================================
  Hits              ?     9067           
  Misses            ?       89           
  Partials          ?       26
Impacted Files Coverage Δ
cmd/agent/app/processors/thrift_processor.go 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b18386f...b1894be. Read the comment docs.

We had a data race with Agent.Stop() failing if called immediately
after Agent.Run() returns.

The root cause of the race was that Stop() called WaitGroup.Wait()
before Run() called WaitGroup.Add() for the first time which is
prohibited: https://golang.org/pkg/sync/#WaitGroup.Add

This change moves WaitGroup.Add() to earlier stage which guarantees
that WaitGroup.Wait() will be called after that.

Github issue: #1624

Testing done:

Added automated tests which was failing before the bug was fixed
and does not fail any more after the fix.

Also verified that `make test` passes.

Signed-off-by: Tigran Najaryan <tigran@najaryan.net>
@@ -147,6 +147,47 @@ func withRunningAgent(t *testing.T, testcase func(string, chan error)) {
t.Fatal("Expecting server exit log")
}

func TestStartStopRace(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't reproduce this on my mac open-telemetry/opentelemetry-collector#43

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think wg.Wait() doesn't block if count=0, not sure if i missed something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't reproduce this on my mac open-telemetry/opentelemetry-collector#43

Yes, unfortunately it is a non-deterministic race, no guarantee that it will be reproducible. The test may catch the problem and it does sometimes when running on my machine.

I think wg.Wait() doesn't block if count=0, not sure if i missed something

The documentation of WaitGroup.Add() says:

Note that calls with a positive delta that occur when the counter is zero must happen before a Wait.

This means it is not valid to call Wait with zero counter before Add is called. However this is precisely what happens in our code.

You are right that in production mode Wait() does not block when count=0 however it is considered a mistake and go test race detector catches this mistake, see this comment https://github.com/golang/go/blob/master/src/sync/waitgroup.go#L124-L127

@@ -76,12 +76,12 @@ func NewThriftProcessor(
numProcessors: numProcessors,
}
metrics.Init(&res.metrics, mFactory, nil)
res.processing.Add(res.numProcessors)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a hunch that deadlock might've been in func processBuffer due to not executing s.processing.Done(), maybe it should be defer s.processing.Done()?

Copy link
Contributor

@jan25 jan25 Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, and thats not the case unless Serve() was called

Copy link
Contributor Author

@tigrannajaryan tigrannajaryan Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing a deadlock, I am observing a data race. Here is the output of go test -v -race github.com/jaegertracing/jaeger/cmd/agent/app without my fix applied (but with my new test included):

==================
WARNING: DATA RACE
Write at 0x00c000346118 by goroutine 19:
  sync.(*WaitGroup).Wait()
      /usr/local/go/src/internal/race/race.go:41 +0xef
  github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).Stop()
      /home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:102 +0x11d

Previous read at 0x00c000346118 by goroutine 18:
  sync.(*WaitGroup).Add()
      /usr/local/go/src/internal/race/race.go:37 +0x169
  github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).Serve()
      /home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:84 +0x62

Goroutine 19 (running) created at:
  github.com/jaegertracing/jaeger/cmd/agent/app.(*Agent).Stop()
      /home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent.go:88 +0xa1
  github.com/jaegertracing/jaeger/cmd/agent/app.TestStartStopRace()
      /home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent_test.go:180 +0x4ca
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:865 +0x163

Goroutine 18 (running) created at:
  github.com/jaegertracing/jaeger/cmd/agent/app.(*Agent).Run()
      /home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent.go:75 +0x2bf
  github.com/jaegertracing/jaeger/cmd/agent/app.TestStartStopRace()
      /home/tigran/go/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent_test.go:176 +0x442
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:865 +0x163
==================
--- FAIL: TestStartStopRace (0.01s)
    testing.go:809: race detected during execution of test

It shows a race between sync.(*WaitGroup).Wait() and sync.(*WaitGroup).Add() which is prohibited by WaitGroup documentation. The documentation says Add() must be called before Wait().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a hunch that deadlock might've been in func processBuffer due to not executing s.processing.Done(), maybe it should be defer s.processing.Done()?

I only see one exit path in processBuffer(), so s.processing.Done() is guaranteed be called when the function returns (I am ignoring the possible panic case, since that is not happening in this particular case).

I agree with you it would be more idiomatic to use defer s.processing.Done() which would make it more resilient if in the future additional exit path is added to processBuffer(), but I do not think in this particular case it cases any problems.

@@ -76,12 +76,12 @@ func NewThriftProcessor(
numProcessors: numProcessors,
}
metrics.Init(&res.metrics, mFactory, nil)
res.processing.Add(res.numProcessors)
Copy link

@freeformz freeformz Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion....

res.processing.Add(res.numProcessors)
go func() {
  s.processBuffer()
  s.processing.Done()
}()

Why? It consolidates Add() and Done() close together.

the processBuffer goroutines will immediately block as there is nothing in the server's data channel.

This also nicely removes the ability to accidentally spin up multiple sets of processors for the same ThriftProcessor value that aren't tracked in the waitgroup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the suggestion, it makes WaitGroup operation more visible. It needs to be slightly modified to account for numProcessors:

	res.processing.Add(res.numProcessors)
	for i := 0; i < res.numProcessors; i++ {
		go func() {
			res.processBuffer()
			res.processing.Done()
		}()
	}

I'll push an update to this PR.

@tigrannajaryan
Copy link
Contributor Author

Reviewers, please check again.

Signed-off-by: Tigran Najaryan <tigran@najaryan.net>
@tigrannajaryan
Copy link
Contributor Author

@yurishkuro @pavolloffay can you review this, please?

@pavolloffay
Copy link
Member

thanks @tigrannajaryan @freeformz !

@pavolloffay pavolloffay merged commit 2adc99c into jaegertracing:master Jun 26, 2019
tigrannajaryan pushed a commit to tigrannajaryan/opentelemetry-collector that referenced this pull request Jul 3, 2019
Added a Sleep to work around a data race bug in Jaeger
(jaegertracing/jaeger#1625) caused
by stopping immediately after starting.

Without this Sleep we were observing this bug on our side:
open-telemetry#43
The Sleep ensures that Jaeger Start() is fully completed before
we call Jaeger Stop().

TODO: Jaeger bug is already fixed, remove this once we update Jaeger
to latest version.

Testing done: make test
songy23 pushed a commit to open-telemetry/opentelemetry-collector that referenced this pull request Jul 3, 2019
Added a Sleep to work around a data race bug in Jaeger
(jaegertracing/jaeger#1625) caused
by stopping immediately after starting.

Without this Sleep we were observing this bug on our side:
#43
The Sleep ensures that Jaeger Start() is fully completed before
we call Jaeger Stop().

TODO: Jaeger bug is already fixed, remove this once we update Jaeger
to latest version.

Testing done: make test
Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grouping Add() and Done() closer is nice, but doesn't have a functional impact.

How does moving the go-routines from Serve() to constructor solve the data race?

@tigrannajaryan
Copy link
Contributor Author

@yurishkuro the race was happening when Stop() was called immediately after Run(). It was happening because when Run() was returning there was no guarantee that res.processing.Add() would be already called. So a subsequent Stop() call would try to call s.processing.Wait() while res.processing.Add() was still in progress. This is incorrect and is not allowed by WaitGroup implementation.

This change guarantees that res.processing.Add() is finished when Run() returns. Thus when Stop() is called after Run() it is guaranteed that s.processing.Wait() will never race with res.processing.Add(). This fixes the race between Add() and Wait() of WaitGroup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants