
Fix data race in drain scheduler #86

Merged 2 commits on Aug 21, 2020

Conversation

@cezarsa (Contributor) commented Aug 20, 2020

This commit fixes a data race when calling `HasSchedule`. The race happens because the function passed to `time.AfterFunc` runs in a new goroutine and updates fields which may be read in parallel by `HasSchedule`.

A new test was added to trigger the data race more easily; however, the race also happens during normal draino operation, because the event handler may call `HasSchedule` multiple times for an already scheduled node.

The `running` return value of `HasSchedule` was removed to simplify the code, as it wasn't used anywhere.
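
The shape of the fix is an atomic flag on the per-node schedule record. A minimal sketch, with type and field names assumed for illustration rather than taken from the actual drainSchedule.go:

```go
package kubernetes

import (
	"sync/atomic"
	"time"
)

// schedule is a hypothetical per-node record. The callback passed to
// time.AfterFunc runs in its own goroutine, so failed must not be a
// plain bool written by the timer while HasSchedule reads it.
type schedule struct {
	when   time.Time
	failed int32 // 0 = not failed, 1 = failed; accessed atomically
}

// setFailed is called from the timer goroutine when the drain errors.
func (s *schedule) setFailed() { atomic.StoreInt32(&s.failed, 1) }

// isFailed can safely be called concurrently, e.g. from HasSchedule.
func (s *schedule) isFailed() bool { return atomic.LoadInt32(&s.failed) == 1 }
```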

Data race sample:

```
$ go test -race
Now: 2020-08-20T11:00:04-03:00
==================
WARNING: DATA RACE
Write at 0x00c00026c098 by goroutine 15:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).newSchedule.func1()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:125 +0x235

Previous read at 0x00c00026c098 by goroutine 12:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).HasSchedule()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:59 +0x142
  github.com/planetlabs/draino/internal/kubernetes.TestDrainSchedules_Schedule_Polling()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule_test.go:106 +0x54a
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb

Goroutine 15 (running) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.14.5/libexec/src/time/sleep.go:168 +0x51

Goroutine 12 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:66 +0x223
==================
==================
WARNING: DATA RACE
Write at 0x00c00026c099 by goroutine 15:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).newSchedule.func1()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:131 +0x65b

Previous read at 0x00c00026c099 by goroutine 12:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).HasSchedule()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:59 +0x15f
  github.com/planetlabs/draino/internal/kubernetes.TestDrainSchedules_Schedule_Polling()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule_test.go:106 +0x54a
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb

Goroutine 15 (running) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.14.5/libexec/src/time/sleep.go:168 +0x51

Goroutine 12 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:66 +0x223
==================
--- FAIL: TestDrainSchedules_Schedule_Polling (11.04s)
    testing.go:906: race detected during execution of test
FAIL
exit status 1
FAIL	github.com/planetlabs/draino/internal/kubernetes	13.921s
```

@codecov bot commented Aug 20, 2020

Codecov Report

Merging #86 into master will increase coverage by 4.07%. The diff coverage is 90.00%.

```
@@            Coverage Diff             @@
##           master      #86      +/-   ##
==========================================
+ Coverage   69.84%   73.92%   +4.07%
==========================================
  Files           7        7
  Lines         461      464       +3
==========================================
+ Hits          322      343      +21
+ Misses        130      113      -17
+ Partials        9        8       -1

Impacted Files                        Coverage Δ
internal/kubernetes/eventhandler.go   43.75% <0.00%> (ø)
internal/kubernetes/drainSchedule.go  69.66% <100.00%> (+21.98%) ⬆️
```

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 14fd977...49918a3.

```go
if !hasSchedule {
	t.Fatalf("Missing schedule record for node %v", node.Name)
}
if failed {
```
Contributor:
So, we want `failed` when the test passes (counter-intuitive on first read) because we're inducing an error with the `failDrainer`. We do this because we specifically want to exercise the atomic `setFailed`.
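
For readers following the thread, a drainer that always errors might look like this; a minimal sketch, with the method signature and names assumed for illustration rather than taken from draino's actual test helper:

```go
package kubernetes

import "errors"

// failDrainer always returns an error from Drain, forcing the schedule's
// timer callback onto its failure path (the atomic setFailed discussed
// above), which is exactly the write that races with HasSchedule.
type failDrainer struct{}

func (d *failDrainer) Drain(nodeName string) error {
	return errors.New("induced drain failure")
}
```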

Contributor:
I'd suggest just adding a brief comment before the break statement.

@cezarsa (Author):

Yes, the main reason for adding a test case here is to prevent future races when changing `HasSchedule`. The goal of this test is to ensure that `HasSchedule` is called in parallel with the function passed to `time.AfterFunc` being executed. Using `failed` as a condition was an easy way to confirm that the test reached its goal of having the timer-triggered function run.
I'll add a comment here to better explain this.

```go
}
select {
case <-time.After(time.Second):
case <-timeout:
```
Contributor:

When do we hit the timeout condition vs. falling through via the initial case statement on line 114?

@cezarsa (Author):

Line 114, `case <-time.After(time.Second)`, acts only as a sleep, ensuring that the test does not continuously consume CPU cycles. Line 115, `case <-timeout`, is a stop condition that prevents the test from hanging forever if, due to some bug, the scheduled function passed to `time.AfterFunc` is never called. I think some comments would also be nice here; working on it.
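
Putting the two quoted fragments together, the polling loop has roughly this shape (a sketch with assumed variable names and timeout duration, not the verbatim test):

```go
timeout := time.After(time.Minute) // overall stop condition

for {
	hasSchedule, failed := scheduler.HasSchedule(node.Name)
	if !hasSchedule {
		t.Fatalf("Missing schedule record for node %v", node.Name)
	}
	if failed {
		// The timer-triggered drain ran (and failed, as induced by the
		// failing drainer), so HasSchedule was exercised in parallel
		// with the time.AfterFunc callback: the test reached its goal.
		break
	}
	select {
	case <-time.After(time.Second):
		// Sleep between polls so the loop doesn't spin on the CPU.
	case <-timeout:
		t.Fatalf("timed out waiting for the scheduled drain to run")
	}
}
```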

@jacobstr (Contributor) commented:

To confirm: the data race you saw occurred after adding the test case, but before making the material changes to the implementation of `setFailed`?

@cezarsa (Author) commented Aug 20, 2020

Yes, you're right: the data race was triggered by running only the new test case, before any changes to `drainSchedule.go`.
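
For reference, that amounts to running just the new test under the race detector (test name as it appears in the race output below):

```
$ go test -race -run TestDrainSchedules_HasSchedule_Polling ./internal/kubernetes/
```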

The new test helps expose a current race condition, while also helping prevent similar races in `HasSchedule` calls that interact with in-progress draining operations.

The race condition also happens during normal draino operation because
the event handler may call `HasSchedule` multiple times for an already
scheduled node.

Data race sample:

```
$ go test -race
Now: 2020-08-20T16:22:10-03:00
==================
WARNING: DATA RACE
Write at 0x00c0000c8098 by goroutine 16:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).newSchedule.func1()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:125 +0x235

Previous read at 0x00c0000c8098 by goroutine 12:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).HasSchedule()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:59 +0x142
  github.com/planetlabs/draino/internal/kubernetes.TestDrainSchedules_HasSchedule_Polling()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule_test.go:108 +0x54a
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb

Goroutine 16 (running) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.14.5/libexec/src/time/sleep.go:168 +0x51

Goroutine 12 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:66 +0x223
==================
==================
WARNING: DATA RACE
Write at 0x00c0000c8099 by goroutine 16:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).newSchedule.func1()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:131 +0x65b

Previous read at 0x00c0000c8099 by goroutine 12:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).HasSchedule()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:59 +0x15f
  github.com/planetlabs/draino/internal/kubernetes.TestDrainSchedules_HasSchedule_Polling()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule_test.go:108 +0x54a
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb

Goroutine 16 (running) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.14.5/libexec/src/time/sleep.go:168 +0x51

Goroutine 12 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:66 +0x223
==================
--- FAIL: TestDrainSchedules_HasSchedule_Polling (11.04s)
    testing.go:906: race detected during execution of test
FAIL
exit status 1
FAIL	github.com/planetlabs/draino/internal/kubernetes	13.829s
```
This commit fixes a data race when calling `HasSchedule`. The race happens because the function passed to `time.AfterFunc` runs in a new goroutine and updates fields which may be read in parallel by `HasSchedule`.

The `running` return value of `HasSchedule` was removed to simplify the code, as it wasn't used anywhere.
@cezarsa (Author) commented Aug 20, 2020

I've just pushed some updates adding comments to the code. I've also split the work into 2 commits: the first includes only the new test case failing with a data race, and the second has the fix.

@jacobstr merged commit ead5f1c into planetlabs:master on Aug 21, 2020