Fix data race in drain scheduler #86
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master      #86      +/-   ##
==========================================
+ Coverage   69.84%   73.92%   +4.07%
==========================================
  Files           7        7
  Lines         461      464        +3
==========================================
+ Hits          322      343       +21
+ Misses        130      113       -17
+ Partials        9        8        -1
```
Continue to review full report at Codecov.
```go
if !hasSchedule {
	t.Fatalf("Missing schedule record for node %v", node.Name)
}
if failed {
```
So, we want `failed` when the test passes (counter-intuitive on the first read) because we're inducing an error with the `failDrainer`. The reason we're doing this is that we specifically want to exercise the atomic `setFailed`.
I'd suggest just adding a brief comment before the `break` statement.
Yes, the main reason for adding a test case here is to prevent future races when changing `HasSchedule`. The goal of this test is to ensure we're calling `HasSchedule` in parallel with the execution of the function passed to `time.AfterFunc`. Using `failed` as the condition was an easy way to ensure the test has reached its goal of having the timer-triggered function run. I'll add a comment here to better explain this.
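For reference, here is a minimal sketch of what a failure-injecting drainer along the lines of the `failDrainer` mentioned above could look like; the `Drainer` interface and its string-based signature are simplified assumptions, not draino's actual types.

```go
package kubernetes

import "errors"

// Drainer is a simplified stand-in for the interface the scheduler drains
// nodes through; draino's real interface takes a Kubernetes node object.
type Drainer interface {
	Drain(nodeName string) error
}

// failDrainer always fails, which forces the function scheduled via
// time.AfterFunc onto its error path so that it ends up calling setFailed.
type failDrainer struct{}

func (failDrainer) Drain(nodeName string) error {
	return errors.New("induced drain failure")
}
```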
```go
}
select {
case <-time.After(time.Second):
case <-timeout:
```
When do we hit the `timeout` condition vs. falling through via the initial `case` statement on line 114?
Line 114, `case <-time.After(time.Second)`, acts only as a sleep to ensure that the test is not continuously consuming CPU cycles. Line 115, `case <-timeout`, is a stop condition to prevent the test from hanging forever if, for some reason or bug, the scheduled function passed to `time.AfterFunc` is never called. I think some comments would also be nice here; working on it.
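Putting the two quoted fragments together, a rough sketch of the polling loop being described might look like the following; `scheduler` and `node` stand in for the test's real fixtures (setup omitted) and the `HasSchedule` signature shown is an assumption, so treat it as illustrative rather than the actual test code.

```go
func TestDrainSchedules_HasSchedule_Polling(t *testing.T) {
	// Assumed setup (omitted): scheduler is built with a failDrainer and
	// node has already been scheduled for draining.
	timeout := time.After(30 * time.Second)
	for {
		hasSchedule, failed := scheduler.HasSchedule(node.Name)
		if !hasSchedule {
			t.Fatalf("Missing schedule record for node %v", node.Name)
		}
		if failed {
			// The drainer is rigged to fail, so observing the failed flag proves
			// the timer-triggered goroutine ran while HasSchedule was being polled.
			break
		}
		select {
		case <-time.After(time.Second):
			// Acts only as a sleep between polls so the test does not spin.
		case <-timeout:
			// Stop condition: fail instead of hanging forever if the scheduled
			// function is never called.
			t.Fatalf("timeout waiting for the timer-triggered drain to run")
		}
	}
}
```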
To confirm: the data race you saw occurred after adding the test case, but before making the material changes to the implementation?
Yes, you're right: the data race was triggered by running only the new test case, before any changes to the implementation.
The new test helps expose a current race condition while also helping prevent similar races in future calls to `HasSchedule` that interact with in-progress draining operations. The race condition also happens during normal draino operation because the event handler may call `HasSchedule` multiple times for an already scheduled node.

Data race sample:

```
$ go test -race
Now: 2020-08-20T16:22:10-03:00
==================
WARNING: DATA RACE
Write at 0x00c0000c8098 by goroutine 16:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).newSchedule.func1()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:125 +0x235

Previous read at 0x00c0000c8098 by goroutine 12:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).HasSchedule()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:59 +0x142
  github.com/planetlabs/draino/internal/kubernetes.TestDrainSchedules_HasSchedule_Polling()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule_test.go:108 +0x54a
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb

Goroutine 16 (running) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.14.5/libexec/src/time/sleep.go:168 +0x51

Goroutine 12 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:66 +0x223
==================
==================
WARNING: DATA RACE
Write at 0x00c0000c8099 by goroutine 16:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).newSchedule.func1()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:131 +0x65b

Previous read at 0x00c0000c8099 by goroutine 12:
  github.com/planetlabs/draino/internal/kubernetes.(*DrainSchedules).HasSchedule()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule.go:59 +0x15f
  github.com/planetlabs/draino/internal/kubernetes.TestDrainSchedules_HasSchedule_Polling()
      /Users/cezarsa/code/draino/internal/kubernetes/drainSchedule_test.go:108 +0x54a
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb

Goroutine 16 (running) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.14.5/libexec/src/time/sleep.go:168 +0x51

Goroutine 12 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /usr/local/Cellar/go/1.14.5/libexec/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:66 +0x223
==================
--- FAIL: TestDrainSchedules_HasSchedule_Polling (11.04s)
    testing.go:906: race detected during execution of test
FAIL
exit status 1
FAIL    github.com/planetlabs/draino/internal/kubernetes    13.829s
```
This commit fixes a data race when calling `HasSchedule`. The race happens because the function passed to `time.AfterFunc` is called from a new goroutine and updates fields that may be read in parallel by `HasSchedule`. The `running` return value from `HasSchedule` was removed to simplify the code, as it wasn't used anywhere.
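A minimal sketch of the kind of synchronization this describes, assuming the schedule's state is guarded by a mutex; the names mirror the discussion (`HasSchedule`, `setFailed`, `time.AfterFunc`), but the struct layout and signatures are assumptions, not draino's exact code.

```go
package kubernetes

import (
	"sync"
	"time"
)

type schedule struct {
	mu     sync.Mutex
	failed bool
}

func (s *schedule) setFailed() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.failed = true
}

func (s *schedule) isFailed() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.failed
}

// DrainSchedules tracks one schedule per node name.
type DrainSchedules struct {
	sync.Mutex
	schedules map[string]*schedule
	drain     func(nodeName string) error
}

// HasSchedule reports whether a node has a schedule and whether its drain
// failed. The timer-triggered goroutine may be writing the flag concurrently,
// so it is only read through the mutex-guarded accessor.
func (d *DrainSchedules) HasSchedule(nodeName string) (has, failed bool) {
	d.Lock()
	sched, ok := d.schedules[nodeName]
	d.Unlock()
	if !ok {
		return false, false
	}
	return true, sched.isFailed()
}

// newSchedule registers a drain to run after delay. The function passed to
// time.AfterFunc runs in its own goroutine, so it records failure through
// setFailed instead of writing the field directly.
func (d *DrainSchedules) newSchedule(nodeName string, delay time.Duration) *schedule {
	sched := &schedule{}
	time.AfterFunc(delay, func() {
		if err := d.drain(nodeName); err != nil {
			sched.setFailed()
		}
	})
	d.Lock()
	d.schedules[nodeName] = sched
	d.Unlock()
	return sched
}
```

Funneling every read of the flag through the accessor keeps `HasSchedule` simple while giving the race detector nothing to flag when the timer callback and the event handler run concurrently.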
I've just pushed some updates adding some comments to the code. I've also split the work into 2 commits: the first one includes only the new test case, which fails with a data race, and the second one has the fix.