Data race in tests: TestApplicationsUpgradeOverREST race #2844
Conversation
Codecov Report

@@           Coverage Diff           @@
##           master    #2844   +/-  ##
=======================================
  Coverage   47.33%   47.33%
=======================================
  Files         351      351
  Lines       56523    56523
=======================================
+ Hits        26754    26756    +2
+ Misses     26760    26758    -2
  Partials     3009     3009

Continue to review full report at Codecov.
The expected round number is captured before the node stops. However, the node is likely to advance to the next round before it is actually stopped, and when that happens the test fails. This change captures the most up-to-date round number after the node is stopped, but before waiting out the inducePartitionTime timeout. inducePartitionTime is the wait that makes sure the expected behavior is obtained; the round number is now captured before this wait (see the sketch below).

However, I could not identify in this PR why TestBasicPartitionRecovery has failed. I could not find anything in the test, and the failure logs show nothing. I suspect that the failure in the other tests triggered this one, and that fixing the other tests addressed it, but I cannot be sure. As for the data race, it is fixed in #2844.

Fixes #2384 and #2545
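A minimal sketch of the reordering, with hypothetical stand-ins (stopNode, currentRound) for the real fixture helpers, which this sketch does not attempt to reproduce:

```go
package upgrades_test

import (
	"testing"
	"time"
)

// testRoundCapturedAfterStop is a hypothetical illustration, not the actual
// test: the round is read only after the node has stopped, so the node cannot
// advance past the captured value before the inducePartitionTime wait begins.
func testRoundCapturedAfterStop(t *testing.T, stopNode func(), currentRound func() uint64, inducePartitionTime time.Duration) {
	stopNode() // stop the node first ...

	// ... then capture the most up-to-date round number.
	expectedRound := currentRound()

	// Wait to make sure the expected behavior is obtained.
	time.Sleep(inducePartitionTime)

	if got := currentRound(); got != expectedRound {
		t.Fatalf("round advanced from %d to %d while the node was stopped", expectedRound, got)
	}
}
```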
Based on the details you've provided, I don't believe that the proposed solution would fix the issue:
You are right. My changes addressed only failures caused by the test, but not failures coming from the fixture. |
tsachiherman
left a comment
Looks good. Make sure to fix:

Go files must be formatted with gofmt. Please run:
gofmt -w /home/circleci/project/test/e2e-go/upgrades/application_support_test.go
Force-pushed from b7543c1 to 74e55f1.
tsachiherman
left a comment
Looks good.
Summary
When multiple threads use the synchTest object and one of them fails, all of them call FailNow. After the first call, the main test routine reads t.finished to terminate the test. While the main test routine is reading t.finished, another thread calls FailNow, which in turn writes to t.finished, causing the data race. Although the FailNow calls made through synchTest are guarded by a mutex, the main test routine accesses t.finished through the underlying TestingTB object, which does not use the same mutex. The solution is for the synchTest object to call FailNow only the first time.
Fixes #2507
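A minimal sketch of that fix, with a hypothetical wrapper standing in for the repository's actual synchTest type:

```go
package fixtures_test

import (
	"sync"
	"testing"
)

// syncTestSketch illustrates the pattern (hypothetical names, not the real
// type): every FailNow goes through one mutex, and only the first failure is
// forwarded to the underlying *testing.T, so later failures never write
// t.finished while the main test goroutine is reading it.
type syncTestSketch struct {
	mu     sync.Mutex
	failed bool
	t      *testing.T
}

func (s *syncTestSketch) FailNow() {
	s.mu.Lock()
	first := !s.failed
	s.failed = true
	s.mu.Unlock() // release before FailNow, which never returns

	if first {
		s.t.FailNow() // only the first failing thread touches testing.T
	}
}
```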
More details:
The problem behind the string of data race conditions we are observing in different tests when a test fails is the following:

When the test fails, FailNow in go/src/testing/testing.go sets c.finished, which is the "Previous write" part of the data race. This call comes from different threads, possibly multiple ones, forked from (not exclusively) go-algorand/nodecontrol/algodControl.go, where ExitErrorCallback calls NoError, which, in case of an error, calls FailNow. Here f.t happens to be the synchTest object, which has a mutex guarding the FailNow call and hence prevents simultaneous access to t.finished.

However, there is one more read access to t.finished, coming from go/src/testing/testing.go itself; it does not go through the synchTest object and hence does not hold the mutex used for the write.
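Reduced to a hypothetical stand-alone sketch (plain Go, not the actual testing-package internals), the race has this shape: the write is guarded by a mutex that the reader never takes, so the guard protects nothing and the race detector reports the pair:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	finished := false // stands in for t.finished

	// Writer: like FailNow called through synchTest, the write is guarded
	// by synchTest's mutex ...
	go func() {
		mu.Lock()
		finished = true
		mu.Unlock()
	}()

	// Reader: ... but the testing package reads the flag directly, without
	// that mutex, so `go run -race` flags the unsynchronized read/write pair.
	for !finished {
		time.Sleep(time.Millisecond)
	}
}
```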
In other words, when there are multiple threads and all of them are fine, no thread writes to t.finished. When something goes wrong and FailNow is initiated, the other threads are notified of the failure and, in their turn, also fail via require.NoError, which calls FailNow and writes t.finished = true.

Test Plan