storage: make TestRaftRemoveRace race harder #10637
Conversation
Reviewed 1 of 1 files at r1.

pkg/storage/client_raft_test.go, line 2009 at r2:
This is what the code was doing before, and the previous setup is common practice. I'm not sure why you felt this should change.
There's probably no rush to merge this - the panic actually occurs more readily without this patch, since the other failure cases are prevented (see #9037 (comment)).
An alternate theory is that with two stores, the replication changes are more synchronized: the second store's participation in the quorum is required. With three stores, the two stores that are left alone form a quorum and commit the replication changes as fast as they can, leaving the third store to find out about the changes in racier ways.
Rebased this - some new failures are now exposed, though perhaps those are expected? @petermattis triage ping
This produces instant failures under stress:

```
0 runs so far, 0 failures, over 5s
--- FAIL: TestRaftRemoveRace (4.94s)
	client_test.go:915: change replicas of range 1 failed: result is ambiguous (Raft re-proposal failed: write at timestamp 0.000000123,1228 too old; wrote at 0.000000123,1229)
FAIL
```

This one is not new, and is tracked in #10969.

```
0 runs so far, 0 failures, over 5s
--- FAIL: TestRaftRemoveRace (7.24s)
	client_raft_test.go:1863: [n1,s1,r1/1:/M{in-ax}]: change replicas aborted due to failed preemptive snapshot: range=1: remote couldn't accept snapshot with error: [n2,s2],r1: cannot apply snapshot: [n2,s2]: canApplySnapshotLocked: cannot add placeholder, have an existing placeholder range=1 [/Min-/Max) (placeholder)
FAIL
```

This one needs to be investigated.

When the above failures are ignored, the test sometimes times out, even when the timeout is increased to 5 minutes (the test normally takes less than 10 seconds).

```
$ make stress STRESSFLAGS='-p 100 -stderr -maxfails 1 -ignore "(have an existing placeholder|result is ambiguous)"' PKG=./pkg/storage TESTS=TestRaftRemoveRace
```
Store.{add,remove}Placeholder need to be called while holding
Replica.raftMu. This wasn't being done in Store.processReady, which
allowed a placeholder to be removed prematurely.
Closes cockroachdb#10637
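A minimal sketch of the locking rule the commit message describes, using simplified stand-in types rather than the real cockroach Store/Replica API (all names below are illustrative, not the actual code):

```go
package main

import "sync"

// Hypothetical, simplified types to illustrate the rule that placeholder
// bookkeeping must happen while the affected Replica's raftMu is held.
type Replica struct {
	raftMu sync.Mutex // serializes raft processing for this replica
}

type Store struct {
	mu           sync.Mutex
	placeholders map[int64]bool // placeholder bookkeeping, keyed by range ID
}

// addPlaceholder and removePlaceholder mutate placeholder state. Per the
// commit message, callers must already hold the relevant Replica's raftMu.
func (s *Store) addPlaceholder(rangeID int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.placeholders[rangeID] = true
}

func (s *Store) removePlaceholder(rangeID int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.placeholders, rangeID)
}

// processReady sketches the fix: the placeholder is removed only while raftMu
// is still held, so a concurrent preemptive snapshot cannot race with the
// removal and see a prematurely cleared (or duplicated) placeholder.
func (s *Store) processReady(r *Replica, rangeID int64) {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()

	// ... apply raft entries / snapshot for rangeID ...

	s.removePlaceholder(rangeID)
}

func main() {
	s := &Store{placeholders: map[int64]bool{1: true}}
	s.processReady(&Replica{}, 1)
}
```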
This change trivially reproduces #9017 under stress, and it would be good to figure out why.
Note that changing this test to use only 2 stores causes it to never fail; this suggests that the 3rd store (neither the initiator of the ChangeReplicas nor its target) is involved in producing this failure, perhaps via unexpected raft messages to the ChangeReplicas target.
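For context, a rough sketch of the shape of the test being discussed; the helper names (startMultiTestContext, replicateRange, unreplicateRange) follow the client_raft_test.go harness, but the loop bounds and replica indices here are assumptions rather than the exact diff:

```go
func TestRaftRemoveRace(t *testing.T) {
	defer leaktest.AfterTest(t)()
	// Three stores: store 0 initiates the ChangeReplicas calls, store 2 is the
	// target being added and removed, and store 1 is the bystander that (per
	// the note above) appears necessary to reproduce the failure.
	mtc := startMultiTestContext(t, 3)
	defer mtc.Stop()

	const rangeID = roachpb.RangeID(1)
	mtc.replicateRange(rangeID, 1, 2)

	// Repeatedly remove and re-add the replica on store 2. With three stores,
	// stores 0 and 1 form a quorum and commit each change immediately, so
	// store 2 learns about its own removal and re-addition asynchronously.
	for i := 0; i < 10; i++ {
		mtc.unreplicateRange(rangeID, 2)
		mtc.replicateRange(rangeID, 2)
	}
}
```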
Assigning to @petermattis because he's most familiar with this code, though @bdarnell, @jordanlewis, and @arjunravinarayan might also have some ideas.