storage: make TestRaftRemoveRace race harder #10637

tamird · 2016-11-11T19:43:30Z

This change trivially reproduces #9017 under stress, and it would be good to figure out why.

Note that changing this test to use only 2 stores causes it to never fail; this suggests that the 3rd store (neither the initiator the ChangeReplicas nor its target) is involved in producing this failure, perhaps via unexpected raft messages to the ChangeReplicas target.

Assigning to @petermattis because he's most familiar with this code, though @bdarnell, @jordanlewis, and @arjunravinarayan might also have some ideas.

This change is

petermattis · 2016-11-11T22:27:27Z

Can we mark the test as skipped and merge? This definitely needs to be investigated, especially the panic.

Reviewed 1 of 1 files at r1.
Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

pkg/storage/client_raft_test.go, line 2009 at r2 (raw file):

          storeIndexes = append(storeIndexes, i)
      }
      mtc.replicateRange(rangeID, storeIndexes...)

This is what the code was doing before and the previous setup is common practice. I'm not sure why you felt this should change.

Comments from Reviewable

tamird · 2016-11-11T23:57:09Z

There's probably no rush to merge this - the panic actually occurs more readily without this patch, since the other failure cases are prevented (see #9037 (comment)).

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

pkg/storage/client_raft_test.go, line 2009 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is what the code was doing before and the previous setup is common practice. I'm not sure why you felt this should change.

I was playing with the number of nodes used in the test to see if it would change the outcome. I'm still contemplating running this test in a few configurations via sub-tests.

Comments from Reviewable

bdarnell · 2016-11-13T06:25:32Z

Note that changing this test to use only 2 stores causes it to never fail; this suggests that the 3rd store (neither the initiator the ChangeReplicas nor its target) is involved in producing this failure, perhaps via unexpected raft messages to the ChangeReplicas target.

An alternate theory is that with two stores, the replication changes are more synchronized: the second store's participation in the quorum is required. With three stores, the two stores that are left alone form a quorum and commit the replication changes as fast as they can, leaving the third store to find out about the changes in racier ways.

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

Comments from Reviewable

tamird · 2016-11-22T20:55:47Z

Rebased this - some new failures are now exposed, though perhaps those are expected?

@petermattis triage ping

This produces instant failures under stress: ``` 0 runs so far, 0 failures, over 5s --- FAIL: TestRaftRemoveRace (4.94s) client_test.go:915: change replicas of range 1 failed: result is ambiguous (Raft re-proposal failed: write at timestamp 0.000000123,1228 too old; wrote at 0.000000123,1229) FAIL ``` This one is not new, and is tracked in #10969. ``` 0 runs so far, 0 failures, over 5s --- FAIL: TestRaftRemoveRace (7.24s) client_raft_test.go:1863: [n1,s1,r1/1:/M{in-ax}]: change replicas aborted due to failed preemptive snapshot: range=1: remote couldn't accept snapshot with error: [n2,s2],r1: cannot apply snapshot: [n2,s2]: canApplySnapshotLocked: cannot add placeholder, have an existing placeholder range=1 [/Min-/Max) (placeholder) FAIL ``` This one needs to be investigated. When the above failures are ignored, the test sometimes times out, even when the timeout is increased to 5 minutes (the test normally takes less than 10 seconds). ``` $ make stress STRESSFLAGS='-p 100 -stderr -maxfails 1 -ignore "(have an existing placeholder|result is ambiguous)"' PKG=./pkg/storage TESTS=TestRaftRemoveRace ```

Store.{add,remove}Placeholder need to be called while holding Replica.raftMu. This wasn't being done in Store.processReady which allowed a placeholder to be removed prematurely. Closes cockroachdb#10637

tamird assigned petermattis Nov 11, 2016

This was referenced Nov 22, 2016

storage: TestRaftRemoveRace triggers log.Fatal in (*Store).removeReplicaImpl #9017

Closed

storage: misc cleanup #10971

Merged

tamird added 2 commits November 22, 2016 18:09

storage: add debug logging

3a528ce

petermattis mentioned this pull request Nov 28, 2016

storage: fix a bug in placeholder locking #11657

Merged

petermattis closed this in #11657 Nov 28, 2016

tamird deleted the placeholder-fatal branch November 28, 2016 15:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

storage: make TestRaftRemoveRace race harder #10637

storage: make TestRaftRemoveRace race harder #10637

Uh oh!

tamird commented Nov 11, 2016 •

edited by rjnn

Loading

Uh oh!

petermattis commented Nov 11, 2016

Uh oh!

tamird commented Nov 11, 2016

Uh oh!

bdarnell commented Nov 13, 2016

Uh oh!

tamird commented Nov 22, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

storage: make TestRaftRemoveRace race harder #10637

storage: make TestRaftRemoveRace race harder #10637

Uh oh!

Conversation

tamird commented Nov 11, 2016 • edited by rjnn Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petermattis commented Nov 11, 2016

Uh oh!

tamird commented Nov 11, 2016

Uh oh!

bdarnell commented Nov 13, 2016

Uh oh!

tamird commented Nov 22, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tamird commented Nov 11, 2016 •

edited by rjnn

Loading