Skip to content

Conversation

@tamird
Copy link
Contributor

@tamird tamird commented Nov 11, 2016

This change trivially reproduces #9017 under stress, and it would be good to figure out why.

Note that changing this test to use only 2 stores causes it to never fail; this suggests that the 3rd store (neither the initiator the ChangeReplicas nor its target) is involved in producing this failure, perhaps via unexpected raft messages to the ChangeReplicas target.

Assigning to @petermattis because he's most familiar with this code, though @bdarnell, @jordanlewis, and @arjunravinarayan might also have some ideas.


This change is Reviewable

@petermattis
Copy link
Collaborator

:lgtm: Can we mark the test as skipped and merge? This definitely needs to be investigated, especially the panic.


Reviewed 1 of 1 files at r1.
Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/client_raft_test.go, line 2009 at r2 (raw file):

          storeIndexes = append(storeIndexes, i)
      }
      mtc.replicateRange(rangeID, storeIndexes...)

This is what the code was doing before and the previous setup is common practice. I'm not sure why you felt this should change.


Comments from Reviewable

@tamird
Copy link
Contributor Author

tamird commented Nov 11, 2016

There's probably no rush to merge this - the panic actually occurs more readily without this patch, since the other failure cases are prevented (see #9037 (comment)).


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/client_raft_test.go, line 2009 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is what the code was doing before and the previous setup is common practice. I'm not sure why you felt this should change.

I was playing with the number of nodes used in the test to see if it would change the outcome. I'm still contemplating running this test in a few configurations via sub-tests.

Comments from Reviewable

@bdarnell
Copy link
Contributor

Note that changing this test to use only 2 stores causes it to never fail; this suggests that the 3rd store (neither the initiator the ChangeReplicas nor its target) is involved in producing this failure, perhaps via unexpected raft messages to the ChangeReplicas target.

An alternate theory is that with two stores, the replication changes are more synchronized: the second store's participation in the quorum is required. With three stores, the two stores that are left alone form a quorum and commit the replication changes as fast as they can, leaving the third store to find out about the changes in racier ways.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


Comments from Reviewable

@tamird
Copy link
Contributor Author

tamird commented Nov 22, 2016

Rebased this - some new failures are now exposed, though perhaps those are expected?

@petermattis triage ping

This produces instant failures under stress:
```
0 runs so far, 0 failures, over 5s

--- FAIL: TestRaftRemoveRace (4.94s)
	client_test.go:915: change replicas of range 1 failed: result is ambiguous (Raft re-proposal failed: write at timestamp 0.000000123,1228 too old; wrote at 0.000000123,1229)
FAIL
```
This one is not new, and is tracked in #10969.
```
0 runs so far, 0 failures, over 5s

--- FAIL: TestRaftRemoveRace (7.24s)
	client_raft_test.go:1863: [n1,s1,r1/1:/M{in-ax}]: change replicas aborted due to failed preemptive snapshot: range=1: remote couldn't accept snapshot with error: [n2,s2],r1: cannot apply snapshot: [n2,s2]: canApplySnapshotLocked: cannot add placeholder, have an existing placeholder range=1 [/Min-/Max) (placeholder)
FAIL
```
This one needs to be investigated.

When the above failures are ignored, the test sometimes times out, even
when the timeout is increased to 5 minutes (the test normally takes
less than 10 seconds).

```
$ make stress STRESSFLAGS='-p 100 -stderr -maxfails 1 -ignore "(have an existing placeholder|result is ambiguous)"' PKG=./pkg/storage TESTS=TestRaftRemoveRace

```
petermattis added a commit to petermattis/cockroach that referenced this pull request Nov 28, 2016
Store.{add,remove}Placeholder need to be called while holding
Replica.raftMu. This wasn't being done in Store.processReady which
allowed a placeholder to be removed prematurely.

Closes cockroachdb#10637
@tamird tamird deleted the placeholder-fatal branch November 28, 2016 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants