24 changes: 18 additions & 6 deletions test/e2e-go/features/partitionRecovery/partitionRecovery_test.go
@@ -135,6 +135,10 @@ func runTestWithStaggeredStopStart(t *testing.T, fixture *fixtures.RestClientFix
// Stop Node1
nc1.FullStop()

status, err := fixture.LibGoalClient.Status()
a.NoError(err)
roundAfterStop := status.LastRound

time.Sleep(inducePartitionTime)

// Use the fixture to start the node again so it supplies the correct peer addresses
@@ -152,10 +156,10 @@ func runTestWithStaggeredStopStart(t *testing.T, fixture *fixtures.RestClientFix
a.NoError(err)

// Now wait for us to make progress again.
status, err := fixture.LibGoalClient.Status()
status, err = fixture.LibGoalClient.Status()
a.NoError(err)

a.Equal(waitForRound, status.LastRound, "We should not have made progress since stopping the first node")
Contributor:

Was the bug that sometimes we would get to waitForRound+1, so waitForRound did not equal status.LastRound?

It seems like there is still a race condition between resuming node 1 and shutting down node 2.

Could we move fixture.StartNode(nc1.GetDataDir()) to after nc2.FullStop() and before time.Sleep(20 * time.Second)?

Contributor:

Maybe that race condition is the point of the test? I'm not sure exactly what is intended by the name runTestWithStaggeredStopStart; the comments in the other two tests you changed make them a bit easier to review. Maybe you could add a similar comment to this test?

Contributor Author:

The intent of the test is to make sure that after stopping Node1, the network does not make any progress.

If the test proceeds in a timely manner, the node will be before round 3 when ClientWaitForRoundWithTimeout is called, and will be at round 3 when Node1 is stopped. If this happens, there is no problem.

However, if the test thread is slow, nothing prevents Node1 from getting all the way to, say, round 5 by the time it is stopped. The delay may occur before ClientWaitForRoundWithTimeout, or after ClientWaitForRoundWithTimeout and before FullStop. This can be reproduced by adding a 5-second sleep at any of these points.

By reading the round number after FullStop, we get the round the network was at once Node1 stopped. It might be slightly later than the round at the moment of stopping, but that is fine, as long as we make sure the network makes no progress during the inducePartitionTime sleep.
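
For reference, a minimal sketch of the pattern this diff settles on, written as a fragment that assumes the surrounding test setup in this file (nc1, fixture, a, and inducePartitionTime as used above):

```go
// Stop Node1, then sample the network's round *after* the stop, so any rounds
// gained while the test thread was slow are already included in the baseline.
nc1.FullStop()

status, err := fixture.LibGoalClient.Status()
a.NoError(err)
roundAfterStop := status.LastRound

// The network should be stalled now; it must not advance during this window.
time.Sleep(inducePartitionTime)

status, err = fixture.LibGoalClient.Status()
a.NoError(err)
a.Equal(roundAfterStop, status.LastRound, "We should not have made progress since stopping the first node")
```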

Contributor:

Thanks for the explanation. I think there is still a race condition; here is what the test does:

n1 - FullStop
roundAfterStop
n1.Start
                        <--- race condition here, n1 and n2 are running
n2.FullStop
roundAfterStall

Contributor Author:

In theory, yes, you are correct. However, this is unlikely to happen: at this point the network is stalled, and it will take a long time for it to recover and make any progress. By the time that could happen, n2 is already stopped.

In the previous situation, however, the round could change even within a fraction of a second.

Contributor:

I'll defer on whether this point is blocking for the PR. It seems like we may as well stop n2 before starting n1, but maybe it isn't a problem.
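
For illustration only, a rough sketch of the reordering proposed in this thread (stopping Node2 before restarting Node1), reusing the helpers and names that appear elsewhere in this file; this is the reviewer's suggestion, not what this PR merges:

```go
// Proposed ordering: keep Node1 and Node2 from running at the same time
// while the stall is being measured.
nc1.FullStop()

status, err := fixture.LibGoalClient.Status()
a.NoError(err)
roundAfterStop := status.LastRound

// Stop Node2 *before* restarting Node1.
nc2, err := fixture.GetNodeController("Node2")
a.NoError(err)
nc2.FullStop()

// Restart Node1 through the fixture so it gets the correct peer addresses.
_, err = fixture.StartNode(nc1.GetDataDir())
a.NoError(err)

// Give the partitioned network a chance to stall, then confirm no progress.
time.Sleep(inducePartitionTime)

status, err = fixture.LibGoalClient.Status()
a.NoError(err)
a.Equal(roundAfterStop, status.LastRound, "We should not have made progress since stopping the first node")
```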

a.Equal(roundAfterStop, status.LastRound, "We should not have made progress since stopping the first node")

err = fixture.WaitForRound(status.LastRound+1, partitionRecoveryTime)
a.NoError(err)
@@ -193,6 +197,10 @@ func TestBasicPartitionRecoveryPartOffline(t *testing.T) {
// Stop Node1
nc1.FullStop()

status, err := fixture.LibGoalClient.Status()
a.NoError(err)
roundAfterStop := status.LastRound

// Stop the 2nd node and give network a chance to stall
nc2, err := fixture.GetNodeController("Node2")
a.NoError(err)
@@ -205,10 +213,10 @@ func TestBasicPartitionRecoveryPartOffline(t *testing.T) {
a.NoError(err)

// Now wait for us to make progress again.
status, err := fixture.LibGoalClient.Status()
status, err = fixture.LibGoalClient.Status()
a.NoError(err)

a.Equal(waitForRound, status.LastRound, "We should not have made progress since stopping the first node")
a.Equal(roundAfterStop, status.LastRound, "We should not have made progress since stopping the first node")

err = fixture.WaitForRound(status.LastRound+1, partitionRecoveryTime)
a.NoError(err)
@@ -253,12 +261,16 @@ func TestPartitionHalfOffline(t *testing.T) {
a.NoError(err)
nc3.FullStop()

status, err := client.Status()
a.NoError(err)
roundAfterStop := status.LastRound

time.Sleep(inducePartitionTime)

// Get main client to monitor
status, err := client.Status()
status, err = client.Status()
a.NoError(err)
a.Equal(waitForRound, status.LastRound, "We should not have made progress since stopping the nodes")
a.Equal(roundAfterStop, status.LastRound, "We should not have made progress since stopping the nodes")

// Start 40 of 50% of the stake
_, err = fixture.StartNode(nc1.GetDataDir())