Skip to content

Conversation

@benesch
Copy link
Contributor

@benesch benesch commented Jul 4, 2018

Add tests to verify the correct behavior of transactions that span a
merge. In the process, resolve TODOs about a) copying the abort span
from the RHS to the LHS during a merge, and b) clearing the RHS's
transaction wait queue.

Fixes #27091.

@benesch benesch requested review from a team, bdarnell and tbg July 4, 2018 05:37
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/client_merge_test.go, line 439 at r1 (raw file):

			t.Fatal(err)
		}
		mtc.replicateRange(bDesc.RangeID, 1)

Add a paragraph explaining what these three lines achieve.


pkg/storage/client_merge_test.go, line 447 at r1 (raw file):

		txn := client.NewTxn(store.DB(), 0 /* gatewayNodeID */, client.RootTxn)
		// Put the key on the RHS side first so the transaction record will need to
		// be copied to the LHS range during the merge.

Txn records don't get copied, they're keyed on txn.Key. You mean that it changes ownership?


pkg/storage/client_merge_test.go, line 454 at r1 (raw file):

			t.Fatal(err)
		}
		args := adminMergeArgs(roachpb.KeyMin)

Why KeyMin?


pkg/storage/client_merge_test.go, line 488 at r1 (raw file):

		// realize until after the merge.
		txn1 := client.NewTxn(store.DB(), 0 /* gatewayNodeID */, client.RootTxn)
		// Put the key on the RHS side so the transaction record and abort span

Ditto about txn record.


pkg/storage/client_merge_test.go, line 538 at r1 (raw file):

		// Create a transaction that won't complete until after the merge.
		txn1 := client.NewTxn(store.DB(), 0 /* gatewayNodeID */, client.RootTxn)
		// Put the key on the RHS side so the transaction record and abort span

Ditto


pkg/storage/client_merge_test.go, line 552 at r1 (raw file):

		// Complete the merge.
		time.Sleep(50 * time.Millisecond)

What's the Sleep for? To give txn2 time to get queued? I assume you needed this to get the test to fail without the queue clearing? Even on testrace? Add a comment.


pkg/storage/client_merge_test.go, line 569 at r1 (raw file):

		}

		// Now that txn1 has commited, txn2's put operation should complete.

committed


pkg/storage/batcheval/cmd_end_transaction.go, line 989 at r1 (raw file):

	}

	var ignoredMS enginepb.MVCCStats

These can't be ignored. By copying an abortspan into another nonempty and possibly overlapping abortspan you create a nontrivial stats delta. Unfortunately you're going to have to run some math here. Perhaps the code from the previous abort span copy that I ripped out at this very spot recently did this correctly and can be resuscitated (but I doubt it).
It would also be good to have a test for this, where the LHS and RHS both have an abortspan record for a given txn (plus some that are thrown away, etc), and you simply recompute the stats after the merge.

@benesch benesch force-pushed the merge-in-flight-txns branch 4 times, most recently from 2086b56 to 9be8e23 Compare July 4, 2018 23:06
Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/client_merge_test.go, line 439 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Add a paragraph explaining what these three lines achieve.

Done.


pkg/storage/client_merge_test.go, line 447 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Txn records don't get copied, they're keyed on txn.Key. You mean that it changes ownership?

Ah, I'm referring to an implementation detail here. I mean that they get copied into a snapshot in GetSnapshotForMerge on the RHS and then applied to the LHS. Updated the wording to be more precise.


pkg/storage/client_merge_test.go, line 454 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why KeyMin?

shrug it's what all the tests in this file use to address the first range. I've updated this test, at least, to be more clear.


pkg/storage/client_merge_test.go, line 488 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto about txn record.

Done.


pkg/storage/client_merge_test.go, line 538 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto

Done.


pkg/storage/client_merge_test.go, line 552 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What's the Sleep for? To give txn2 time to get queued? I assume you needed this to get the test to fail without the queue clearing? Even on testrace? Add a comment.

Exactly. I've made this more robust by actually peeking into the txn wait queue.


pkg/storage/client_merge_test.go, line 569 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

committed

Done.


pkg/storage/batcheval/cmd_end_transaction.go, line 989 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

These can't be ignored. By copying an abortspan into another nonempty and possibly overlapping abortspan you create a nontrivial stats delta. Unfortunately you're going to have to run some math here. Perhaps the code from the previous abort span copy that I ripped out at this very spot recently did this correctly and can be resuscitated (but I doubt it).
It would also be good to have a test for this, where the LHS and RHS both have an abortspan record for a given txn (plus some that are thrown away, etc), and you simply recompute the stats after the merge.

D'oh. Good point. Fixed. I managed to avoid doing any math myself. :)

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/batcheval/cmd_end_transaction.go, line 989 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

D'oh. Good point. Fixed. I managed to avoid doing any math myself. :)

Did CI fail prior to the fix? (Hopefully it did, or we'll have to add something somewhere).

@benesch
Copy link
Contributor Author

benesch commented Jul 6, 2018

Yep, the additions I made to TestStoreRangeMergeStats reliably failed the test when I had the abort span stats computation wrong.

@benesch
Copy link
Contributor Author

benesch commented Jul 6, 2018

Hilariously I just spent 15m debugging a stress failure here that immediately disappeared as soon as I rebased on top of #27061 (merge lock safety improvements). Looks like we've got now test coverage for that previously-theoretical issue!

@benesch benesch force-pushed the merge-in-flight-txns branch from 9be8e23 to dca62f7 Compare July 6, 2018 20:56
@benesch
Copy link
Contributor Author

benesch commented Jul 6, 2018

bors r=tschottdorf,bdarnell

@craig
Copy link
Contributor

craig bot commented Jul 6, 2018

Merge conflict

@benesch benesch force-pushed the merge-in-flight-txns branch 2 times, most recently from a244f37 to 166e666 Compare July 8, 2018 15:56
Add tests to verify the correct behavior of transactions that span a
merge. In the process, resolve TODOs about a) copying the abort span
from the RHS to the LHS during a merge, and b) clearing the RHS's
transaction wait queue.
@benesch benesch force-pushed the merge-in-flight-txns branch from 166e666 to 9ea260f Compare July 8, 2018 16:00
@benesch
Copy link
Contributor Author

benesch commented Jul 8, 2018

bors r=tschottdorf,bdarnell

craig bot pushed a commit that referenced this pull request Jul 8, 2018
27149: storage: test transactions that span merges r=tschottdorf,bdarnell a=benesch

Add tests to verify the correct behavior of transactions that span a
merge. In the process, resolve TODOs about a) copying the abort span
from the RHS to the LHS during a merge, and b) clearing the RHS's
transaction wait queue.

Fixes #27091.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig
Copy link
Contributor

craig bot commented Jul 8, 2018

Build succeeded

@craig craig bot merged commit 9ea260f into cockroachdb:master Jul 8, 2018
@benesch benesch deleted the merge-in-flight-txns branch July 10, 2018 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

github.com/cockroachdb/cockroach/pkg/storage: TestStoreRangeMergeWithData/colocate=false failed under stress

4 participants