-
Notifications
You must be signed in to change notification settings - Fork 4k
sql: unblock user table changes from long-running transactions #150747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
f887695
to
0ca28d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to give this a deeper review soon, but in the meantime, check out this example of the kind of test we could use here:
cockroach/pkg/sql/tests/allow_role_memberships_to_change_during_transaction_test.go
Line 27 in 5d6ebe1
func TestAllowRoleMembershipsToChangeDuringTransaction(t *testing.T) { |
it's a "go unit test" but we have a lot of these kinds of integrationy tests that rely on serverutils.StartServer
to start an in-memory cockroachdb testserver, and run specific commands with control over multiple sessions and interleaving operations.
basically for this test, we'd want:
- one session for a non-root user that starts a transaction, then accesses some table, but does not commit yet.
- another session running as root that creates a new user. this operation should complete before the first transaction commits.
i don't think we actually need to create any background goroutines for this test, but in other tests that do, we launch them using a ctxgroup
.
Reviewable status:
complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/catalog/lease/lease.go
line 521 at r3 (raw file):
// WaitForCurrentVersionPropagated returns once all leaseholders of any version // of the descriptor hold a lease of the current version. func (m *Manager) WaitForCurrentVersionPropagated(
super nit: a name like WaitForNewVersionPropagated
(or even just WaitForNewVersion
) feels more intuitive to me.
pkg/sql/catalog/lease/lease.go
line 526 at r3 (raw file):
retryOpts retry.Options, regions regionliveness.CachedDatabaseRegions, ) (catalog.Descriptor, error) {
we also should be able to add a more targeted test (e.g. TestWaitForCurrentVersionPropagated
) in pkg/sql/catalog/lease/lease_test.go. this test also uses an in-memory test server, but rather than just testing with SQL commands, it also provides a way to access the LeaseManager of each node and make finer-grained assertions. take a look at TestLeaseManagerDrain
for example.
pkg/sql/catalog/lease/lease.go
line 533 at r3 (raw file):
// session: (session in Prev => session in Curr)` for the set theory // enjoyers). for r := retry.Start(retryOpts); r.Next(); {
todo for myself: study this logic more closely
pkg/sql/catalog/descs/collection.go
line 345 at r4 (raw file):
} // Provides a defer-friendly function that updates the version bump only flag for
super nit: a golang convention is to begin function-level comments with the name of the function:
// MaybeMarkVersionBump provides a defer-friendly...
pkg/sql/catalog/lease/lease.go
line 175 at r1 (raw file):
}) return descs, err
super nit: this was from before your PR, but as long as we have a cleanup commit, let's bias towards adding an explicit check before returning here:
if err != nil {
return nil, err
}
return descs, nil
this is slightly better, since if this code gets refactored later to add additional logic in this function, it would be easy for a future author to forget about this error checking.
pkg/sql/catalog/lease/lease.go
line 361 at r2 (raw file):
// On single region clusters we can query everything at once. if regionMap == nil { sessionIDs, err := getSessionsHoldingDescriptor(ctx, txn, schemaID, nil, "")
nit: a convention when passing nil or empty objects (or a flag) is to add the parameter name as an inline comment. for example:
false, /* forceDecisionWithoutStats */ |
we should have had that for the empty string region argument that was here before too. could you clean up both?
Previously, rafiss (Rafi Shamim) wrote…
I definitely didn't refactor the name return value out then decide I should reduce splash. Put up a separate fix. |
8fd1a6a
to
4083c75
Compare
4083c75
to
f8f073c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/sql/catalog/descs/collection.go
line 345 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
super nit: a golang convention is to begin function-level comments with the name of the function:
// MaybeMarkVersionBump provides a defer-friendly...
Done.
pkg/sql/catalog/lease/lease.go
line 521 at r3 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
super nit: a name like
WaitForNewVersionPropagated
(or even justWaitForNewVersion
) feels more intuitive to me.
Done.
pkg/sql/catalog/lease/lease.go
line 526 at r3 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
we also should be able to add a more targeted test (e.g.
TestWaitForCurrentVersionPropagated
) in pkg/sql/catalog/lease/lease_test.go. this test also uses an in-memory test server, but rather than just testing with SQL commands, it also provides a way to access the LeaseManager of each node and make finer-grained assertions. take a look atTestLeaseManagerDrain
for example.
Added a basic test on the blocking behavior. Let me know what you think.
f8f073c
to
a2c2b8d
Compare
a2c2b8d
to
3cda84e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 14 of 14 files at r13, 1 of 1 files at r14, 1 of 1 files at r15, 14 of 14 files at r16, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/sql/table.go
line 372 at r16 (raw file):
versionBump := p.descCollection.MaybeMarkVersionBump(tableDesc, isVersionBump) defer versionBump()
Maybe a comment here why its needed and lets simplify it to
defer p.descCollection.MaybeMarkVersionBump(tableDesc, isVersionBump)()
pkg/sql/catalog/lease/lease.go
line 592 at r15 (raw file):
if hasTimeout, timeout := prober.GetProbeTimeout(); hasTimeout { err = timeutil.RunWithTimeout(ctx, "active-descriptor-leases-by-region", timeout, func(ctx context.Context) error { currSessionIDs, err = getSessionsHoldingDescriptor(ctx, txn, descriptorId, &currVersion, region)
I think we can get the same results with a SQL count query per-region that does a join? It might be cleaner all the post processing we need here
pkg/sql/catalog/lease/lease_test.go
line 600 at r16 (raw file):
t.expectLeases(descID, "/1/1 /1/2 /2/1 /2/2 /2/3") timeoutCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
Maybe do no timeout, I wonder if this could flake under stress.
3cda84e
to
c4e04e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @rafiss)
pkg/sql/table.go
line 372 at r16 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Maybe a comment here why its needed and lets simplify it to
defer p.descCollection.MaybeMarkVersionBump(tableDesc, isVersionBump)()
Done.
pkg/sql/catalog/lease/lease.go
line 592 at r15 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
I think we can get the same results with a SQL count query per-region that does a join? It might be cleaner all the post processing we need here
Done.
pkg/sql/catalog/lease/lease_test.go
line 600 at r16 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Maybe do no timeout, I wonder if this could flake under stress.
Done.
9eda24e
to
74975f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work! all the main components look good, and my comments are mostly about testing, comments, and error handling.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @fqazi)
pkg/sql/tests/allow_user_create_during_transaction_test.go
line 68 at r18 (raw file):
require.Equal(t, 0, count) barTx, err := barDB.BeginTx(ctx, nil)
nit: let's not use a transaction for the barDB
operations. we can just do barDB.Exec("CREATE USER biz")
that will let us get rid of the special case handling for "unexpected transaction status idle"
also, there's no need to use a goroutine/go
here. since the operation doesn't block, it should just work in the main thread. (if it does block due to something that causes CRDB to revert to the behavior pre-this-PR, then that statement will hang, and this test will fail anyway.)
pkg/sql/tests/allow_user_create_during_transaction_test.go
line 96 at r18 (raw file):
// TODO(#150858): add testing for blocking when there is more than a version // bump on a table in the transaction t.Run("create user is blocked in transaction with blocking operations", func(t *testing.T) {
i think we don't really need this test case.
pkg/sql/catalog/lease/lease.go
line 533 at r18 (raw file):
now := m.storage.clock.Now() descs := []IDVersion{NewIDVersionPrev(desc.GetName(), desc.GetID(), desc.GetVersion())} detail, err := countLeasesWithDetail(ctx, m.storage.db, m.Codec(), regions, m.settings, descs, now, false)
nit: keep the /*forAnyVersion*/
comment
pkg/sql/catalog/lease/lease.go
line 562 at r18 (raw file):
) (catalog.Descriptor, error) { if retryOpts.MaxRetries != 0 { return nil, errors.New("The MaxRetries option shouldn't be set")
nit: let's make it say "shouldn't be set in WaitForNewVersion" in this error and the one below so it's easier to debug.
pkg/sql/catalog/lease/lease.go
line 616 at r18 (raw file):
if !success { // quit early if region == "" { log.Infof(ctx, "Region has %d sessions holding stale descriptor", regionStaleSessionCount)
nit: let's just remove "region" from the log message in this case
pkg/sql/catalog/lease/lease_test.go
line 564 at r18 (raw file):
SQLLeaseManager: &lease.ManagerTestingKnobs{ LeaseStoreTestingKnobs: lease.StorageTestingKnobs{ LeaseAcquiredEvent: nil, // TODO
just curious what we want here?
pkg/bench/rttanalysis/testdata/benchmark_expectations
line 30 at r18 (raw file):
11,AlterTableUnsplit/alter_table_unsplit_at_2_values 14,AlterTableUnsplit/alter_table_unsplit_at_3_values 2,Audit/select_from_an_audit_table
nit: keep this one as 2-4
, it will make it less flaky to allow a range.
pkg/bench/rttanalysis/testdata/benchmark_expectations
line 64 at r18 (raw file):
12,GrantRole/grant_1_role 16,GrantRole/grant_2_roles 13,Jobs/cancel_job
nit: let's not adjust any of the Jobs
expectations, they shouldn't be affected.
pkg/sql/table.go
line 312 at r18 (raw file):
} return p.writeTableDesc(ctx, tableDesc, true)
super nit: add an inline comment, like: p.writeTableDescToBatch(ctx, tableDesc, b, true /* isVersionBump */)
another approach would be to make writeTableDesc
take a varargs "options" parameter and make an option for opting into the "version bump only" behavior.
an example is here:
Lines 1949 to 1956 in 0c8f3ac
// Txn is used to run queries with internal executor in a transactional | |
// manner. | |
func (ief *InternalDB) Txn( | |
ctx context.Context, f func(context.Context, isql.Txn) error, opts ...isql.TxnOption, | |
) error { | |
wrapped := func(ctx context.Context, txn *internalTxn) error { return f(ctx, txn) } | |
return ief.txn(ctx, wrapped, opts...) | |
} |
see the isql.TxnOption
interface. a writeup of this pattern is here: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1881669848/Go+function+argument+patterns
up to you if you want to just add an inline comment or use the optional argument pattern.
pkg/sql/table.go
line 326 at r18 (raw file):
tableDesc.Name) } return p.writeTableDescToBatch(ctx, tableDesc, b, false)
ditto about an inline comment or using an optional argument
pkg/sql/table.go
line 337 at r18 (raw file):
} } return p.writeTableDesc(ctx, tableDesc, false)
ditto about an inline comment or using an optional argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just one more thing I noticed.
Reviewed 14 of 14 files at r17, 14 of 14 files at r18, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @bghal and @rafiss)
pkg/sql/catalog/lease/lease.go
line 275 at r18 (raw file):
b := strings.Builder{} // Counts sessions that have previous version of the descriptor but not the current version
Nice!
pkg/sql/catalog/lease/lease.go
line 286 at r18 (raw file):
WHERE l2.desc_id = l1.desc_id AND l2.session_id = l1.session_id AND l2.version = %d
We need the region filter for l2, otherwise I think this is a full scan of the table, not just the prefix for the region. Probably can be confirmed by running an EXPLAIN on the query.
This is related to the regionliveness prober logic in the caller, we shouldn't be scanning data from unrelated regions.
7286e65
to
1546513
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @fqazi and @rafiss from 12 discussions.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @rafiss)
pkg/sql/table.go
line 312 at r18 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
super nit: add an inline comment, like:
p.writeTableDescToBatch(ctx, tableDesc, b, true /* isVersionBump */)
another approach would be to make
writeTableDesc
take a varargs "options" parameter and make an option for opting into the "version bump only" behavior.an example is here:
Lines 1949 to 1956 in 0c8f3ac
// Txn is used to run queries with internal executor in a transactional // manner. func (ief *InternalDB) Txn( ctx context.Context, f func(context.Context, isql.Txn) error, opts ...isql.TxnOption, ) error { wrapped := func(ctx context.Context, txn *internalTxn) error { return f(ctx, txn) } return ief.txn(ctx, wrapped, opts...) } see the
isql.TxnOption
interface. a writeup of this pattern is here: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1881669848/Go+function+argument+patternsup to you if you want to just add an inline comment or use the optional argument pattern.
Went with the options approach.
pkg/sql/table.go
line 326 at r18 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
ditto about an inline comment or using an optional argument
Done.
pkg/sql/table.go
line 337 at r18 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
ditto about an inline comment or using an optional argument
Done.
pkg/sql/catalog/lease/lease.go
line 286 at r18 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
We need the region filter for l2, otherwise I think this is a full scan of the table, not just the prefix for the region. Probably can be confirmed by running an EXPLAIN on the query.
This is related to the regionliveness prober logic in the caller, we shouldn't be scanning data from unrelated regions.
Done.
pkg/sql/catalog/lease/lease.go
line 616 at r18 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: let's just remove "region" from the log message in this case
Pet peeve of mine is beginning sentences with digits.
pkg/sql/catalog/lease/lease_test.go
line 564 at r18 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
just curious what we want here?
Meant to cleanup was using that in a previous version.
pkg/sql/tests/allow_user_create_during_transaction_test.go
line 68 at r18 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: let's not use a transaction for the
barDB
operations. we can just dobarDB.Exec("CREATE USER biz")
that will let us get rid of the special case handling for "unexpected transaction status idle"
also, there's no need to use a goroutine/
go
here. since the operation doesn't block, it should just work in the main thread. (if it does block due to something that causes CRDB to revert to the behavior pre-this-PR, then that statement will hang, and this test will fail anyway.)
Yeah did that to fail fast but fine to drop/
pkg/sql/tests/allow_user_create_during_transaction_test.go
line 96 at r18 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i think we don't really need this test case.
Done.
Currently, the `CREATE USER` and the `GRANT role` operations use a bumped descriptor to invalidate caches of the user table. This blocked those operations until long-running transactions using the old table version committed. This change adds special handling of version bumps to wait for visibility of the change rather than convergence across the cluster. It no longer creates (blocking) jobs for the version bumps. Epic: CRDB-49398 Fixes: cockroachdb#138691 Release note (sql change): The `CREATE USER` and `GRANT role` operations now wait for full-cluster visibility of the new user table version rather than blocking on convergence.
1546513
to
65c953b
Compare
Realizing this Code quote: defer p.descCollection.MaybeMarkVersionBump(tableDesc, options.isVersionBump)()
return p.Descriptors().WriteDescToBatch( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work! very exciting that we'll finally remove a highly visible source of latency, and i think we'll probably end up being able to use the new WaitForNewVersion
function in a few other cases longer term.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bghal and @fqazi)
pkg/sql/table.go
line 394 at r20 (raw file):
Previously, bghal wrote…
Realizing this
MaybeMarkVersionBump
behavior might make sense as an option forWriteDescToBatch
lmk if you think I should move that down there.
yeah, i think that's a good idea -- WriteDescToBatch
is exported, so it's usable by other packages, and it would be nice to have this option available. this means that the isVersionBump
also should become exported (upper-cased).
pkg/sql/catalog/lease/lease.go
line 526 at r3 (raw file):
Previously, bghal wrote…
Added a basic test on the blocking behavior. Let me know what you think.
it looks good!
bors r+ |
The test on `WaitForNewVersion` was flaking due to timeouts. The tests around the `allow_role_memberships_to_change_during_transaction` session variable were no longer valid from cockroachdb#150747. Fixes: cockroachdb#152050, cockroachdb#152051 Informs: cockroachdb#152094
The test on `WaitForNewVersion` was flaking due to timeouts. The tests around the `allow_role_memberships_to_change_during_transaction` session variable were no longer valid from cockroachdb#150747. Fixes: cockroachdb#152050, cockroachdb#152051 Informs: cockroachdb#152094 Release note: None
The test on `WaitForNewVersion` was flaking due to timeouts. The tests around the `allow_role_memberships_to_change_during_transaction` session variable were no longer valid from cockroachdb#150747. Fixes: cockroachdb#152050, cockroachdb#152051, cockroachdb#152043 Informs: cockroachdb#152094 Release note: None
The test on `WaitForNewVersion` was flaking due to timeouts. The tests around the `allow_role_memberships_to_change_during_transaction` session variable were no longer valid from cockroachdb#150747. Fixes: cockroachdb#152050, cockroachdb#152051, cockroachdb#152043 Informs: cockroachdb#152094 Release note: None
The test on `WaitForNewVersion` was flaking due to timeouts. The tests around the `allow_role_memberships_to_change_during_transaction` session variable were no longer valid from cockroachdb#150747. Fixes: cockroachdb#152050, cockroachdb#152051, cockroachdb#152043 Informs: cockroachdb#152094 Release note: None
The test on `WaitForNewVersion` was flaking due to timeouts. The tests around the `allow_role_memberships_to_change_during_transaction` session variable were no longer valid from cockroachdb#150747. Fixes: cockroachdb#152050, cockroachdb#152051, cockroachdb#152043 Informs: cockroachdb#152094 Release note: None
sql: unblock user table changes from long-running transactions
Currently, the
CREATE USER
and theGRANT role
operations use abumped descriptor to invalidate caches of the user table.
This blocked those operations until long-running transactions using the
old table version committed.
This change adds special handling of version bumps to wait for
visibility of the change rather than convergence across the cluster. It
no longer creates (blocking) jobs for the version bumps.
Epic: CRDB-49398
Fixes: #138691
Release note (sql change): The
CREATE USER
andGRANT role
operationsnow wait for full-cluster visibility of the new user table version
rather than blocking on convergence.