Panic when running UNION on SHOW BACKUP queries as non-root user #88385

dasrirez · 2022-09-21T20:11:07Z

Describe the problem

If a query issued by a non-root user UNIONs the result of multiple SHOW BACKUP queries, then CRDB crashes with an exit code of 2.

To Reproduce

Create a user and grant perms.

root@localhost:26257/defaultdb> create user test;
CREATE ROLE


Time: 25ms total (execution 25ms / network 0ms)

root@localhost:26257/defaultdb> grant admin to test;
GRANT


Time: 29ms total (execution 29ms / network 0ms)

Take a backup

test@localhost:26257/defaultdb> BACKUP INTO 'userfile:///backups';
        job_id       |  status   | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+------+---------------+--------
  798619757264404481 | succeeded |                  1 |   24 |             9 |  1636
(1 row)


Time: 466ms total (execution 466ms / network 0ms)

test@localhost:26257/defaultdb> SHOW BACKUPS in 'userfile:///backups';
          path
-------------------------
  /2022/09/21-195008.34
(1 row)


Time: 23ms total (execution 23ms / network 0ms)

Union the result of 2 SHOW BACKUP queries, this might take a few tries to get the race condition to occur.

test@localhost:26257/defaultdb> SELECT * FROM [SHOW BACKUP LATEST IN 'userfile:///backups'] UNION SELECT * FROM [SHOW BACKUP LATEST IN 'userfile:///backups'];
ERROR: unexpected EOF
warning: error retrieving the transaction status: connection closed unexpectedly: conn closed
warning: connection lost!
opening new connection: all session settings will be lost
warning: error retrieving the database name: failed to connect to `host=localhost user=test database=`: dial error (dial tcp [::1]:26257: connect: connection refused)

CRDB should have crashed. The interesting stack traces should be the ones that include the backup call.

I220921 13:50:12.719035 1 util/log/flags.go:211  [-] 1  stderr capture started
fatal error: concurrent map read and map write

goroutine 11141 [running]:
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.(*StoredCatalog).UpdateValidationLevel(0xc005dc0580, {0x7fd447d2a088, 0xc0000d5ef0}, 0x7)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/stored_catalog.go:374 +0x94
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).finalizeDescriptors(0xc005dc04e0, {0x5f95168, 0xc001b42720}, 0x0?, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:395 +0x45b
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getByName(0xc005dc04e0, {0x5f95168, 0xc001b42720}, 0xc0028de2c0, {0x0?, 0x0?}, {0x0?, 0x0?}, {0x510253f, 0x6}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:324 +0x5c5
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getDatabaseByName(0x203000?, {0x5f95168?, 0xc001b42720?}, 0x0?, {0x510253f, 0x6}, {0x0, 0x0, 0x0, 0x0, ...})
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/database.go:64 +0x88
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetImmutableDatabaseByName(...)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/database.go:45
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getObjectByNameIgnoringRequiredAndType(0xc005dc04e0, {0x5f95168, 0xc001b42720}, 0xc005868f58?, {0x510253f, 0x6}, {0x5101fc9, 0x6}, {0x511b386, 0xc}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/object.go:160 +0x13d
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getObjectByName(0x2?, {0x5f95168, 0xc001b42720}, 0x0?, {0x510253f?, 0xc45f80?}, {0x5101fc9?, 0x1945786?}, {0x511b386, 0xc}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/object.go:65 +0x1b8
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getTableByName(0x1000000000000?, {0x5f95168, 0xc001b42720}, 0x17303c0?, {0x5fc7720, 0x9809800}, {{0x1, 0x0, 0x0, 0x0, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/table.go:53 +0x119
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetImmutableTableByName(0xc002e9ce90?, {0x5f95168?, 0xc001b42720?}, 0xc003207400?, {0x5fc7720?, 0x9809800?}, {{0x1, 0x0, 0x0, 0x0, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/table.go:44 +0x4e
github.com/cockroachdb/cockroach/pkg/sql.MemberOfWithAdminOption({0x5f95168?, 0xc001b42720}, 0xc0013c0a00, {0x5fe4430?, 0xc0005a33b0}, 0xc0058690a8?, 0xc0028de2c0, {{0xc00180c040?, 0x0?}})
	github.com/cockroachdb/cockroach/pkg/sql/authorization.go:462 +0x14c
github.com/cockroachdb/cockroach/pkg/sql.(*planner).MemberOfWithAdminOption(0x0?, {0x5f95168?, 0xc001b42720?}, {{0xc00180c040?, 0x0?}})
	github.com/cockroachdb/cockroach/pkg/sql/authorization.go:433 +0x65
github.com/cockroachdb/cockroach/pkg/sql.(*planner).UserHasAdminRole(0x0?, {0x5f95168?, 0xc001b42720?}, {{0xc00180c040?, 0x0?}})
	github.com/cockroachdb/cockroach/pkg/sql/authorization.go:393 +0x88
github.com/cockroachdb/cockroach/pkg/cloud/cloudprivilege.CheckDestinationPrivileges({0x5f95168, 0xc001b42720}, {0x6025240, 0xc001b1d0d0}, {0xc000529820, 0x1, 0x0?})
	github.com/cockroachdb/cockroach/pkg/cloud/cloudprivilege/privileges.go:29 +0x94
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.showBackupPlanHook.func1({0x5f95168, 0xc0083a2db0}, {0x0?, 0x0?, 0xf0f00c00a3cbf68?}, 0x5f95168?)
	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/show.go:279 +0x30a
github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec.func1()
	github.com/cockroachdb/cockroach/pkg/sql/planhook.go:184 +0xa4
created by github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec
	github.com/cockroachdb/cockroach/pkg/sql/planhook.go:182 +0x1b2

Expected behavior

There should be no crash!

Additional data / screenshots

Log directory: https://upload.cockroachlabs.com/receive/?packageCode=M1KruQE98L65Cm1oy6GPshIXN29XnAdZeaRNvTreTgw#keycode=GgD909dEchER8yB1YbmTUiksTlrL24aaRknToo7MrXk

Environment:

Build Tag:        v22.2.0-alpha.3-265-gedeefa9120
Build Time:       2022/09/20 07:10:37
Distribution:     CCL
Platform:         linux amd64 (x86_64-pc-linux-gnu)
Go Version:       go1.19.1
C Compiler:       gcc 6.5.0
Build Commit ID:  edeefa9120e86739808ca43fc52d4cbac813fab3
Build Type:       release

Additional context

This was discovered while running some performance tests on backup queries.

Jira issue: CRDB-19797

The text was updated successfully, but these errors were encountered:

postamar · 2022-09-21T21:01:08Z

Thanks for reporting this issue, it's a very interesting bug.

postamar · 2022-09-22T21:08:32Z

I think the fundamental problem here is that:

these SHOW BACKUP statements run in a hookFnNode plan node, because that's how CCL statements are handled,
their startExec methods spawn a goroutine to do the work
the work involves accessing the collection and doing stuff that's generally not thread-safe
this doesn't have an impact most of the time because we usually don't deal with multiple CCL statements concurrently
except here and what's still a mystery to me is why didn't this fail earlier!

In any case synchronizing the goroutines to make sure they run in a mutually-exclusive manner for each transaction seems to be the correct thing to do.

ajwerner · 2022-09-23T14:27:30Z

Okay, this one is good!

The root cause, in effect, is the introduction in 22.2 to check the privilege on the cloud URL here:

cockroach/pkg/ccl/backupccl/show.go

Lines 279 to 281 in ef4656f

    
           if err := cloudprivilege.CheckDestinationPrivileges(ctx, p, dest); err != nil { 
        
           	return err 
        
           }

The problem is that this privilege check happens in the returned callback and not in the setup function. The returned callback is called in parallel with other parts of execution, and is not safe to concurrently resolve descriptors. Resolving descriptors needs to happen during the planning part of the plan hook.

Now, there's some complexity here in two dimensions:

We do some lazy computation I don't fully understand to resolve the urls into strings and we do that in the async callback. We'll need to figure out how to get the URLs we want to authenticate before the callback gets run -- or we'll need to resolve the complete set of privileges earlier
The privilege checking code is far too lazy and composed. I filed an issue about this in catalog: break eval dependency #88400. The problem is that we reach back around to the planner from the catalog layer in an unsatisfying way that hides access to a non-thread-safe resource. What we should do instead, as described in that issue, is explicitly resolve the ExternalConnectionPrivileges into a privilege descriptor. We should do that in the plan hook itself and not in the returned closure.

If we did that, then we'd be accessing the collection during planning time and not execution time, as it was intended. I don't have enough context above on how these inCol and toFn callbacks work, so we'll need some help from @cockroachdb/disaster-recovery (maybe @adityamaru or @dt) to figure out how to get our hands on URLs to authorize during planning and then @RichardJCai, your ongoing work should make it straightforward to lift this privilege checking.

blathers-crl · 2022-09-23T14:41:44Z

cc @cockroachdb/disaster-recovery

dt · 2022-09-26T14:34:47Z

figure out how to get our hands on URLs to authorize during planning

I believe this is impossible: URIs can appear in placeholders, so we don't have their values until eval?

ajwerner · 2022-09-26T15:22:08Z

It seems to me like if we lease the relevant descriptors (which are known and finite) then, if I understand correctly, we won't need to resolve anything new during execution.

dt · 2022-09-26T22:28:10Z

How do we know the descriptors though? like, just lease all possible synthetic / privilege descriptors? Until we look at the URI, we don't know if it is a named sink, or a userfile? If it ends up userfile, we need to do priv checks on the table being written to, which can be any table name so it seems like we'd need to have leased every table?

ajwerner · 2022-09-26T22:59:35Z

The synthetic privileges aren't themselves descriptors. They are data in a table and there's a cache associated with that data. We bump the table descriptor version for the table when we write to it in order to achieve cache coherence. This is the same pattern followed also for users and user_roles. There's a constant number of descriptors to lease (2?)

dt · 2022-09-27T04:05:45Z

But what if the uri is to a userfile location? Until we parse the host we don't even know what table name to resolve let along priv check

RichardJCai · 2022-09-27T14:30:00Z

But what if the uri is to a userfile location? Until we parse the host we don't even know what table name to resolve let along priv check

We always have to resolve the same two tables system.privileges and system.external_connections

ajwerner · 2022-09-27T15:44:49Z

But what if the uri is to a userfile location? Until we parse the host we don't even know what table name to resolve let along priv check

Okay, that's a problem. I think @dt is noting that the issue here is that we also need to resolve the userfile table. I don't really know how that happens. Fundamentally, we have an issue here. The planning layer is not safe for concurrent access and the catalog is part of the planning layer.

ajwerner · 2022-09-27T15:48:25Z

The way that the optimizer works here is that it stores all of the dependencies in its prepared plan and then it checks those dependencies and re-plans in the context of the planner. I think the good, or bad, news is that we already had the userfile bug and it's pretty deep, and people don't seem to use userfile much, let alone in the context of a UNION in SHOW BACKUPS. If they did, they'll hit this bug. I think we're better off starting with a partial fix for the common case and then coming back to evaluate this more general, but nevertheless severe problem.

Prior to this change, plan hooks would be invokved twice, once during prepare and again during execution. At prepare time, the placeholder values are not known, but, critically, the placeholders were being type-checked. This type checking informs the sql layer how to interpret the placeholder values when we are trying to bind them. Importantly, the sql layer does not ever use the `planNode` produced via the `PlanHookRowFn` returned from the planning invocation. The above dance meant that these plan hooks had quite a bit of complexity to eagerly type check but lazily evaluate expressions passed in via the statement. This complexity became a problem more recently: one cannot and should not access many planner data structures from the row func because such functions can, theoretically, run concurrently and these data structures are not thread safe. Note that the above problem is not commonly a real problem. Most plan hook statements don't permit concurrency and those than do, via, say, `UNION ALL` don't commonly get executed in such a setting. In some ways, the real reason for this change is to unwind some unneeded complexity and clarify the bounds of what one can do in a plan hook. The big change which unearthed the problem is that we now support privileges on external URLs. These URLs may be provided via placeholders or expressions. If we couldn't access the expressions until we were in the row fn, then we'd be at risk of performing privilege checking concurrently, which is not safe. One thing that this commit does is introduce a new package `evalexpr` with helpers for type checking. It is so named because I'd like to also remove the methods from the `PlanHookState` which we now use for expr evaluation and place them in this new package. I don't love the name. Fixes cockroachdb#88385 Release note: None

Prior to this change, plan hooks would be invokved twice, once during prepare and again during execution. At prepare time, the placeholder values are not known, but, critically, the placeholders were being type-checked. This type checking informs the sql layer how to interpret the placeholder values when we are trying to bind them. Importantly, the sql layer does not ever use the `planNode` produced via the `PlanHookRowFn` returned from the planning invocation. The above dance meant that these plan hooks had quite a bit of complexity to eagerly type check but lazily evaluate expressions passed in via the statement. This complexity became a problem more recently: one cannot and should not access many planner data structures from the row func because such functions can, theoretically, run concurrently and these data structures are not thread safe. Note that the above problem is not commonly a real problem. Most plan hook statements don't permit concurrency and those than do, via, say, `UNION ALL` don't commonly get executed in such a setting. In some ways, the real reason for this change is to unwind some unneeded complexity and clarify the bounds of what one can do in a plan hook. The big change which unearthed the problem is that we now support privileges on external URLs. These URLs may be provided via placeholders or expressions. If we couldn't access the expressions until we were in the row fn, then we'd be at risk of performing privilege checking concurrently, which is not safe. One thing that this commit does is introduce a new package `exprutil` with helpers for type checking. It is so named because I'd like to also remove the methods from the `PlanHookState` which we now use for expr evaluation and place them in this new package. Fixes cockroachdb#88385 Release note: None

90254: sql,*: rework plan hooks to type check separately from execution r=ajwerner a=ajwerner Prior to this change, plan hooks would be invoked twice, once during prepare and again during execution. At prepare time, the placeholder values are not known, but, critically, the placeholders were being type-checked. This type checking informs the sql layer how to interpret the placeholder values when we are trying to bind them. Importantly, the sql layer does not ever use the `planNode` produced via the `PlanHookRowFn` returned from the planning invocation. The above dance meant that these plan hooks had quite a bit of complexity to eagerly type check but lazily evaluate expressions passed in via the statement. This complexity became a problem more recently: one cannot and should not access many planner data structures from the row func because such functions can, theoretically, run concurrently and these data structures are not thread safe. Note that the above problem is not commonly a real problem. Most plan hook statements don't permit concurrency and those than do, via, say, `UNION ALL` don't commonly get executed in such a setting. In some ways, the real reason for this change is to unwind some unneeded complexity and clarify the bounds of what one can do in a plan hook. The big change which unearthed the problem is that we now support privileges on external URLs. These URLs may be provided via placeholders or expressions. If we couldn't access the expressions until we were in the row fn, then we'd be at risk of performing privilege checking concurrently, which is not safe. One thing that this commit does is introduce a new package `exprutil` with helpers for type checking. It is so named because I'd like to also remove the methods from the `PlanHookState` which we now use for expr evaluation and place them in this new package. Fixes #88385 Release note: None Co-authored-by: Andrew Werner <awerner32@gmail.com>

Prior to this change, plan hooks would be invokved twice, once during prepare and again during execution. At prepare time, the placeholder values are not known, but, critically, the placeholders were being type-checked. This type checking informs the sql layer how to interpret the placeholder values when we are trying to bind them. Importantly, the sql layer does not ever use the `planNode` produced via the `PlanHookRowFn` returned from the planning invocation. The above dance meant that these plan hooks had quite a bit of complexity to eagerly type check but lazily evaluate expressions passed in via the statement. This complexity became a problem more recently: one cannot and should not access many planner data structures from the row func because such functions can, theoretically, run concurrently and these data structures are not thread safe. Note that the above problem is not commonly a real problem. Most plan hook statements don't permit concurrency and those than do, via, say, `UNION ALL` don't commonly get executed in such a setting. In some ways, the real reason for this change is to unwind some unneeded complexity and clarify the bounds of what one can do in a plan hook. The big change which unearthed the problem is that we now support privileges on external URLs. These URLs may be provided via placeholders or expressions. If we couldn't access the expressions until we were in the row fn, then we'd be at risk of performing privilege checking concurrently, which is not safe. One thing that this commit does is introduce a new package `exprutil` with helpers for type checking. It is so named because I'd like to also remove the methods from the `PlanHookState` which we now use for expr evaluation and place them in this new package. Fixes cockroachdb#88385 Release note: None

dasrirez added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 21, 2022

postamar self-assigned this Sep 21, 2022

postamar added GA-blocker branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Sep 21, 2022

blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Sep 21, 2022

cockroachdb deleted a comment from blathers-crl bot Sep 21, 2022

postamar removed their assignment Sep 22, 2022

blathers-crl bot added the T-disaster-recovery label Sep 23, 2022

RichardJCai mentioned this issue Sep 27, 2022

sql: refactor privilege to remove eval dep from catalog #88787

Closed

postamar assigned RichardJCai Oct 4, 2022

celiala added the blocks-22.2.0-rc.1 label Oct 17, 2022

postamar assigned postamar and unassigned RichardJCai Oct 17, 2022

ajwerner mentioned this issue Oct 19, 2022

sql,*: rework plan hooks to type check separately from execution #90254

Merged

celiala added the branch-release-22.2.0 label Oct 20, 2022

craig bot closed this as completed in c83dda7 Oct 20, 2022

ajwerner mentioned this issue Oct 20, 2022

release-22.2.0: sql,*: rework plan hooks to type check separately from execution #90414

Merged

ajwerner mentioned this issue Oct 20, 2022

release-22.2: sql,*: rework plan hooks to type check separately from execution #90415

Merged

exalate-issue-sync bot removed the T-disaster-recovery label Oct 21, 2022

healthy-pod added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 17, 2023

exalate-issue-sync bot added branch-release-22.2.0 and removed todo-deprecate.branch-release-22.2.0 labels May 23, 2023

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Panic when running UNION on SHOW BACKUP queries as non-root user #88385

Panic when running UNION on SHOW BACKUP queries as non-root user #88385

dasrirez commented Sep 21, 2022 •

edited by jlinder

Loading

postamar commented Sep 21, 2022

postamar commented Sep 22, 2022

ajwerner commented Sep 23, 2022

blathers-crl bot commented Sep 23, 2022

dt commented Sep 26, 2022

ajwerner commented Sep 26, 2022

dt commented Sep 26, 2022

ajwerner commented Sep 26, 2022

dt commented Sep 27, 2022

RichardJCai commented Sep 27, 2022

ajwerner commented Sep 27, 2022 •

edited

Loading

ajwerner commented Sep 27, 2022

Panic when running UNION on SHOW BACKUP queries as non-root user #88385

Panic when running UNION on SHOW BACKUP queries as non-root user #88385

Comments

dasrirez commented Sep 21, 2022 • edited by jlinder Loading

postamar commented Sep 21, 2022

postamar commented Sep 22, 2022

ajwerner commented Sep 23, 2022

blathers-crl bot commented Sep 23, 2022

dt commented Sep 26, 2022

ajwerner commented Sep 26, 2022

dt commented Sep 26, 2022

ajwerner commented Sep 26, 2022

dt commented Sep 27, 2022

RichardJCai commented Sep 27, 2022

ajwerner commented Sep 27, 2022 • edited Loading

ajwerner commented Sep 27, 2022

dasrirez commented Sep 21, 2022 •

edited by jlinder

Loading

ajwerner commented Sep 27, 2022 •

edited

Loading