Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlliveness: one multi-tenant server can delete another server's instance ID #90576

Closed
andreimatei opened this issue Oct 24, 2022 · 3 comments · Fixed by #90736
Closed

sqlliveness: one multi-tenant server can delete another server's instance ID #90576

andreimatei opened this issue Oct 24, 2022 · 3 comments · Fixed by #90736
Assignees
Labels
a-sql-liveness Related to the sqlliveness and sqlinstance packages. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@andreimatei
Copy link
Contributor

andreimatei commented Oct 24, 2022

There's a race between a server B stealing server A's instance ID and server A deleting its session ID. If the former wins the race, then A will delete B's instance ID record. What makes the race more likely is the fact that there's no tolerance for clock skew in the liveness code. So, if B has a fast clock, it will consider B's session expired and steal its instance potentially before A considers its own session expired.

This will cause the instances table to become out of sync with reality.

We should probably add an ownership check to the instance deletion transaction.

Jira issue: CRDB-20842

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-observability a-sql-liveness Related to the sqlliveness and sqlinstance packages. labels Oct 24, 2022
@andreimatei andreimatei self-assigned this Oct 24, 2022
@ajwerner
Copy link
Contributor

#90427 may indirectly help here.

@ajwerner
Copy link
Contributor

We should probably add an ownership check to the instance deletion transaction.

Another option is to use a CPut with a nil value, which is a deletion tombstone in cockroach sql. If we kept track of the value we used for the initial write it shouldn't be hard.

craig bot pushed a commit that referenced this issue Oct 26, 2022
90547: roachtest: import-cancellation: use a fixed seed; smaller import r=jbowens a=nicktrav

Currently, due to the randomness present in the test, there may be significant variance in test runs. In some situations, the nature of the random import cancellations may result in the test timing out. This makes the test flaky.

Use a fixed seed for the test's PRNG to make the test more deterministic.

Additionally, reduce the size of the overall import by lowering the number of files used to generate the `lineitem` table. This should also help stabilize the runtime.

Fix #90510.

Release note: None.

90720: Revert fmtsafe.go output r=knz a=smg260

TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them.

Release note: none.
Epic: none

90736: sqlinstance: don't pretend to release the instance ID on shutdown r=andreimatei a=andreimatei

Before this patch, the code looked like tenant SQL servers release their instance IDs on shutdown, but in fact they didn't. Instead of releasing on server shutdown, releasing the instance ID was done in a callback that was running under very specific circumstances - i.e. when the session heartbeat loop found the session record to have been deleted. Additionally, that callback also ran unreliably - it short-circuited if the session is not expired according to the local clock.

This patch does away with releasing the instance ID, on two arguments:
1. The instance ID was rarely and inconsistently released to begin with, as explained above.
2. The instance ID release code is buggy - it blindly deletes the instance ID, even if another server had stolen it. In fact, given when we were releasing, I think we were particularly likely to release IDs that had been stolen. This is very dangerous. Better to never do it than run this risk.

I've considered properly releasing the instance ID on shutdown, but I had trouble working out exactly when and where this can be done: you can't do it after stopper quiescing, and you don't want to do it too soon before that while the instance ID is still used by the server. I think properly releasing the ID will have to be coupled with a way of rejecting further uses of the ID. In the meantime I've left a TODO.

Fixes #90576

Release note: None

Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 26, 2022
Before this patch, the code looked like tenant SQL servers release their
instance IDs on shutdown, but in fact they didn't. Instead of releasing
on server shutdown, releasing the instance ID was done in a callback
that was running under very specific circumstances - i.e. when the
session heartbeat loop found the session record to have been deleted.
Additionally, that callback also ran unreliably - it short-circuited if
the session is not expired according to the local clock.

This patch does away with releasing the instance ID, on two arguments:
1. The instance ID was rarely and inconsistently released to begin with,
   as explained above.
2. The instance ID release code is buggy - it blindly deletes the
   instance ID, even if another server had stolen it. In fact, given
   when we were releasing, I think we were particularly likely to release
   IDs that had been stolen. This is very dangerous. Better to never do it
   than run this risk.

I've considered properly releasing the instance ID on shutdown, but I
had trouble working out exactly when and where this can be done: you
can't do it after stopper quiescing, and you don't want to do it too
soon before that while the instance ID is still used by the server. I
think properly releasing the ID will have to be coupled with a way of
rejecting further uses of the ID. In the meantime I've left a TODO.

Fixes cockroachdb#90576

Release note: None
@craig craig bot closed this as completed in 5948e4e Oct 26, 2022
@jaylim-crl
Copy link
Collaborator

jaylim-crl commented Oct 27, 2022

Hm, re: clock skew argument. Wouldn't it be the same case during SQL pod startup as well? Consider the case where we already have SQL pod A started up. If B starts up with a faster clock, there could be a possibility where it consider A's session to have expired, and then grab its instance ID. If at the moment, A's session gets extended, wouldn't we be in a bad state?

EDIT: Discussed offline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a-sql-liveness Related to the sqlliveness and sqlinstance packages. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants