-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
catalog/lease: block liveness heartbeats if leases cannot be written #119739
Conversation
b0410b8
to
a2d8bf6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)
pkg/sql/catalog/lease/storage.go
line 235 at r1 (raw file):
// Run a retry loop to deal with AmbiguousResultErrors. All other error types // are propagated up to the caller. maxRetryDuration := s.jitteredLeaseDuration()
nit: i don't follow why this defaults to s.jitteredLeaseDuration()
but the else
block sets this to 0. also, why does it subtract the ttl time? some comments could help
(same for the release
logic below)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 131 at r1 (raw file):
// blockExtensions indicates if extensions are disallowed because of availability // issues on dependent tables. blockExtensions int
nit: maybe call it blockedExtensions
to make it more clear it's a count
pkg/sql/sqlliveness/slinstance/slinstance.go
line 257 at r1 (raw file):
// If extensions are disallowed we are only going to heartbeat the same // timestamp.
is it important to hearbeat the same timestamp, rather than just short-circuit return here?
pkg/sql/sqlliveness/slinstance/slinstance.go
line 466 at r1 (raw file):
firstToBlock := l.mu.blockExtensions == 0 l.mu.blockExtensions++ defer l.mu.Unlock()
nit: the defer should be immediately after the Lock. (same for the next function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/sql/catalog/lease/storage.go
line 235 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: i don't follow why this defaults to
s.jitteredLeaseDuration()
but theelse
block sets this to 0. also, why does it subtract the ttl time? some comments could help(same for the
release
logic below)
Added a comment to clarify and fixed the variable names:
// Compute the maximum time we will retry ambiguous replica errors before
// disabling the SQL liveness heartbeat. The time chosen will guarantee that
// the sqlliveness TTL expires once the lease duration has surpassed.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 257 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
is it important to hearbeat the same timestamp, rather than just short-circuit return here?
The update logic also detects if the session has been deleted, so thats why I was heart beating the same timestamp. Updated the comment to:
// If extensions are disallowed we are only going to heartbeat the same
// timestamp, so that we can detect if the sqlliveness row was removed.
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
9755f01
to
bc419b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)
pkg/sql/catalog/lease/lease_test.go
line 3740 at r2 (raw file):
// failure, the retry happens, and there is just one lease. selectErr := make(chan error) go func() {
could this one use the ctxgroup also?
Previously, if we had any type of partition failure where a single node could not write / delete leases it could cause the schema changes to be blocked on descriptors held by it. This was made worse with the session based model where leases no longer expire, so any failure of this nature will last until the problematic node is restarted. To address this, this patch will prevent the sqlliveness from extending the expiration if the leases table cannot be accessed any longer. Fixes: cockroachdb#119654 Release note: None
bc419b4
to
3fe7bf8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/sql/catalog/lease/lease_test.go
line 3740 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
could this one use the ctxgroup also?
Done.
@rafiss TFTR! bors r+ |
Build succeeded: |
Previously, if we had any type of partition failure where a single node could not write / delete leases it could cause the schema changes to be blocked on descriptors held by it. This was made worse with the session based model where leases no longer expire, so any failure of this nature will last until the problematic node is restarted. To address this, this patch will prevent the sqlliveness from extending the expiration if the leases table cannot be accessed any longer.
Fixes: #119654
Fixes: #119347
Release note: None