Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

catalog/lease: block liveness heartbeats if leases cannot be written #119739

Merged
merged 1 commit into from
Mar 1, 2024

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Feb 28, 2024

Previously, if we had any type of partition failure where a single node could not write / delete leases it could cause the schema changes to be blocked on descriptors held by it. This was made worse with the session based model where leases no longer expire, so any failure of this nature will last until the problematic node is restarted. To address this, this patch will prevent the sqlliveness from extending the expiration if the leases table cannot be accessed any longer.

Fixes: #119654
Fixes: #119347

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@fqazi fqazi force-pushed the leaseSqllivenessRenewal branch 2 times, most recently from b0410b8 to a2d8bf6 Compare February 29, 2024 19:09
@fqazi fqazi marked this pull request as ready for review February 29, 2024 19:09
@fqazi fqazi requested a review from a team as a code owner February 29, 2024 19:09
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)


pkg/sql/catalog/lease/storage.go line 235 at r1 (raw file):

	// Run a retry loop to deal with AmbiguousResultErrors. All other error types
	// are propagated up to the caller.
	maxRetryDuration := s.jitteredLeaseDuration()

nit: i don't follow why this defaults to s.jitteredLeaseDuration() but the else block sets this to 0. also, why does it subtract the ttl time? some comments could help

(same for the release logic below)


pkg/sql/sqlliveness/slinstance/slinstance.go line 131 at r1 (raw file):

		// blockExtensions indicates if extensions are disallowed because of availability
		// issues on dependent tables.
		blockExtensions int

nit: maybe call it blockedExtensions to make it more clear it's a count


pkg/sql/sqlliveness/slinstance/slinstance.go line 257 at r1 (raw file):

	// If extensions are disallowed we are only going to heartbeat the same
	// timestamp.

is it important to hearbeat the same timestamp, rather than just short-circuit return here?


pkg/sql/sqlliveness/slinstance/slinstance.go line 466 at r1 (raw file):

	firstToBlock := l.mu.blockExtensions == 0
	l.mu.blockExtensions++
	defer l.mu.Unlock()

nit: the defer should be immediately after the Lock. (same for the next function)

Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/catalog/lease/storage.go line 235 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: i don't follow why this defaults to s.jitteredLeaseDuration() but the else block sets this to 0. also, why does it subtract the ttl time? some comments could help

(same for the release logic below)

Added a comment to clarify and fixed the variable names:

// Compute the maximum time we will retry ambiguous replica errors before
// disabling the SQL liveness heartbeat. The time chosen will guarantee that
// the sqlliveness TTL expires once the lease duration has surpassed.


pkg/sql/sqlliveness/slinstance/slinstance.go line 257 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

is it important to hearbeat the same timestamp, rather than just short-circuit return here?

The update logic also detects if the session has been deleted, so thats why I was heart beating the same timestamp. Updated the comment to:

// If extensions are disallowed we are only going to heartbeat the same
// timestamp, so that we can detect if the sqlliveness row was removed.

@fqazi fqazi requested a review from rafiss March 1, 2024 15:29
Copy link

blathers-crl bot commented Mar 1, 2024

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Mar 1, 2024
@fqazi fqazi force-pushed the leaseSqllivenessRenewal branch 3 times, most recently from 9755f01 to bc419b4 Compare March 1, 2024 15:44
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)


pkg/sql/catalog/lease/lease_test.go line 3740 at r2 (raw file):

	// failure, the retry happens, and there is just one lease.
	selectErr := make(chan error)
	go func() {

could this one use the ctxgroup also?

Previously, if we had any type of partition failure where a single node
could not write / delete leases it could cause the schema changes to be
blocked on descriptors held by it. This was made worse with the session
based model where leases no longer expire, so any failure of this nature
will last until the problematic node is restarted. To address this, this
patch will prevent the sqlliveness from extending the expiration if the
leases table cannot be accessed any longer.

Fixes: cockroachdb#119654

Release note: None
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/catalog/lease/lease_test.go line 3740 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

could this one use the ctxgroup also?

Done.

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 1, 2024

@rafiss TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 1, 2024

Build succeeded:

@craig craig bot merged commit 96ccd91 into cockroachdb:master Mar 1, 2024
8 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community
Projects
None yet
3 participants