
release-23.1: merge v23.1.9 tag into release branch #110246

Merged · 111 commits · Sep 8, 2023

Conversation

@dt (Member) commented Sep 8, 2023

v23.1.9 was tagged from a sub-branch off of release-23.1; this merges it back into the release branch so it appears in the lineage of subsequent commits.

Epic: none.
Release justification: noop.

nvanbenschoten and others added 30 commits July 27, 2023 18:05
Fixes cockroachdb#107352.

We allow rollbacks to be sent on a txn at any time. Before this change,
a rollback that encountered a COMMITTED txn would loudly log an error.
This commit removes that logging in such cases, retaining it
only for unexpected cases.

Release note: None
As observed in cockroachdb#108348, a test failed because
it timed out reading a row.  Yet, this flake
could not be reproduced in over 50k runs.
Bump timeout period to make this flake even
less likely.

Fixes cockroachdb#108348

Release note: None
The lease preferences roachtest could occasionally fail if the liveness
leaseholder was on a stopped node. We should address this issue; for
now, pin the liveness lease to a live node to prevent flakes.

Informs: cockroachdb#108512
Resolves: cockroachdb#108425
Release note: None
This commit removes the datadriven.RunTest call from
TestUnavailableZip, as this unit test is meant to test the
ability of debug zip to output while nodes are unavailable,
and not meant to assert the entire output is deterministic.

The commit also refactors the test cases to test output for
a cluster with a partially unavailable node and a fully
unavailable node. In the first case, we block replica
requests from the node to table data, simulating SQL
unavailability. In the second case, we block all replica
requests from the node, extending the unavailability to
liveness and status endpoints. To ensure the test doesn't
take too long, we reduce the debug zip timeout from 0.5s
to 0.25s.

Fixes cockroachdb#106961.

Release note: None
….9-rc-107199

release-23.1.9-rc: cli: simplify output assertion in TestUnavailableZip
`TestReplicateQueueLeasePreferencePurgatoryError` waits for every lease
to be on one store, before changing the lease preference, and asserting
the leaseholders end up in purgatory.

The test could occasionally time out before every lease was on the
initial store. The test forced replicate queue processing to speed this
up, but only on one store, even though all stores may need to transfer
leases.

Force replication scan and process on every store, instead of just the
initially preferred node. This speeds up this stage of the test.

Fixes: cockroachdb#108624
Release note: None
Previously, the TestSQLStatsRegions test failed for secondary tenants
because of the splits and lease transfers (for lease preferences):
we had to wait for the test table's ranges to split for each regional
row, and for each range's lease to move to the row's primary region.
With this change, we check that at least one leaseholder is present
in every region; if not, we force-enqueue a split for all ranges of
the test table.

Release note: None
…108634

release-23.1.9-rc: sql: fix TestSQLStatsRegions test
protobuf.js sets the default value of Long types to 0,
instead of Long.fromInt(0).

This is a workaround until we have a more comprehensive solution.

See https://github.com/cockroachdb/cockroach/blob/165960a9e0588e7046da44a4a4a45b78b9d4541f/pkg/ui/workspaces/cluster-ui/src/util/fixLong.ts#L13

Resolves cockroachdb#107905
Release note (bug fix): Fixes errors on the Sessions page UI
when a session's memory usage is zero bytes.
…onumber-fix

ui: fix occurrence of x.toNumber is not a function
logtags.AddTag calls context.WithValue, which creates a new
context. Further, here we were re-assigning this new context to the
variable holding the original context passed to the Resume function. As
a result, I believe that we were ending up with a constantly growing
chain of contexts.

We believe this is the cause of slowly growing CPU usage.

Release note (bug fix): Fix bug that could cause CPU usage to increase
over time.
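As an illustration of the pattern described above, here is a minimal Go sketch (the `resumeLoop` function is hypothetical, not the actual Resume implementation; it assumes the `cockroachdb/logtags` package's `AddTag` and `FromContext` helpers): re-assigning the tagged context to the long-lived loop variable wraps a new context on every iteration, while scoping the tagged context per iteration does not.

```go
package main

import (
	"context"
	"fmt"

	"github.com/cockroachdb/logtags"
)

// resumeLoop is a hypothetical stand-in for a long-running Resume loop.
func resumeLoop(ctx context.Context) {
	for i := 0; i < 3; i++ {
		// Buggy shape: ctx = logtags.AddTag(ctx, "attempt", i)
		// would wrap the previous ctx each time, growing the chain forever.

		// Fixed shape: derive a per-iteration context and let it go out of scope.
		iterCtx := logtags.AddTag(ctx, "attempt", i)
		fmt.Println(logtags.FromContext(iterCtx).String())
	}
}

func main() {
	resumeLoop(logtags.AddTag(context.Background(), "job", "example"))
}
```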
Previously, the cross-joiner wouldn't advance its internal state when
its left input projected no columns. This would result in an infinite
loop as the right rows were repeatedly emitted. This patch advances
the state when there are no left columns as if values from the left
side were emitted.

Fixes cockroachdb#105882

Release note (bug fix): Fixed a bug introduced in v22.1 that could cause
a join to infinite-loop in rare cases when (1) the join filter is not an
equality and (2) no columns from the left input are returned.
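A toy Go sketch of the failure mode (purely illustrative; this is not the colexec cross-joiner code): if advancing the left-row cursor is tied to copying left columns, a zero-column left input never advances and the right rows are emitted forever.

```go
package main

import "fmt"

// crossJoinState is a toy model of the emit loop's internal state.
type crossJoinState struct {
	leftIdx, leftRows, leftCols, rightRows int
}

// nextBatch emits the right rows for the current left row and advances state.
func (s *crossJoinState) nextBatch() (emitted int, done bool) {
	if s.leftIdx >= s.leftRows {
		return 0, true
	}
	if s.leftCols > 0 {
		// ... copy the left columns for row s.leftIdx into the output ...
	}
	emitted = s.rightRows
	// The fix: advance unconditionally, as if left values had been emitted.
	// The buggy version advanced only when left columns were copied, so with
	// leftCols == 0 it re-emitted the right rows indefinitely.
	s.leftIdx++
	return emitted, false
}

func main() {
	s := crossJoinState{leftRows: 2, leftCols: 0, rightRows: 3}
	total := 0
	for {
		n, done := s.nextBatch()
		if done {
			break
		}
		total += n
	}
	fmt.Println("emitted", total, "rows") // emitted 6 rows
}
```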
Replicas were enqueued into the replicate queue upon the store
receiving a span config update which could affect the replica. The
replicate queue's shouldQueue check is relatively more expensive than
that of other queues.
Introduce the cluster setting
kv.enqueue_in_replicate_queue_on_span_config_update.enabled, which when
set to true, enables queuing up replicas on span config updates; when
set to false, disables queuing replicas on span config updates.
By default, this setting is set to false.

Resolves: cockroachdb#108724

Release note (ops change): Introduce the
kv.enqueue_in_replicate_queue_on_span_config_update.enabled cluster
setting. When set to true, stores in the cluster will enqueue replicas
for replication changes, upon receiving config updates which could
affect the replica. This setting is off by default. Enabling this
setting speeds up how quickly config-triggered replication changes
begin, but adds additional CPU overhead. The overhead scales with the
number of leaseholders.
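For reference, a hedged sketch of how such a boolean cluster setting is typically registered in the codebase (the package name, variable name, setting class, and description below are assumptions for illustration; they are not taken from this PR):

```go
package kvserver

import "github.com/cockroachdb/cockroach/pkg/settings"

// enqueueOnSpanConfigUpdateEnabled is an assumed name for the sketch; the
// default of false matches the behavior described in the release note.
var enqueueOnSpanConfigUpdateEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly, // assumed class
	"kv.enqueue_in_replicate_queue_on_span_config_update.enabled",
	"when true, stores enqueue replicas for replication changes upon "+
		"receiving span config updates that could affect the replica",
	false,
)
```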
…rc-105931

release-23.1.9-rc: colexec: don't infinite loop cross-join with zero-column left input
Five years ago, in cockroachdb#26881, we changed import to retry on worker
failures, which made imports much more resilient to transient failures
like nodes going down. As part of this work we created
`TestImportWorkerFailure` which shuts down one node during an import,
and checks that the import succeeded. Unfortunately, this test was
checked in skipped because, though imports were much more resilient to
node failures, they were not completely resilient in every possible
scenario, making the test flaky.

Two months ago, in cockroachdb#105712, we unskipped this test and discovered that
in some cases the import statement succeeded but only imported a partial
dataset. This non-atomicity seems like a bigger issue than whether the
import is able to succeed in every possible transient failure scenario,
and is tracked separately in cockroachdb#108547.

This PR changes `TestImportWorkerFailure` to remove successful import as
a necessary condition for test success. Instead, the test now only
checks whether the import was atomic; that is, whether a successful
import imported all data or a failed import imported none. This is more
in line with what we can guarantee about imports today.

Fixes: cockroachdb#102839

Release note: None
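A hedged sketch of the atomicity condition described above (the table name, row count, and helper shape are illustrative; this is not the actual test code):

```go
package importer

import (
	gosql "database/sql"
	"testing"
)

// checkImportAtomic asserts all-or-nothing behavior: a successful IMPORT must
// have loaded every row, and a failed IMPORT must have left none behind.
func checkImportAtomic(t *testing.T, db *gosql.DB, importStmt string, expectedRows int) {
	_, importErr := db.Exec(importStmt)

	var got int
	if err := db.QueryRow(`SELECT count(*) FROM t`).Scan(&got); err != nil {
		t.Fatal(err)
	}
	switch {
	case importErr == nil && got != expectedRows:
		t.Fatalf("import succeeded but loaded %d of %d rows", got, expectedRows)
	case importErr != nil && got != 0:
		t.Fatalf("import failed (%v) but left %d rows behind", importErr, got)
	}
}
```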
An issue in roachtest (cockroachdb#108530) prevents clean test
termination when calling Wait() on a test monitor
that did not have at least one task started.

This caused the `cdc/kafka-oauth` test to hang.
Add a '30s' duration to the tpcc task to work around
this problem.

Fixes cockroachdb#108507

Release note: None
…108626

release-23.1.9-rc: importer: only check import *atomicity* in TestImportWorkerFailure
Previously, the `CheckKeyCountIncludingTombstoned` call in
TestDropTableWhileUpgradingFormat flaked with a key count
mismatch error (expected: 100, actual: 0). This was odd
because this key counting failure was directly after rows
were committed. To address this, this patch adds a retry
for the `CheckKeyCountIncludingTombstoned` logic in case
this is due to a race condition.

Epic: none
Fixes: cockroachdb#108340

Release note (sql change): deflake TestDropTableWhileUpgradingFormat
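The retry shape described above might look roughly like this Go sketch (it assumes the repo's `testutils.SucceedsSoon` helper; the wrapper function and count callback are hypothetical):

```go
package sqltests

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/errors"
)

// retryKeyCount re-runs the key count until it matches the expectation,
// absorbing transient races instead of failing on the first mismatch.
func retryKeyCount(t *testing.T, countKeys func() int, want int) {
	testutils.SucceedsSoon(t, func() error {
		if got := countKeys(); got != want {
			return errors.Errorf("expected %d keys, found %d", want, got)
		}
		return nil
	})
}
```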
…1.9-rc-backport-kafka-oauth

release 23.1.9 rc: roachtest: Ensure tpcc workloads runs for a bit
…eached

This commit introduces the cluster setting `sql.stats.limit_table_size.enabled`,
which controls whether or not we enforce the row limit set by `sql.stats.persisted_rows.max`
in the system.{statement,transaction} statistics tables.
Previously, when the sql stats tables reached the row limit set by
`sql.stats.persisted_rows.max`, we disabled flushing any more stats to the tables.
This restriction involved issuing a query during the flush process to check
the number of rows currently in the tables. This query can often contend with
the compaction job queries, so we should allow a method to bypass this check
to alleviate the contention.

Furthermore, in some cases we'd like to allow the system tables to grow in order
to avoid gaps in sql stats observability. In these cases we can attempt to
modify the compaction settings to allow the job to catch up to the target number
of rows.

Fixes: cockroachdb#108771

Release note (sql change):
This commit introduces the cluster setting `sql.stats.limit_table_size.enabled`,
which controls whether or not we enforce the row limit set by `sql.stats.persisted_rows.max`
in the system.{statement,transaction} statistics tables.
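A hedged example of toggling the setting from a Go client, assuming the semantics described above (disabling it skips the row-limit check during flush); the connection string is a placeholder:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder DSN; point this at your own cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Stop enforcing the row limit so stats flushes skip the contended
	// row-count check described in the commit message.
	if _, err := db.Exec(
		`SET CLUSTER SETTING sql.stats.limit_table_size.enabled = false`,
	); err != nil {
		log.Fatal(err)
	}
}
```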
This change adds option env vars which can be passed to `tests.Chaos`
that are applied to nodes that are restarted.

Release note: None
Epic: None
Previously, this roachtest could fail due to slow retries causing the
changefeed to make no progress. See cockroachdb#108896 (comment)
for more info.

This test sets `COCKROACH_CHANGEFEED_TESTING_FAST_RETRY` to avoid slow
retries, but this environment variable is unset when nodes restart. This
change ensures it is preserved.

Release note: None
Epic: None
Informs: cockroachdb#108896
…108835

release-23.1.9-rc: persistedsqlstats: add cluster setting to enable flush on row limit r…
Both of these tests are being worked on upstream to resolve the
flakiness. Since CockroachDB itself is not the cause of the flakes, we
ignore them so that these tests don't create unnecessary issues.

Release note: None
…ort-release-23.1.9-rc-108939

release-23.1.9-rc: roachtest: ignore upstream flakes in activerecord and sequelize
CREATE USER and ALTER USER create schema change jobs to bump
the descriptor version on the users and role_options tables in order
to refresh the authentication cache.

Adding these users in a single transaction means we only need to wait
for 3 such jobs rather than 3000 jobs.

This is particularly useful since we occasionally observe massive
slowdowns in processing these jobs. While we should get to the bottom of
such slowdowns, let's not slow down CI.

Supersedes cockroachdb#81276

Epic: none

Release note: None
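A minimal Go sketch of the single-transaction approach (the function and quoting are illustrative; user names are assumed to be simple identifiers):

```go
package sqltests

import (
	gosql "database/sql"
	"fmt"
)

// createUsersInOneTxn issues every CREATE USER inside one transaction, so the
// descriptor versions on system.users and system.role_options are bumped once
// rather than once per user, and only a handful of schema change jobs run.
func createUsersInOneTxn(db *gosql.DB, users []string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	for _, u := range users {
		// Assumes simple identifiers; real code should validate/quote names.
		if _, err := tx.Exec(fmt.Sprintf("CREATE USER %q", u)); err != nil {
			_ = tx.Rollback()
			return err
		}
	}
	return tx.Commit()
}
```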
In recent CI runs, we've been seeing TestFullClusterBackup fail with:

    full_cluster_backup_restore_test.go:143: error executing 'COMMIT':
    pq: restart transaction: TransactionRetryWithProtoRefreshError:
    TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND): "sql
    txn" meta={id=ec1de683 key=/Table/4/1/"maxroach0"/0
    iso=Serializable pri=0.03058324 epo=0 ts=1686254742.907452755,1
    min=1686254688.329813751,0 seq=8023} lock=true stat=ABORTED
    rts=1686254742.907452755,1 wto=false gul=1686254688.829813751,0
    int=2006

This looks to be the result of a recent change to issue all of the
CREATE USER statements in a single transaction. Since explicit
transactions are only automatically retried in specific circumstances,
we now have to deal with the retry error.

Here, we introduce a small helper that retries anything that looks
like a restart error.

Note that I haven't been able to reproduce the above locally as the
test is very heavyweight and difficult to stress.

Epic: none

Release note: None
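The "small helper" described above could be sketched like this (the substring check is illustrative; the real helper might inspect SQLSTATE 40001 instead):

```go
package sqltests

import "strings"

// runWithRetriableTxnErrors retries fn whenever its error looks like a
// transaction restart, and returns any other error unchanged.
func runWithRetriableTxnErrors(fn func() error) error {
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "restart transaction") {
			return err
		}
		// Retryable serialization failure: run fn again.
	}
}
```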
In CI we've seen this test fail with:

    backup_test.go:670: error scanning '&{<nil> 0xc019c5b580}': pq:
    restart transaction: TransactionRetryWithProtoRefreshError:
    TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed
    preemptive refresh): "sql txn" meta={id=77054b51 key=/Table/121/1
    pri=0.03587869 epo=0 ts=1689916873.568949604,1
    min=1689916873.436580640,0 seq=1000} lock=true stat=PENDING
    rts=1689916873.436580640,0 wto=false gul=1689916873.936580640,0

The `RETURNING` clauses on these two `UPDATE` statements prevent
automatic transaction retries.

Here we wrap the queries in an explicit transaction with a retry loop,
which should prevent the test failure, assuming that the update
isn't so contended that it never completes.

I am unable to stress this enough locally to reproduce this error.

Probably Fixes cockroachdb#107330

Epic: none

Release note: None
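One common shape for such a retry loop, sketched with the cockroach-go client helper (this assumes `crdb.ExecuteTx` from `github.com/cockroachdb/cockroach-go/v2/crdb`; the UPDATE statement is a stand-in, not the test's actual query):

```go
package sqltests

import (
	"context"
	gosql "database/sql"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
)

// updateWithRetry wraps a contended UPDATE ... RETURNING in an explicit
// transaction that the helper retries automatically on serialization errors.
func updateWithRetry(ctx context.Context, db *gosql.DB) (int64, error) {
	var k int64
	err := crdb.ExecuteTx(ctx, db, nil /* txopts */, func(tx *gosql.Tx) error {
		// Illustrative statement standing in for the test's UPDATEs.
		return tx.QueryRowContext(ctx,
			`UPDATE t SET v = v + 1 WHERE k = 1 RETURNING k`).Scan(&k)
	})
	return k, err
}
```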
rafiss and others added 11 commits August 31, 2023 10:57
…09517

release-23.1.9-rc: sql: avoid stampede on synthetic privilege cache in vtable population
…ort-release-23.1.9-rc-109713

release-23.1.9-rc: backupccl: avoid creating spans that start or end on a .Next() key
Restore may use unsafe keys as split points, which may cause unsafe splits
between column families, which may cause SQL to fail when reading the row or,
worse, return wrong results.

This commit avoids splitting on keys that might be unsafe.

See the issue for more info.

Epic: none
Informs: cockroachdb#109483

Release note: None.
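For context, a hedged sketch of the kind of guard involved (the `chooseSplitKey` helper is hypothetical; it assumes the repo's `keys.EnsureSafeSplitKey`, which truncates a key to its row prefix, though this commit message does not say exactly how the check is implemented):

```go
package backupsketch

import (
	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// chooseSplitKey only returns a split point if the candidate can be
// normalized to a safe row boundary; otherwise the split is skipped, since
// splitting between column families could break SQL reads of that row.
func chooseSplitKey(candidate roachpb.Key) (roachpb.Key, bool) {
	safe, err := keys.EnsureSafeSplitKey(candidate)
	if err != nil || len(safe) == 0 {
		return nil, false
	}
	return safe, true
}
```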
…ort-release-23.1.9-rc-109378

release-23.1.9-rc: backupccl: avoid splitting if the split point might be unsafe
…blathers/backport-release-23.1.9-rc-109378

Revert "release-23.1.9-rc: backupccl: avoid splitting if the split point might be unsafe"
Node 16 is no longer supported, so installing it fails. Now we
update to a supported version.

Release note: None
This is required to make a test pass.

Release note: None
…ort-release-23.1.9-rc-109896

release-23.1.9-rc: roachtest: update version of nodejs used by ORM tests
…ort-release-23.1.9-rc-109750

release-23.1.9-rc: backupccl: during restore, do not .Next() any keys in makeSimpleImportSpans
@dt requested a review from rail September 8, 2023 12:57
@blathers-crl (bot) commented Sep 8, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl (bot) added the backport label (PRs that are backports to older release branches) Sep 8, 2023
@cockroach-teamcity (Member) commented:

This change is Reviewable

@dt requested a review from celiala September 8, 2023 12:57
@dt removed the backport label (PRs that are backports to older release branches) Sep 8, 2023
@celiala (Collaborator) left a comment

thank you!
