release-23.1: merge v23.1.9 tag into release branch #110246
Merged
Conversation
Fixes cockroachdb#107352. We allow rollbacks to be sent on a txn at any time. Before this change, a rollback that encountered a COMMITTED txn would loudly log an error. This commit removes that logging in such cases, retaining it only for genuinely unexpected ones. Release note: None
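A minimal sketch of the shape of that change, with illustrative names rather than the actual kvcoord code: the COMMITTED case becomes a quiet no-op while genuinely unexpected states keep the loud log.

```go
package rollback

import "log"

type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

// handleRollbackStatus mirrors the idea of the fix: a rollback that
// finds an already-COMMITTED txn is an expected, legal sequence of
// events and no longer logs an error.
func handleRollbackStatus(status txnStatus) {
	switch status {
	case committed:
		// Expected: clients may send a rollback at any time, including
		// after the commit already succeeded. Stay quiet.
	case pending, aborted:
		// Normal rollback paths; nothing alarming here either.
	default:
		// Only genuinely unexpected states still log loudly.
		log.Printf("unexpected txn status %d during rollback", status)
	}
}
```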
As observed in cockroachdb#108348, a test failed because it timed out reading a row. Yet, this flake could not be reproduced in over 50k runs. Bump timeout period to make this flake even less likely. Fixes cockroachdb#108348 Release note: None
The lease preferences roachtest could occasionally fail, if the liveness leaseholder were on a stopped node. We should address this issue, for now, pin the liveness lease to a live node to prevent flakes. Informs: cockroachdb#108512 Resolves: cockroachdb#108425 Release note: None
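For context, pinning a lease is ordinarily done through a zone config lease preference. A hedged sketch of what that looks like via SQL (the locality tier `node=1` is illustrative; the test keys off whatever localities its cluster was started with):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Prefer the liveness range's lease on a node the test never stops,
	// so a stopped node can't strand the liveness lease.
	if _, err := db.Exec(
		`ALTER RANGE liveness CONFIGURE ZONE USING
		   lease_preferences = '[[+node=1]]'`); err != nil {
		log.Fatal(err)
	}
}
```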
This commit removes the datadriven.RunTest call from TestUnavailableZip, as this unit test is meant to test the ability of debug zip to output while nodes are unavailable, and not meant to assert the entire output is deterministic. The commit also refactors the test cases to test output for a cluster with a partially unavailable node and a fully unavailable node. In the first case, we block replica requests from the node to table data, simulating SQL unavailability. In the second case, we block all replica requests from the node, extending the unavailability to liveness and status endpoints. To ensure the test doesn't take too long, we reduce the debug zip timeout from 0.5s to 0.25s. Fixes cockroachdb#106961. Release note: None
….9-rc-107199 release-23.1.9-rc: cli: simplify output assertion in TestUnavailableZip
`TestReplicateQueueLeasePreferencePurgatoryError` waits for every lease to be on one store before changing the lease preference and asserting that the leaseholders end up in purgatory. The test could occasionally time out before every lease was on the initial store. The test forced replicate queue processing to speed this up, but only on one store, even though all stores may need to transfer leases. Force a replication scan and process on every store, instead of just the initially preferred node. This speeds up this stage of the test. Fixes: cockroachdb#108624 Release note: None
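A sketch of the change, assuming the kvserver test-cluster helpers as they commonly appear in the repo's tests (treat exact names and signatures as approximate):

```go
package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

// forceReplicationOnAllStores scans and processes the replicate queue on
// every store, not just the initially preferred one, so whichever store
// currently holds a lease gets a chance to transfer it promptly.
func forceReplicationOnAllStores(t *testing.T, tc *testcluster.TestCluster) {
	for i := 0; i < tc.NumServers(); i++ {
		store := tc.GetFirstStoreFromServer(t, i)
		if err := store.ForceReplicationScanAndProcess(); err != nil {
			t.Fatal(err)
		}
	}
}
```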
Previously, the TestSQLStatsRegions test failed for secondary tenants because of the splits and lease transfers (for lease preferences): we had to wait for the test table's ranges to split for each regional row, and for each range's lease to move to the row's primary region. With this change, we check that at least one leaseholder is present in every region and, if not, force-enqueue a split for all ranges of the test table. Release note: None
…108634 release-23.1.9-rc: sql: fix TestSQLStatsRegions test
protobuf.js sets the default value of Long types to 0, instead of Long.fromInt(0). This is a workaround until we have a more comprehensive solution. See https://github.com/cockroachdb/cockroach/blob/165960a9e0588e7046da44a4a4a45b78b9d4541f/pkg/ui/workspaces/cluster-ui/src/util/fixLong.ts#L13 Resolves cockroachdb#107905 Release note (bug fix): Fixes errors on the Sessions page UI when a session's memory usage is zero bytes.
…onumber-fix ui: fix occurrence of x.toNumber is not a function
logtags.AddTag calls context.WithValue, which creates a new context. Further, we were re-assigning this new context to the variable holding the original context passed to the Resume function. As a result, we were ending up with a constantly growing chain of contexts, which we believe is the cause of the slowly growing CPU usage. Release note (bug fix): Fixed a bug that could cause CPU usage to increase over time.
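A self-contained illustration of the pattern (standard library only; the real code goes through logtags.AddTag, which wraps context.WithValue): re-assigning the derived context to a long-lived variable grows the chain without bound, while keeping derivations local keeps the depth constant.

```go
package main

import (
	"context"
	"fmt"
)

type tagKey int

func main() {
	// Buggy shape: each iteration (each "Resume" call) writes the derived
	// context back over the captured one, so the chain keeps deepening and
	// every Value lookup walks the whole chain.
	ctx := context.Background()
	for i := 0; i < 3; i++ {
		ctx = context.WithValue(ctx, tagKey(i), i) // depth is now i+1
	}
	fmt.Println(ctx.Value(tagKey(0))) // walks all three links

	// Fixed shape: derive from the original each time and keep the result
	// local to the call, so the depth stays at one.
	base := context.Background()
	for i := 0; i < 3; i++ {
		local := context.WithValue(base, tagKey(i), i)
		_ = local // used only within this iteration
	}
}
```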
…ort-release-23.1.9-rc-108703
Previously, the cross-joiner wouldn't advance its internal state when its left input projected no columns. This would result in an infinite loop as the right rows were repeatedly emitted. This patch advances the state when there are no left columns as if values from the left side were emitted. Fixes cockroachdb#105882 Release note (bug fix): Fixed a bug introduced in v22.1 that could cause a join to infinite-loop in rare cases when (1) the join filter is not an equality and (2) no columns from the left input are returned.
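A schematic model of the bug, not the vectorized engine's actual code: if the emit loop only advances its left cursor while copying left columns, a zero-column left input makes no progress and the right rows repeat forever; the fix advances the cursor unconditionally.

```go
package main

import "fmt"

// crossJoinEmit models the fixed loop: the left cursor advances after
// the right rows are emitted for a left row, regardless of whether any
// left columns exist to copy.
func crossJoinEmit(leftRows int, right []string) []string {
	var out []string
	for leftIdx := 0; leftIdx < leftRows; leftIdx++ {
		// Emit all right rows for this (possibly column-less) left row.
		out = append(out, right...)
		// The buggy version effectively skipped advancing when there were
		// no left columns to copy, looping on the same left row forever.
	}
	return out
}

func main() {
	fmt.Println(crossJoinEmit(2, []string{"a", "b"})) // 4 rows, then done
}
```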
Replicas were enqueued into the replicate queue upon the store receiving a span config update that could affect the replica. The replicate queue's shouldQueue is relatively expensive compared to other queues. Introduce the cluster setting kv.enqueue_in_replicate_queue_on_span_config_update.enabled, which enables queuing replicas on span config updates when set to true and disables it when set to false. By default, this setting is false. Resolves: cockroachdb#108724 Release note (ops change): Introduce the kv.enqueue_in_replicate_queue_on_span_config_update.enabled cluster setting. When set to true, stores in the cluster will enqueue replicas for replication changes upon receiving config updates that could affect the replica. This setting is off by default. Enabling it speeds up how quickly config-triggered replication changes begin, but adds CPU overhead that scales with the number of leaseholders.
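Roughly how such a guard is declared in the CockroachDB settings package (a sketch; treat the exact signature as approximate for this release):

```go
package kvserver

import "github.com/cockroachdb/cockroach/pkg/settings"

// enqueueOnSpanConfigUpdateEnabled gates the (comparatively expensive)
// enqueueing of replicas into the replicate queue on span config updates.
var enqueueOnSpanConfigUpdateEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.enqueue_in_replicate_queue_on_span_config_update.enabled",
	"when enabled, replicas are enqueued into the replicate queue on "+
		"span config updates that could affect them",
	false, // off by default: shouldQueue is costly at scale
)
```

Call sites would then consult `enqueueOnSpanConfigUpdateEnabled.Get(&st.SV)` before enqueueing.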
…rc-105931 release-23.1.9-rc: colexec: don't infinite loop cross-join with zero-column left input
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flaky. Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient-failure scenario, and is tracked separately in cockroachdb#108547. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. Fixes: cockroachdb#102839 Release note: None
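The weakened assertion reduces to an all-or-nothing row count; a sketch with database/sql (the real test uses the repo's harness, and the table name here is illustrative):

```go
package importer

import (
	"database/sql"
	"testing"
)

// checkImportAtomic asserts atomicity only: a successful import must
// have written every row, and a failed import must have written none.
func checkImportAtomic(t *testing.T, db *sql.DB, importErr error, wantRows int) {
	var got int
	if err := db.QueryRow(`SELECT count(*) FROM t`).Scan(&got); err != nil {
		t.Fatal(err)
	}
	switch {
	case importErr == nil && got != wantRows:
		t.Fatalf("import succeeded but only %d of %d rows are present", got, wantRows)
	case importErr != nil && got != 0:
		t.Fatalf("import failed but left %d rows behind", got)
	}
}
```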
An issue in roachtest (cockroachdb#108530) prevents clean test termination when calling Wait() on a test monitor that did not have at least one task started. This caused the `cdc/kafka-oauth` test to hang. Add a 30s duration to the tpcc task to work around this problem. Fixes cockroachdb#108507 Release note: None
…108626 release-23.1.9-rc: importer: only check import *atomicity* in TestImportWorkerFailure
Previously, the `CheckKeyCountIncludingTombstoned` call in TestDropTableWhileUpgradingFormat flaked with a key count mismatch error (expected: 100, actual: 0). This was odd because the key-counting failure occurred directly after the rows were committed. To address this, this patch adds a retry around the `CheckKeyCountIncludingTombstoned` logic in case the failure is due to a race condition. Epic: none Fixes: cockroachdb#108340 Release note (sql change): deflake TestDropTableWhileUpgradingFormat
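CockroachDB tests usually express this kind of retry with the repo's testutils.SucceedsSoon helper; a standard-library sketch of the same idea:

```go
package example

import (
	"testing"
	"time"
)

// retrySoon re-runs fn until it returns nil or a deadline passes,
// absorbing transient races like the one suspected in this flake.
func retrySoon(t *testing.T, fn func() error) {
	deadline := time.Now().Add(45 * time.Second)
	for {
		err := fn()
		if err == nil {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("condition never succeeded: %v", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```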
…1.9-rc-backport-kafka-oauth release 23.1.9 rc: roachtest: Ensure the tpcc workload runs for a bit
…eached This commit introduces the cluster setting `sql.stats.limit_table_size.enabled`, which controls whether we enforce the row limit set by `sql.stats.persisted_rows.max` in the system.{statement,transaction} statistics tables. Previously, when the sql stats tables reached the row limit set by `sql.stats.persisted_rows.max`, we disabled flushing any more stats to the tables. This restriction involved issuing a query during the flush process to check the number of rows currently in the tables. This query can often contend with the compaction job's queries, so we should allow a way to bypass this check to alleviate the contention. Furthermore, in some cases we'd like to allow the system tables to grow in order to avoid gaps in sql stats observability. In these cases we can attempt to modify the compaction settings to allow the job to catch up to the target number of rows. Fixes: cockroachdb#108771 Release note (sql change): This commit introduces the cluster setting `sql.stats.limit_table_size.enabled`, which controls whether we enforce the row limit set by `sql.stats.persisted_rows.max` in the system.{statement,transaction} statistics tables.
This change adds optional env vars that can be passed to `tests.Chaos` and are applied to nodes that are restarted. Release note: None Epic: None
Previously, this roachtest could fail due to slow retries causing the changefeed to make no progress. See cockroachdb#108896 (comment) for more info. This test sets `COCKROACH_CHANGEFEED_TESTING_FAST_RETRY` to avoid slow retries, but the environment variable was unset when nodes restarted. This change ensures it is preserved. Release note: None Epic: None Informs: cockroachdb#108896
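An illustrative sketch of the mechanism (the struct, field, and helper names here are hypothetical, not the roachtest API): the restart options must carry the env vars forward, or the restarted process silently reverts to defaults.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

type restartOpts struct {
	// extraEnv holds entries like "COCKROACH_CHANGEFEED_TESTING_FAST_RETRY=true".
	extraEnv []string
}

// nodeFlags is a stand-in for however the harness derives per-node flags.
func nodeFlags(node int) []string {
	return []string{"--store", fmt.Sprintf("/mnt/data%d", node)}
}

func restartNode(node int, opts restartOpts) error {
	cmd := exec.Command("cockroach", append([]string{"start"}, nodeFlags(node)...)...)
	// Without re-applying extraEnv here, the restarted process loses the
	// testing env vars the test depends on.
	cmd.Env = append(os.Environ(), opts.extraEnv...)
	return cmd.Start()
}

func main() {
	_ = restartNode(1, restartOpts{
		extraEnv: []string{"COCKROACH_CHANGEFEED_TESTING_FAST_RETRY=true"},
	})
}
```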
…108835 release-23.1.9-rc: persistedsqlstats: add cluster setting to enable flush on row limit r…
Both of these tests are being worked on upstream to resolve the flakiness. Since CockroachDB itself is not the cause of the flakes, we ignore them so that these tests don't create unnecessary issues. Release note: None
…ort-release-23.1.9-rc-108939 release-23.1.9-rc: roachtest: ignore upstream flakes in activerecord and sequelize
CREATE USER and ALTER USER create schema change jobs in order to bump the descriptor version on the users and role_options tables, which in turn refreshes the authentication cache. Adding these users in a single transaction means we only need to wait for 3 such jobs rather than 3000. This is particularly useful since we occasionally observe massive slowdowns in processing these jobs. While we should get to the bottom of those slowdowns, let's not slow down CI. Supersedes cockroachdb#81276 Epic: none Release note: None
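A sketch of the batching idea with database/sql (the real change lives in the test harness): one explicit transaction for all users means one descriptor-version bump to wait on instead of one per user.

```go
package userbatch

import (
	"database/sql"
	"fmt"

	"github.com/lib/pq"
)

// createUsersInOneTxn issues every CREATE USER inside a single explicit
// transaction, so the authentication-cache refresh jobs are paid for
// once rather than once per user.
func createUsersInOneTxn(db *sql.DB, users []string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	for _, u := range users {
		if _, err := tx.Exec(
			fmt.Sprintf("CREATE USER %s", pq.QuoteIdentifier(u))); err != nil {
			_ = tx.Rollback()
			return err
		}
	}
	return tx.Commit()
}
```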
In recent CI runs, we've been seeing TestFullClusterBackup fail with: full_cluster_backup_restore_test.go:143: error executing 'COMMIT': pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND): "sql txn" meta={id=ec1de683 key=/Table/4/1/"maxroach0"/0 iso=Serializable pri=0.03058324 epo=0 ts=1686254742.907452755,1 min=1686254688.329813751,0 seq=8023} lock=true stat=ABORTED rts=1686254742.907452755,1 wto=false gul=1686254688.829813751,0 int=2006 This looks to be the result of a recent change to issue all of the CREATE USER statements in a single transaction. Since explicit transactions are only automatically retried in specific circumstances, we now have to deal with the retry error. Here, we introduce a small helper that retries anything that looks like a restart error. Note that I haven't been able to reproduce the above locally as the test is very heavyweight and difficult to stress. Epic: none Release note: None
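A sketch of such a helper (matching on the error text is the simple route; the real helper may inspect the SQLSTATE 40001 error code instead):

```go
package example

import "strings"

// retryOnRestart re-runs op for as long as it fails with something that
// looks like a CockroachDB restart/serialization error.
func retryOnRestart(op func() error) error {
	for {
		err := op()
		if err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "restart transaction") {
			return err
		}
		// Retryable: run the whole transaction closure again.
	}
}
```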
In CI we've seen this test fail with: backup_test.go:670: error scanning '&{<nil> 0xc019c5b580}': pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): "sql txn" meta={id=77054b51 key=/Table/121/1 pri=0.03587869 epo=0 ts=1689916873.568949604,1 min=1689916873.436580640,0 seq=1000} lock=true stat=PENDING rts=1689916873.436580640,0 wto=false gul=1689916873.936580640,0 The `RETURNING` clauses on these two `UPDATE` statements prevent automatic transaction retries. Here we wrap the queries in an explicit transaction with a retry loop, which should prevent the test failure, assuming the update isn't so contended that it never completes. I am unable to stress this enough locally to reproduce this error. Probably Fixes cockroachdb#107330 Epic: none Release note: None
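For the explicit-transaction variant, CockroachDB documents a client-side retry protocol built on SAVEPOINT cockroach_restart; a hedged sketch of that loop (table and column names are illustrative; packaged implementations such as cockroach-go's crdb.ExecuteTx do the same dance):

```go
package example

import (
	"database/sql"
	"strings"
)

// updateWithRetry wraps an UPDATE ... RETURNING, which defeats automatic
// server-side retries, in the documented savepoint retry loop.
func updateWithRetry(db *sql.DB) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec("SAVEPOINT cockroach_restart"); err != nil {
		_ = tx.Rollback()
		return err
	}
	for {
		var k int
		err := tx.QueryRow(
			`UPDATE t SET v = v + 1 WHERE k = 1 RETURNING k`).Scan(&k)
		if err == nil {
			if _, err = tx.Exec("RELEASE SAVEPOINT cockroach_restart"); err == nil {
				return tx.Commit()
			}
		}
		if !strings.Contains(err.Error(), "restart transaction") {
			_ = tx.Rollback()
			return err
		}
		// Retryable: rewind to the savepoint and run the body again; the
		// transaction keeps its (bumped) priority across the retry.
		if _, rbErr := tx.Exec("ROLLBACK TO SAVEPOINT cockroach_restart"); rbErr != nil {
			_ = tx.Rollback()
			return rbErr
		}
	}
}
```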
…09517 release-23.1.9-rc: sql: avoid stampede on synthetic privilege cache in vtable population
…ort-release-23.1.9-rc-109713 release-23.1.9-rc: backupccl: avoid creating spans that start or end on a .Next() key
Restore may use unsafe keys as split points, which may cause unsafe splits between column families, which in turn may cause SQL to fail when reading the row or, worse, return wrong results. This commit avoids splitting on keys that might be unsafe. See the issue for more info. Epic: none Informs: cockroachdb#109483 Release note: None.
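The safety property comes from truncating candidate keys to row boundaries. CockroachDB exposes this as keys.EnsureSafeSplitKey; a sketch of the guard (exact call sites in backupccl differ):

```go
package example

import (
	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// safeSplitPoint truncates a candidate split key to a row boundary so a
// split can never land between two column families of the same row.
func safeSplitPoint(key roachpb.Key) (roachpb.Key, bool) {
	safe, err := keys.EnsureSafeSplitKey(key)
	if err != nil {
		// Not a decodable SQL row key; don't split here rather than risk
		// an unsafe split point.
		return nil, false
	}
	return safe, true
}
```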
…ort-release-23.1.9-rc-109378 release-23.1.9-rc: backupccl: avoid splitting if the split point might be unsafe
…int might be unsafe"
…blathers/backport-release-23.1.9-rc-109378 Revert "release-23.1.9-rc: backupccl: avoid splitting if the split point might be unsafe"
Node 16 is no longer supported, so installing it fails. Update to a supported version. Release note: None
This is required to make a test pass. Release note: None
…ort-release-23.1.9-rc-109896 release-23.1.9-rc: roachtest: update version of nodejs used by ORM tests
…ort-release-23.1.9-rc-109750 release-23.1.9-rc: backupccl: during restore, do not .Next() any keys in makeSimpleImportSpans
rail approved these changes on Sep 8, 2023
celiala approved these changes on Sep 8, 2023
thank you!
v23.1.9 was tagged from a sub-branch off of release-23.1; this merges it back into the release branch so that it appears in the lineage of subsequent commits.
Epic: none.
Release justification: noop.