Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: disk-stalled/wal-failover/among-stores failed #129922

Open
cockroach-teamcity opened this issue Aug 30, 2024 · 11 comments
Open

roachtest: disk-stalled/wal-failover/among-stores failed #129922

cockroach-teamcity opened this issue Aug 30, 2024 · 11 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-storage Storage Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Aug 30, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ dafb6dd507b38fb3d6eb8b7e2493c7b8abed34d2:

(disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.281735339s at 2024-08-30T10:25:00Z
(cluster.go:2436).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-41774

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team labels Aug 30, 2024
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Aug 30, 2024
@github-project-automation github-project-automation bot moved this to Incoming in Storage Aug 30, 2024
@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 4142920c2d5c50c0520c124764aeeda94ba043ae:

(disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.326598366s at 2024-09-03T11:14:00Z
(cluster.go:2444).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@sumeerbhola sumeerbhola self-assigned this Sep 3, 2024
@sumeerbhola sumeerbhola removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 3, 2024
@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ fa9c0528fc0d06be1b4cfc534ec0501448111fbe:

(disk_stall.go:159).runDiskStalledWALFailover: process exited unexectedly
(cluster.go:2451).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Sep 11, 2024

The second failure #129922 (comment) is a test flake due to injecting too long a stall. The test attempts to inject a 30s stall, and a 60s stall would result in a fatal error in the node (COCKROACH_LOG_MAX_SYNC_DURATION is set to 60s). But we see the test injecting a longer stall from 11:17:45 to 11:19:02:
2024/09/07 11:16:50 disk_stall.go:126: test status: pausing 54.985888517s before next simulated disk stall on n1
2024/09/07 11:17:45 cluster.go:2471: running cmd sudo dmsetup suspend --nofl... on nodes [:1]
2024/09/07 11:17:45 cluster.go:2473: details in run_111745.066909574_n1_sudo-dmsetup-suspend.log
2024/09/07 11:19:02 cluster.go:2471: running cmd sudo dmsetup resume data1 on nodes [:1]

And n1 dies due to this stall:
F240907 11:18:46.126128 989637 storage/pebble.go:1530 â‹® [n1,s1,pebble] 1727 disk stall detected: disk slowness detected: syncdata on file 008404.log has been ongoing for 60.2s

@sumeerbhola
Copy link
Collaborator

In the first failure n1 loses leases, has no disk reads, has slot exhaustion.

failure:
2024/08/30 11:09:40 test_impl.go:423: test failure #1: full stack retained in failure_1.log: (disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.281735339s at 2024-08-30T10:25:00Z

corresponding stall:
2024/08/30 10:24:39 cluster.go:2456: running cmd sudo dmsetup suspend --nofl... on nodes [:1]
2024/08/30 10:24:39 cluster.go:2458: details in run_102439.469185248_n1_sudo-dmsetup-suspend.log
2024/08/30 10:25:10 cluster.go:2456: running cmd sudo dmsetup resume data1 on nodes [:1]
2024/08/30 10:25:10 cluster.go:2458: details in run_102510.223616708_n1_sudo-dmsetup-resume-.log

This is similar to the failure in #124399 (comment)

One thing to note is that multiple stalls have a p100 of 10+s. The failure happens due to a stall where lower percentiles are also slow. That suggests that our disk read bytes (which are always 0) are not telling the whole story of what gets stuck, since if there was nothing getting stuck, even the p100 would consistently stay low.
Screenshot 2024-09-11 at 3 24 05 PM

@sumeerbhola
Copy link
Collaborator

p99.99 for Raft logcommit is also 10+s during the stall that caused the failure. But the write_and_sync latency for the WAL writer has a p100 ~150ms (due to WAL failover). Which suggests some code above the WAL writer (in Pebble or CockroachDB) is observing the stall, and it isn't necessarily reads (since Raft logcommit does not do reads).
Screenshot 2024-09-13 at 8 14 43 AM

Screenshot 2024-09-13 at 8 29 48 AM

@cockroach-teamcity
Copy link
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 833dadd212fa4b12b1442ae8e00e85ee80a8cdce:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=true
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

jbowens added a commit to jbowens/cockroach that referenced this issue Oct 21, 2024
Update the default logging configuration used for roachprod clusters to disable
auditable logs on logs going to file sinks. Some roachtests use the
buffered:true configuration to withstand disk stall events. This setting is
incompatible with auditable logs on file sinks and recently introduced
validation (cockroachdb#132742) prohibits the settings from being used together.

Release note: none
Informs cockroachdb#129922.
Informs cockroachdb#132988.
Epic: none
craig bot pushed a commit that referenced this issue Oct 21, 2024
132916: kvserver: clear rac2 token metrics prior to integration testing r=sumeerbhola a=kvoli

`TestFlowControl.*V2` tests assert on exact counters. This can be problematic if benign deltas occur while setting up the test, such a send queue forming when adding a new learner, but being quickly resolved.

Clear the token metrics prior to commencing these tests, in order to prevent flakes that result from such deltas in setup.

Fixes: #132642
Release note: None

133089: roachprod: update default CockroachDB logging configuration r=dhartunian a=jbowens

Update the default logging configuration used for roachprod clusters to disable auditable logs on logs going to file sinks. Some roachtests use the buffered:true configuration to withstand disk stall events. This setting is incompatible with auditable logs on file sinks and recently introduced validation (#132742) prohibits the settings from being used together.

Release note: none
Informs #129922.
Informs #132988.
Epic: none

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Oct 21, 2024
Update the default logging configuration used for roachprod clusters to disable
auditable logs on logs going to file sinks. Some roachtests use the
buffered:true configuration to withstand disk stall events. This setting is
incompatible with auditable logs on file sinks and recently introduced
validation (#132742) prohibits the settings from being used together.

Release note: none
Informs #129922.
Informs #132988.
Epic: none
@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 1e5b3c212b45419c960038718c48a5dd75a111a0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 787f2e3fe5f73b33fcd65485908cbb71e0991222:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

jbowens added a commit to jbowens/cockroach that referenced this issue Oct 23, 2024
In roachprod clusters, default to using buffering in file sinks. This is
required by a subsequent change that will default to using WAL failover in
roachprod clusters.

Informs cockroachdb#133248
Informs cockroachdb#129922
Epic: CRDB-37534
Release note: none
craig bot pushed a commit that referenced this issue Oct 23, 2024
133130: sqlccl: deflake TestExplainGist when run with concurrent ALTER PK r=rafiss,michae2 a=spilchen

TestExplainGist occasionally fails when a query using a secondary index tries to fetch a column not included in that index (see issue #130282). This change doesn’t address the root cause, but instead ignores the error when it occurs. I've also created a more reliable reproducer in the TestDMLInjectionTest, which we can use to validate the eventual fix (#133129).

Epic: none
Closes #130282
Release note: none

133256: roachprod: default to buffering file sinks in roachprod r=jbowens a=jbowens

In roachprod clusters, default to using buffering in file sinks. This is required by a subsequent change that will default to using WAL failover in roachprod clusters.

Informs #133248
Informs #129922
Epic: CRDB-37534
Release note: none

Co-authored-by: Matt Spilchen <matt.spilchen@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Oct 23, 2024
In roachprod clusters, default to using buffering in file sinks. This is
required by a subsequent change that will default to using WAL failover in
roachprod clusters.

Informs #133248
Informs #129922
Epic: CRDB-37534
Release note: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-storage Storage Team
Projects
Status: Tests (failures, skipped, flakes)
Development

No branches or pull requests

3 participants