roachtest: ldr/kv0/workload=both/network_partition failed #133801

Open
cockroach-teamcity opened this issue Oct 30, 2024 · 2 comments
Labels
A-disaster-recovery branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 30, 2024

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ ffe1b9fed2b7ee3b8d53d6d943038c358b8eb5a6:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/logical_data_replication.go:780
	            				pkg/cmd/roachtest/tests/logical_data_replication.go:560
	Error:      	Received unexpected error:
	            	expected DLQ to be empty, but found 25 rows
	Test:       	ldr/kv0/workload=both/network_partition
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-43763

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Oct 30, 2024
@exalate-issue-sync exalate-issue-sync bot added the P-3 Issues/test failures with no fix SLA label Nov 4, 2024
@dt dt removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Nov 4, 2024
@msbutler
Collaborator

msbutler commented Nov 5, 2024

looks like we lost quorum on the source cluster, which could certainly lead to DLQ'd data. here's one example log line that indicates this:

2.unredacted/cockroach.log:I241030 08:01:58.414817 59441 ccl/crosscluster/logical/logical_replication_writer_processor.go:938 ⋮ [T1,Vsystem,n2,f‹57de2c57›,job=1016478570046554115,distsql.gateway=2,distsql.appname=‹$ internal-resume-job-1016478570046554115›,src-node=‹2›,proc=3] 3387  DLQ'ing row update due to result is ambiguous: error=replica unavailable: (n2,s2):2 unable to serve request to r120:‹/Table/106/1/-3{561500093438972892-378859063006205056}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12, sticky=9223372036.854775807,2147483647]: closed timestamp: 1730275150.050701720,0 (2024-10-30 07:59:10); raft status: {"id":"2","term":7,"vote":"2","commit":1554,"lead":"0","leadEpoch":"0","raftState":"StatePreCandidate","applied":1554,"progress":{},"leadtransferee":"0"}: encountered poisoned latch ‹/Table/106/1/-3484778125609029896/0›@1730275154.199053622,0 [exhausted] (last error: ‹failed to send RPC›: sending to all replicas failed; last error: replica unavailable: (n3,s3):3 unable to serve request to r120:‹/Table/106/1/-3{561500093438972892-378859063006205056}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12, sticky=9223372036.854775807,2147483647]: lost quorum (down: (n1,s1):1,(n2,s2):2); closed timestamp: 1730275149.850442014,0 (2024-10-30 07:59:09); raft status: {"id":"3","term":7,"vote":"2","commit":1554,"lead":"0","leadEpoch":"0","raftState":"StatePreCandidate","applied":1554,"progress":{},"leadtransferee":"0"}: have been waiting 61.00s for slow proposal RequestLease [/Table/106/1/‹-3561500093438972892›]) (‹age limit›): ‹Row{table: 106 family: 0}{k: -3484778125609029896, v: 
'\xde0680dbddd5213cf3a0a93c35c0ed9f512c95997a7b13fdb89b4390b6bb4a6ca2053319930010e8e6ef4dce3e05e58c57e922d95e63d1b713c07a4f13a504c51029bb24d393f0c830ec9bd685a7c616776611d629555d88d700bd54217f31a9c9f0f6bd1699afe8bc83837cdff82f254b899d164ea8502d80208ef764c12976715edc8dcf2f395ad16b32371ab8327ddf585d161d05654a449996ec1e5a9765183fe769c93f2a40e36973d2de9ad8ed73c98ceb119cff487b2a083ab487ee62f4c60a4e6f88ce99ff6610a556d34f9087458a8a5a18d7560e4eeca6021ed9c37003bc330f85fd456457fb37dd423400f4e6c966a66d22dfc0ceb0131f67c1ceaebfc49ac2476f57b501a2ff60363ba665eb1fc9a885c9434424ec78c599e995b4d7f52f99a93d5d61a5c7e1748e3ec4a86899d47c541c3d4b99a4d4033962958676660b8bfb9f6c2256abc48f644ffafb7bdd926b5029507ff81696e375eb3ef739a9a728572db684ad0e4fc3cdd9b0d8cdbb520caf45ec80eb23e1964d4c7bb89c1e724268ab9b238da8d2a719d262ffc26ed83b9b049a57cc037c8d6fa749b688dfb2b4536cb68d9e50bf957ea54c982ff92782086617e15e64f737f3da519d62674c499fe44f9fc79ab6d07a58e7dd387658f5c22d1ccdcf2a17c19ad73c65c2e2f25492967663e147a646c35e04d53f9854053bb2a692f76087103b84b187b65d1186e1346c840a7f158f6da2c59d51cf794c6af88c311d1a07ba3b6a0cfc810f24f35064d84d6890478dd255f79524314fb0a4fcce198271d7f5e489529b67704f91acc63c630088141e2d95c09d41c00cff344aa3092ebcaa583de52d007b56ba678c8c1cf028344b56d0143c6a8cd223223b54e551c73dc40a0d397bb50be64e521c972bdbcdb961e67bd9a364da757cd5131fbe5b9ddcbb77abdee28bbae6416b2b633486f2a2882353851eafc75e080711db43b64af2713ec6276e3c8eec0fc031f0229a64da43ef661825da3a78b5f499cc646f2dbefe77f9f0c945b9c545872e1cfe87ff0644cff317b694c2d8fff7b9dcda0ff9cd6500dd287643bca861932f7b950f05c93e21e828f77a13a16ead8529af968c44339c7dcf968ba79f3e011a03f7f241c270da09cbf151860b693ccaa67226390e92656695c5fa515740d4ade9602514b94e1929e47d3864cd16bcb38b'}›

@msbutler
Collaborator

msbutler commented Nov 5, 2024

oh lol this test is not well set up: we randomly pick half of the crdb nodes to disconnect from all other nodes. we should instead disconnect a random pair of src and dest nodes.
https://github.com/msbutler/cockroach/blob/butler-src-schema-lock/pkg/cmd/roachtest/tests/logical_data_replication.go#L538

in this test nodes 1,3,5 were disconnected from all other nodes, so node 2 was left all alone. I am surprised this test doesn't flake more often.

msbutler added a commit to msbutler/cockroach that referenced this issue Nov 5, 2024
Previously, this test could disconnect a set of nodes large enough to make a
cluster lose quorum. With this patch, the test instead disconnects a src-dest
node pair that is replicating data.

Fixes cockroachdb#133801

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 8, 2024
Previously, this test could disconnect a set of nodes large enough to make a
cluster lose quorum. With this patch, the test instead disconnects a src-dest
node pair that is replicating data.

Fixes cockroachdb#133801

Release note: none