Description
Packages
Scylla version: 6.1.0~dev-20240619.7567b87e72cf
with build-id f0a41eece67ec7a8d64b047e743f0396d9f23463
Kernel Version: 5.15.0-1063-aws
Issue description
- This issue is a regression.
- It is unknown if this issue is a regression.
Running the disrupt_bootstrap_streaming_error
nemesis we got following failure:
2024-06-22 06:17:17,314 f:nemesis.py l:1835 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method bootstrap_streaming_error
2024-06-22 06:17:17,315 f:nemesis.py l:406 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-parallel-topology-schema--db-node-9ab27f39-10 [34.249.118.64 | 10.4.10.130] with running nemesis: BootstrapStreamingError
...
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Datacenter: eu-west
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: ===================
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Status=Up/Down
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: |/ State=Normal/Leaving/Joining/Moving
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: -- Address Load Tokens Owns Host ID Rack
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.130 ? 256 ? 1dd0f17d-bc11-44b9-8497-c56a901bc71d 1c
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.56 ? 256 ? 630790e3-a8e6-4b98-97e8-ea4a94de5dc2 1c
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.6 ? 256 ? 5ce7d885-dc11-414e-9162-bcbeaa5dd8e8 1c
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.11.228 490.32 MB 256 ? a9f95872-0677-4071-9fba-9675e404c4d0 1c
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.8.225 ? 256 ? f60d7cee-5b6b-4617-aedf-0a3d341425b2 1c
2024-06-22 06:23:09,349 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.9.188 ? 256 ? 473a2961-0169-48b2-9dfb-8bf06a6f098e 1c
...
2024-06-22 06:23:36,934 f:nemesis.py l:4978 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Wait 5 minutes with new topology
...
2024-06-22 06:28:36,983 f:nemesis.py l:4981 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Decommission added new node
2024-06-22 06:28:37,829 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-06-22 06:28:37.827: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=b508c05a-f37e-4295-9b1a-fcc5d203e2ca: nodetool_command=decommission node=longevity-parallel-topology-schema--db-node-9ab27f39-11
...
2024-06-22 06:28:37,827 f:remote_base.py l:521 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Running command "/usr/bin/nodetool decommission "...
...
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <bound method Nemesis.disrupt_bootstrap_streaming_error of <sdcm.nemesis.SisyphusMonkey object at 0x7f1901961de0>>
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > Traceback (most recent call last):
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > result = connection.run(**command_kwargs)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 650, in _complete_run
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > raise exception
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 77, in run
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > self._read_output(self._session, self._channel, self._timeout,
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 105, in _read_output
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > stdout_size, stdout_chunk = channel.read()
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "ssh2/channel.pyx", line 116, in ssh2.channel.Channel.read
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "ssh2/channel.pyx", line 144, in ssh2.channel.Channel.read_ex
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "ssh2/utils.pyx", line 214, in ssh2.utils.handle_error_codes
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > ssh2.exceptions.SocketRecvError
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR >
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > During handling of the above exception, another exception occurred:
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR >
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > Traceback (most recent call last):
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5113, in wrapper
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > result = method(*args[1:], **kwargs)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4982, in disrupt_bootstrap_streaming_error
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > self.cluster.decommission(new_node, timeout=7200)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4959, in decommission
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > node.run_nodetool("decommission", timeout=timeout, retry=0)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2618, in run_nodetool
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > result = _run()
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 65, in inner
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > return func(*args, **kwargs)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > if self._run_on_retryable_exception(exc, new_session):
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > raise RetryableNetworkException(str(exc), original=exc)
2024-06-22 06:29:31,655 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > sdcm.remote.base.RetryableNetworkException
It led to the situation that we decomissined node from the DB cluster, but didn't remove it de-facto because following after the point of failure, valuable verify_decommission
method was not called:
def decommission(self, node: BaseNode, timeout: int | float = None):
with adaptive_timeout(operation=Operations.DECOMMISSION, node=node):
node.run_nodetool("decommission", timeout=timeout, retry=0)
self.verify_decommission(node)
It, on it's turn, led to the situation that we had a former-db-node (node-11
) which is capable to run nodetool
commands.
So, next nemesis disrupt_terminate_and_replace_node
picked that decommissioned node (node-11
) as a verification one on it's nodetool status
check.
Not being part of the DB cluster anymore, it returned obsolete OK result after the nemesis terminated the node-9
:
2024-06-22 06:47:50,206 f:remote_base.py l:521 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Running command "/usr/bin/nodetool status "...
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Datacenter: eu-west
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: ===================
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Status=Up/Down
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: |/ State=Normal/Leaving/Joining/Moving
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: -- Address Load Tokens Owns Host ID Rack
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.130 23.64 GB 256 ? 1dd0f17d-bc11-44b9-8497-c56a901bc71d 1c
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.56 24.96 GB 256 ? 630790e3-a8e6-4b98-97e8-ea4a94de5dc2 1c
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.6 24.12 GB 256 ? 5ce7d885-dc11-414e-9162-bcbeaa5dd8e8 1c
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.8.225 20.84 GB 256 ? f60d7cee-5b6b-4617-aedf-0a3d341425b2 1c
2024-06-22 06:47:50,428 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.9.188 24.95 GB 256 ? 473a2961-0169-48b2-9dfb-8bf06a6f098e 1c
Which led to the one more error caused by the wrong nodetool status
result:
2024-06-22 06:47:50,929 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > Traceback (most recent call last):
2024-06-22 06:47:50,929 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5113, in wrapper
2024-06-22 06:47:50,929 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > result = method(*args[1:], **kwargs)
2024-06-22 06:47:50,929 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1542, in disrupt_terminate_and_replace_node
2024-06-22 06:47:50,929 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > assert get_node_state(old_node_ip) == "DN", "Removed node state should be DN"
2024-06-22 06:47:50,929 f:nemesis.py l:5137 c:sdcm.nemesis p:ERROR > AssertionError: Removed node state should be DN
2024-06-22 06:47:50,929 f:nemesis.py l:5149 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: TerminateAndReplaceNode duration -> 498 s
So, it contains multiple problems:
- SCT doesn't do required stuff if
nodetool decommission
command fails like here - SCT used decommissined node for running
nodetool status
command - SCT doesn't check the
nodetool status
result for correctness (node num, node matching ... )
Impact
Redundant failures which could be avoided.
How frequently does it reproduce?
Unknown.
Installation details
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- longevity-parallel-topology-schema--db-node-9ab27f39-9 (54.229.12.207 | 10.4.9.188) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-8 (52.208.78.175 | 10.4.8.225) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-7 (54.73.47.64 | 10.4.11.129) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-6 (52.18.221.226 | 10.4.8.164) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-5 (52.213.136.109 | 10.4.10.6) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-4 (54.75.205.14 | 10.4.8.78) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-3 (52.30.77.69 | 10.4.10.3) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-2 (79.125.60.207 | 10.4.11.192) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-12 (54.72.238.52 | 10.4.8.116) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-11 (99.80.203.111 | 10.4.11.228) (shards: -1)
- longevity-parallel-topology-schema--db-node-9ab27f39-10 (34.249.118.64 | 10.4.10.130) (shards: 7)
- longevity-parallel-topology-schema--db-node-9ab27f39-1 (34.251.143.100 | 10.4.10.56) (shards: 7)
OS / Image: ami-038debe0a5d79391a
(aws: undefined_region)
Test: longevity-schema-topology-changes-12h-test
Test id: 9ab27f39-b614-45d6-8649-94c33fdf70ca
Test name: scylla-master/tier1/longevity-schema-topology-changes-12h-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 9ab27f39-b614-45d6-8649-94c33fdf70ca
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 9ab27f39-b614-45d6-8649-94c33fdf70ca
Logs:
- db-cluster-9ab27f39.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9ab27f39-b614-45d6-8649-94c33fdf70ca/20240622_071219/db-cluster-9ab27f39.tar.gz
- sct-runner-events-9ab27f39.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9ab27f39-b614-45d6-8649-94c33fdf70ca/20240622_071219/sct-runner-events-9ab27f39.tar.gz
- sct-9ab27f39.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9ab27f39-b614-45d6-8649-94c33fdf70ca/20240622_071219/sct-9ab27f39.log.tar.gz
- loader-set-9ab27f39.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9ab27f39-b614-45d6-8649-94c33fdf70ca/20240622_071219/loader-set-9ab27f39.tar.gz
- monitor-set-9ab27f39.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9ab27f39-b614-45d6-8649-94c33fdf70ca/20240622_071219/monitor-set-9ab27f39.tar.gz