SCT doesn't finish decommission logic (node stays in the Python list of nodes) if the nodetool decommission command fails while the operation actually succeeds #7833

Open

Description

Packages

Scylla version: 6.1.0~dev-20240619.7567b87e72cf with build-id f0a41eece67ec7a8d64b047e743f0396d9f23463

Kernel Version: 5.15.0-1063-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

While running the disrupt_bootstrap_streaming_error nemesis we got the following failure:

2024-06-22 06:17:17,314 f:nemesis.py      l:1835 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method bootstrap_streaming_error
2024-06-22 06:17:17,315 f:nemesis.py      l:406  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-parallel-topology-schema--db-node-9ab27f39-10 [34.249.118.64 | 10.4.10.130] with running nemesis: BootstrapStreamingError
...
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Datacenter: eu-west
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: ===================
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Status=Up/Down 
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: |/ State=Normal/Leaving/Joining/Moving
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: -- Address     Load      Tokens Owns Host ID                              Rack
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.130 ?         256    ?    1dd0f17d-bc11-44b9-8497-c56a901bc71d 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.56  ?         256    ?    630790e3-a8e6-4b98-97e8-ea4a94de5dc2 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.6   ?         256    ?    5ce7d885-dc11-414e-9162-bcbeaa5dd8e8 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.11.228 490.32 MB 256    ?    a9f95872-0677-4071-9fba-9675e404c4d0 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.8.225  ?         256    ?    f60d7cee-5b6b-4617-aedf-0a3d341425b2 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.9.188  ?         256    ?    473a2961-0169-48b2-9dfb-8bf06a6f098e 1c
...
2024-06-22 06:23:36,934 f:nemesis.py      l:4978 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Wait 5 minutes with new topology
...
2024-06-22 06:28:36,983 f:nemesis.py      l:4981 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Decommission added new node
2024-06-22 06:28:37,829 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-22 06:28:37.827: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=b508c05a-f37e-4295-9b1a-fcc5d203e2ca: nodetool_command=decommission node=longevity-parallel-topology-schema--db-node-9ab27f39-11
...
2024-06-22 06:28:37,827 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Running command "/usr/bin/nodetool  decommission "...

...

2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <bound method Nemesis.disrupt_bootstrap_streaming_error of <sdcm.nemesis.SisyphusMonkey object at 0x7f1901961de0>>
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = connection.run(**command_kwargs)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 650, in _complete_run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     raise exception
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 77, in run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     self._read_output(self._session, self._channel, self._timeout,
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 105, in _read_output
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     stdout_size, stdout_chunk = channel.read()
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "ssh2/channel.pyx", line 116, in ssh2.channel.Channel.read
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "ssh2/channel.pyx", line 144, in ssh2.channel.Channel.read_ex
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "ssh2/utils.pyx", line 214, in ssh2.utils.handle_error_codes
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > ssh2.exceptions.SocketRecvError
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >               
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > During handling of the above exception, another exception occurred:
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >               
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5113, in wrapper
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = method(*args[1:], **kwargs)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4982, in disrupt_bootstrap_streaming_error
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     self.cluster.decommission(new_node, timeout=7200)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4959, in decommission
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     node.run_nodetool("decommission", timeout=timeout, retry=0)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2618, in run_nodetool
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = _run()
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 65, in inner     
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     if self._run_on_retryable_exception(exc, new_session):
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     raise RetryableNetworkException(str(exc), original=exc)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > sdcm.remote.base.RetryableNetworkException

As a result, the node was decommissioned from the DB cluster, but SCT never finished removing it on its side, because after the point of failure the essential verify_decommission method was not called:

    def decommission(self, node: BaseNode, timeout: int | float = None):                                
        with adaptive_timeout(operation=Operations.DECOMMISSION, node=node):                            
            node.run_nodetool("decommission", timeout=timeout, retry=0)                                 
        self.verify_decommission(node)

This, in turn, left us with a former DB node (node-11) that was still able to run nodetool commands.
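A minimal sketch of one possible direction, not an actual SCT patch: catch the transport-level failure (here a RetryableNetworkException from a dropped SSH session) and re-check the ring from another live node before deciding the decommission failed, so verify_decommission still runs when the operation actually succeeded. The _node_left_ring helper and the self.log call are illustrative only; the other names come from the snippet above.

    def decommission(self, node: BaseNode, timeout: int | float = None):
        try:
            with adaptive_timeout(operation=Operations.DECOMMISSION, node=node):
                node.run_nodetool("decommission", timeout=timeout, retry=0)
        except Exception as exc:  # e.g. RetryableNetworkException from a dropped SSH session
            # The command may still have completed on the server side; ask another
            # live node whether the target already left the ring before giving up.
            if not self._node_left_ring(node):  # hypothetical helper
                raise
            self.log.warning("nodetool decommission raised %s, but the node already left "
                             "the ring; finishing the SCT-side cleanup anyway", exc)
        # Always run the bookkeeping once the node is out of the ring, so it is
        # also removed from the Python list of nodes.
        self.verify_decommission(node)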

So the next nemesis, disrupt_terminate_and_replace_node, picked that decommissioned node (node-11) to run its nodetool status verification check.
No longer being part of the DB cluster, it returned a stale "all UP" result even after the nemesis had terminated node-9:

2024-06-22 06:47:50,206 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Running command "/usr/bin/nodetool  status "...
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Datacenter: eu-west
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: ===================
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Status=Up/Down
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: |/ State=Normal/Leaving/Joining/Moving
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: -- Address     Load     Tokens Owns Host ID                              Rack
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.130 23.64 GB 256    ?    1dd0f17d-bc11-44b9-8497-c56a901bc71d 1c
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.56  24.96 GB 256    ?    630790e3-a8e6-4b98-97e8-ea4a94de5dc2 1c  
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.6   24.12 GB 256    ?    5ce7d885-dc11-414e-9162-bcbeaa5dd8e8 1c
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.8.225  20.84 GB 256    ?    f60d7cee-5b6b-4617-aedf-0a3d341425b2 1c  
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.9.188  24.95 GB 256    ?    473a2961-0169-48b2-9dfb-8bf06a6f098e 1c

This led to one more error, caused by the wrong nodetool status result:

2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5113, in wrapper
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = method(*args[1:], **kwargs)
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1542, in disrupt_terminate_and_replace_node
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     assert get_node_state(old_node_ip) == "DN", "Removed node state should be DN"
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > AssertionError: Removed node state should be DN
2024-06-22 06:47:50,929 f:nemesis.py      l:5149 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: TerminateAndReplaceNode duration -> 498 s

So this run exposed multiple problems:

  • SCT doesn't complete the required cleanup when the nodetool decommission command fails, as happened here
  • SCT used a decommissioned node to run the nodetool status command
  • SCT doesn't validate the nodetool status result for correctness (node count, node matching, ...); see the sketch below
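As a rough illustration of the last point, a minimal sketch (not SCT code) of a sanity check on the nodetool status output, assuming the test keeps its own expected set of host IDs; parse_status_host_ids, assert_status_is_consistent and expected_host_ids are hypothetical names:

    def parse_status_host_ids(nodetool_status_output: str) -> set[str]:
        """Extract the Host ID column from `nodetool status` output."""
        host_ids = set()
        for line in nodetool_status_output.splitlines():
            parts = line.split()
            # Data rows start with a two-letter status/state code such as "UN" or "DN".
            if len(parts) >= 7 and len(parts[0]) == 2 and parts[0][0] in "UD":
                host_ids.add(parts[-2])  # Host ID is the second-to-last column
        return host_ids

    def assert_status_is_consistent(output: str, expected_host_ids: set[str]) -> None:
        # Fail fast if the status view is stale or was taken from a node that
        # is no longer a cluster member (missing or unexpected host IDs).
        seen = parse_status_host_ids(output)
        missing = expected_host_ids - seen
        unexpected = seen - expected_host_ids
        assert not missing and not unexpected, (
            f"nodetool status looks stale: missing={missing}, unexpected={unexpected}")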

Impact

Spurious test failures that could be avoided.

How frequently does it reproduce?

Unknown.

Installation details

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • longevity-parallel-topology-schema--db-node-9ab27f39-9 (54.229.12.207 | 10.4.9.188) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-8 (52.208.78.175 | 10.4.8.225) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-7 (54.73.47.64 | 10.4.11.129) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-6 (52.18.221.226 | 10.4.8.164) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-5 (52.213.136.109 | 10.4.10.6) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-4 (54.75.205.14 | 10.4.8.78) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-3 (52.30.77.69 | 10.4.10.3) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-2 (79.125.60.207 | 10.4.11.192) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-12 (54.72.238.52 | 10.4.8.116) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-11 (99.80.203.111 | 10.4.11.228) (shards: -1)
  • longevity-parallel-topology-schema--db-node-9ab27f39-10 (34.249.118.64 | 10.4.10.130) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-1 (34.251.143.100 | 10.4.10.56) (shards: 7)

OS / Image: ami-038debe0a5d79391a (aws: undefined_region)

Test: longevity-schema-topology-changes-12h-test
Test id: 9ab27f39-b614-45d6-8649-94c33fdf70ca
Test name: scylla-master/tier1/longevity-schema-topology-changes-12h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9ab27f39-b614-45d6-8649-94c33fdf70ca
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9ab27f39-b614-45d6-8649-94c33fdf70ca

Logs:

Jenkins job URL
Argus
