SCT doesn't finish decommission logic (node stays in the Python list of nodes) if the nodetool decommission command fails while the operation actually succeeds #7833

Open

Description

Packages

Scylla version: 6.1.0~dev-20240619.7567b87e72cf with build-id f0a41eece67ec7a8d64b047e743f0396d9f23463

Kernel Version: 5.15.0-1063-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

While running the disrupt_bootstrap_streaming_error nemesis we got the following failure:

2024-06-22 06:17:17,314 f:nemesis.py      l:1835 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method bootstrap_streaming_error
2024-06-22 06:17:17,315 f:nemesis.py      l:406  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-parallel-topology-schema--db-node-9ab27f39-10 [34.249.118.64 | 10.4.10.130] with running nemesis: BootstrapStreamingError
...
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Datacenter: eu-west
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: ===================
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Status=Up/Down 
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: |/ State=Normal/Leaving/Joining/Moving
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: -- Address     Load      Tokens Owns Host ID                              Rack
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.130 ?         256    ?    1dd0f17d-bc11-44b9-8497-c56a901bc71d 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.56  ?         256    ?    630790e3-a8e6-4b98-97e8-ea4a94de5dc2 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.6   ?         256    ?    5ce7d885-dc11-414e-9162-bcbeaa5dd8e8 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.11.228 490.32 MB 256    ?    a9f95872-0677-4071-9fba-9675e404c4d0 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.8.225  ?         256    ?    f60d7cee-5b6b-4617-aedf-0a3d341425b2 1c
2024-06-22 06:23:09,349 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.9.188  ?         256    ?    473a2961-0169-48b2-9dfb-8bf06a6f098e 1c
...
2024-06-22 06:23:36,934 f:nemesis.py      l:4978 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Wait 5 minutes with new topology
...
2024-06-22 06:28:36,983 f:nemesis.py      l:4981 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Decommission added new node
2024-06-22 06:28:37,829 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-22 06:28:37.827: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=b508c05a-f37e-4295-9b1a-fcc5d203e2ca: nodetool_command=decommission node=longevity-parallel-topology-schema--db-node-9ab27f39-11
...
2024-06-22 06:28:37,827 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Running command "/usr/bin/nodetool  decommission "...

...

2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <bound method Nemesis.disrupt_bootstrap_streaming_error of <sdcm.nemesis.SisyphusMonkey object at 0x7f1901961de0>>
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = connection.run(**command_kwargs)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 650, in _complete_run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     raise exception
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 77, in run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     self._read_output(self._session, self._channel, self._timeout,
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 105, in _read_output
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     stdout_size, stdout_chunk = channel.read()
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "ssh2/channel.pyx", line 116, in ssh2.channel.Channel.read
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "ssh2/channel.pyx", line 144, in ssh2.channel.Channel.read_ex
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "ssh2/utils.pyx", line 214, in ssh2.utils.handle_error_codes
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > ssh2.exceptions.SocketRecvError
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >               
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > During handling of the above exception, another exception occurred:
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >               
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5113, in wrapper
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = method(*args[1:], **kwargs)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4982, in disrupt_bootstrap_streaming_error
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     self.cluster.decommission(new_node, timeout=7200)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4959, in decommission
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     node.run_nodetool("decommission", timeout=timeout, retry=0)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2618, in run_nodetool
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = _run()
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 65, in inner     
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     if self._run_on_retryable_exception(exc, new_session):
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     raise RetryableNetworkException(str(exc), original=exc)
2024-06-22 06:29:31,655 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > sdcm.remote.base.RetryableNetworkException

As a result, the node was decommissioned from the DB cluster, but SCT never finished removing it on its side, because after the point of failure the essential verify_decommission method was not called:

    def decommission(self, node: BaseNode, timeout: int | float = None):                                
        with adaptive_timeout(operation=Operations.DECOMMISSION, node=node):                            
            node.run_nodetool("decommission", timeout=timeout, retry=0)                                 
        self.verify_decommission(node)

This, in turn, left us with a former DB node (node-11) that was still able to run nodetool commands.
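A minimal sketch of one possible direction, not an actual SCT patch: catch the transport-level failure (here a RetryableNetworkException from a dropped SSH session) and re-check the ring from another live node before deciding the decommission failed, so verify_decommission still runs when the operation actually succeeded. The _node_left_ring helper and the self.log call are illustrative only; the other names come from the snippet above.

    def decommission(self, node: BaseNode, timeout: int | float = None):
        try:
            with adaptive_timeout(operation=Operations.DECOMMISSION, node=node):
                node.run_nodetool("decommission", timeout=timeout, retry=0)
        except Exception as exc:  # e.g. RetryableNetworkException from a dropped SSH session
            # The command may still have completed on the server side; ask another
            # live node whether the target already left the ring before giving up.
            if not self._node_left_ring(node):  # hypothetical helper
                raise
            self.log.warning("nodetool decommission raised %s, but the node already left "
                             "the ring; finishing the SCT-side cleanup anyway", exc)
        # Always run the bookkeeping once the node is out of the ring, so it is
        # also removed from the Python list of nodes.
        self.verify_decommission(node)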

So the next nemesis, disrupt_terminate_and_replace_node, picked that decommissioned node (node-11) to run its nodetool status verification check.
No longer being part of the DB cluster, it returned a stale "all UP" result even after the nemesis had terminated node-9:

2024-06-22 06:47:50,206 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Running command "/usr/bin/nodetool  status "...
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Datacenter: eu-west
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: ===================
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: Status=Up/Down
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: |/ State=Normal/Leaving/Joining/Moving
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: -- Address     Load     Tokens Owns Host ID                              Rack
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.130 23.64 GB 256    ?    1dd0f17d-bc11-44b9-8497-c56a901bc71d 1c
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.56  24.96 GB 256    ?    630790e3-a8e6-4b98-97e8-ea4a94de5dc2 1c  
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.10.6   24.12 GB 256    ?    5ce7d885-dc11-414e-9162-bcbeaa5dd8e8 1c
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.8.225  20.84 GB 256    ?    f60d7cee-5b6b-4617-aedf-0a3d341425b2 1c  
2024-06-22 06:47:50,428 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.11.228>: UN 10.4.9.188  24.95 GB 256    ?    473a2961-0169-48b2-9dfb-8bf06a6f098e 1c

This led to one more error, caused by the wrong nodetool status result:

2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5113, in wrapper
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     result = method(*args[1:], **kwargs)
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1542, in disrupt_terminate_and_replace_node
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR >     assert get_node_state(old_node_ip) == "DN", "Removed node state should be DN"
2024-06-22 06:47:50,929 f:nemesis.py      l:5137 c:sdcm.nemesis         p:ERROR > AssertionError: Removed node state should be DN
2024-06-22 06:47:50,929 f:nemesis.py      l:5149 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: TerminateAndReplaceNode duration -> 498 s

So this run exposed multiple problems:

  • SCT doesn't complete the required cleanup when the nodetool decommission command fails, as happened here
  • SCT used a decommissioned node to run the nodetool status command
  • SCT doesn't validate the nodetool status result for correctness (node count, node matching, ...); see the sketch below
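As a rough illustration of the last point, a minimal sketch (not SCT code) of a sanity check on the nodetool status output, assuming the test keeps its own expected set of host IDs; parse_status_host_ids, assert_status_is_consistent and expected_host_ids are hypothetical names:

    def parse_status_host_ids(nodetool_status_output: str) -> set[str]:
        """Extract the Host ID column from `nodetool status` output."""
        host_ids = set()
        for line in nodetool_status_output.splitlines():
            parts = line.split()
            # Data rows start with a two-letter status/state code such as "UN" or "DN".
            if len(parts) >= 7 and len(parts[0]) == 2 and parts[0][0] in "UD":
                host_ids.add(parts[-2])  # Host ID is the second-to-last column
        return host_ids

    def assert_status_is_consistent(output: str, expected_host_ids: set[str]) -> None:
        # Fail fast if the status view is stale or was taken from a node that
        # is no longer a cluster member (missing or unexpected host IDs).
        seen = parse_status_host_ids(output)
        missing = expected_host_ids - seen
        unexpected = seen - expected_host_ids
        assert not missing and not unexpected, (
            f"nodetool status looks stale: missing={missing}, unexpected={unexpected}")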

Impact

Spurious test failures that could be avoided.

How frequently does it reproduce?

Unknown.

Installation details

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • longevity-parallel-topology-schema--db-node-9ab27f39-9 (54.229.12.207 | 10.4.9.188) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-8 (52.208.78.175 | 10.4.8.225) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-7 (54.73.47.64 | 10.4.11.129) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-6 (52.18.221.226 | 10.4.8.164) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-5 (52.213.136.109 | 10.4.10.6) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-4 (54.75.205.14 | 10.4.8.78) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-3 (52.30.77.69 | 10.4.10.3) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-2 (79.125.60.207 | 10.4.11.192) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-12 (54.72.238.52 | 10.4.8.116) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-11 (99.80.203.111 | 10.4.11.228) (shards: -1)
  • longevity-parallel-topology-schema--db-node-9ab27f39-10 (34.249.118.64 | 10.4.10.130) (shards: 7)
  • longevity-parallel-topology-schema--db-node-9ab27f39-1 (34.251.143.100 | 10.4.10.56) (shards: 7)

OS / Image: ami-038debe0a5d79391a (aws: undefined_region)

Test: longevity-schema-topology-changes-12h-test
Test id: 9ab27f39-b614-45d6-8649-94c33fdf70ca
Test name: scylla-master/tier1/longevity-schema-topology-changes-12h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9ab27f39-b614-45d6-8649-94c33fdf70ca
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9ab27f39-b614-45d6-8649-94c33fdf70ca

Logs:

Jenkins job URL
Argus
