Description
Opened on Sep 30, 2024
Context
We had a 7-node Scylla cluster in GCP, n2d-highmem-32 with 6TB local SSDs, running Scylla 6.0.3, Scylla Operator 1.13.0, Scylla Manager 3.3.3.
(It is the same cluster that was the protagonist of #2068)
Before the current issue started, we:
- upgraded from ScyllaDB 5.4.9 to 6.0.3,
- enabled the Raft-based consistent topology updates feature (as documented here; see the sketch below),
...and wanted to upgrade to the latest 6.1.1 (as we were trying to fix scylladb/scylladb#19793).
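For reference, the enablement followed the documented procedure, which roughly boils down to triggering the upgrade through the REST API on one node and then checking that every node reports it as done (the localhost address and port are the defaults; treat the exact endpoint as our recollection of the docs rather than gospel):
# trigger the upgrade to raft-managed (consistent) topology, on one node only:
curl -X POST "http://localhost:10000/storage_service/raft_topology/upgrade"
# then, on every node, verify that the upgrade state is reported as "done":
curl "http://localhost:10000/storage_service/raft_topology/upgrade"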
What happened
Pod 3 loses its data on 16th of Sep
On Sep 16th, 19:14 UTC, pod 3 (host ID dea17e3f-198a-4ab8-b246-ff29e103941a) lost its data.
The local volume provisioner on that node logged:
2024-09-16T19:14:27.217439362Z I0916 19:14:27.217317 1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:14:30.717550046Z I0916 19:14:30.717390 1 deleter.go:195] Start cleanup for pv local-pv-a3d05a4c
2024-09-16T19:14:30.718020906Z I0916 19:14:30.717933 1 deleter.go:266] Deleting PV file volume "local-pv-a3d05a4c" contents at hostpath "/mnt/raid-disks/disk0", mountpath "/mnt/raid-disks/disk0"
2024-09-16T19:15:42.190376535Z I0916 19:15:42.190192 1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:16:50.731826090Z I0916 19:16:50.731625 1 deleter.go:165] Deleting pv local-pv-a3d05a4c after successful cleanup
2024-09-16T19:16:50.745132170Z I0916 19:16:50.744977 1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:16:50.757703739Z I0916 19:16:50.757577 1 cache.go:73] Deleted pv "local-pv-a3d05a4c" from cache
2024-09-16T19:17:00.746148946Z I0916 19:17:00.745992 1 discovery.go:384] Found new volume at host path "/mnt/raid-disks/disk0" with capacity 6438149685248, creating Local PV "local-pv-a3d05a4c", required volumeMode "Filesystem"
2024-09-16T19:17:00.752645076Z I0916 19:17:00.752537 1 discovery.go:418] Created PV "local-pv-a3d05a4c" for volume at "/mnt/raid-disks/disk0"
2024-09-16T19:17:00.752837656Z I0916 19:17:00.752744 1 cache.go:55] Added pv "local-pv-a3d05a4c" to cache
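For anyone retracing this, the PV/PVC state around such an event can be checked with plain kubectl; the scylla namespace and the provisioner label below are illustrative and depend on how the local-volume provisioner is deployed:
kubectl get pv local-pv-a3d05a4c -o yaml        # status, claimRef and node affinity of the PV
kubectl -n scylla get pvc                       # which data PVCs are still bound
kubectl -n kube-system logs -l app=local-volume-provisioner --since=2h   # provisioner activity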
The nodetool status output at that time looked like nothing was wrong:
root@gke-main-scylla-6-25fcbc5b-8hgv:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.39 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.62 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.70 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.51 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
UN 10.7.248.124 2.54 TB 256 ? dea17e3f-198a-4ab8-b246-ff29e103941a us-west1-b
UN 10.7.249.238 2.31 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.82 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The logs from 2024-09-16 19:10-19:34 UTC: extract-2024-09-30T14_19_57.781Z.csv.zip
But as the disk usage on the nodes was constantly growing, we assumed that the node would automatically get recreated, so we left it like that for ~2 days.
Pod 3 replacement fails on 18th of Sep
Then we noticed that it was failing to start with:
ERROR 2024-09-18 07:52:08,434 [shard 0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID dea17e3f-198a-4ab8-b246-ff29e103941a not found)
...and that nodetool status was showing this:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.39 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.61 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.69 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.51 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 2.54 TB 256 ? dff2a772-f3e3-4e64-a380-7deaa1bf96df us-west1-b
UN 10.7.249.238 2.32 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.82 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The old host ID from the error message was nowhere to be found:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.20 | Scylla 6.0.3-0.20240808.a56f7ce21ad4 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;
server_id | group_id
--------------------------------------+--------------------------------------
5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
dff2a772-f3e3-4e64-a380-7deaa1bf96df | 904c8960-2c68-11ee-979c-be9922839fd2
(7 rows)
cqlsh> select host_id, up from system.cluster_status;
host_id | up
--------------------------------------+-------
60daa392-6362-423d-93b2-1ff747903287 | True
787555a6-89d6-4b33-941c-940415380062 | True
dff2a772-f3e3-4e64-a380-7deaa1bf96df | False
5342afaf-c19c-4be2-ada1-929698a4c398 | True
813f49f9-e397-4d70-8300-79fa91817f11 | True
5cc72b36-6fcf-4790-a540-930e544d59d2 | True
880977bf-7cbb-4e0f-be82-ded853da57aa | True
(7 rows)
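For context on the startup error above: as far as we understand, when a member is replaced, the new Scylla instance boots with a replace request pointing at the old host ID (the replace_node_first_boot option set by the operator), and startup fails if that host ID is no longer a member of group 0. A minimal sketch of the relevant bit of configuration, purely illustrative:
# scylla.yaml on the replacing node (or the equivalent command-line flag);
# the value here is the host ID from the error message above:
replace_node_first_boot: dea17e3f-198a-4ab8-b246-ff29e103941a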
Pod 3 replacement fails again on 19th of Sep
We tried deleting the old node that had the local SSD issue, creating a new one in its place, and letting the cluster do the node replacement again, but it failed with a similar error to the one above:
ERROR 2024-09-19 14:06:53,099 [shard 0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID dff2a772-f3e3-4e64-a380-7deaa1bf96df not found)
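"Deleting the old node and creating a new one" above means recreating the underlying GKE VM and letting the operator reschedule the pod onto it; an illustrative outline (node name is hypothetical, not the exact commands we ran):
kubectl drain gke-main-scylla-6-25fcbc5b-xxxx --ignore-daemonsets --delete-emptydir-data
kubectl delete node gke-main-scylla-6-25fcbc5b-xxxx
# the managed instance group then recreates the VM, the operator reschedules the Scylla pod,
# and the new instance boots with a replace request for the down host ID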
Our cluster looked like this then:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.42 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.65 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.72 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.54 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 2.54 TB 256 ? 3ec289d5-5910-4759-93bc-6e26ab5cda9f us-west1-b
UN 10.7.249.238 2.34 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.85 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
root@gke-main-scylla-6-25fcbc5b-bq2w:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.20 | Scylla 6.0.3-0.20240808.a56f7ce21ad4 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;
server_id | group_id
--------------------------------------+--------------------------------------
3ec289d5-5910-4759-93bc-6e26ab5cda9f | 904c8960-2c68-11ee-979c-be9922839fd2
5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
(7 rows)
cqlsh> select host_id, up from system.cluster_status;
host_id | up
--------------------------------------+-------
60daa392-6362-423d-93b2-1ff747903287 | True
787555a6-89d6-4b33-941c-940415380062 | True
3ec289d5-5910-4759-93bc-6e26ab5cda9f | False
5342afaf-c19c-4be2-ada1-929698a4c398 | True
813f49f9-e397-4d70-8300-79fa91817f11 | True
5cc72b36-6fcf-4790-a540-930e544d59d2 | True
880977bf-7cbb-4e0f-be82-ded853da57aa | True
(7 rows)
Node removal fails on 24th of Sep
At this point we decided to try removing the down node with ID 3ec289d5-5910-4759-93bc-6e26ab5cda9f from the cluster, so that we could continue our original task of upgrading Scylla to 6.1.1, planning to come back to replacing the missing node after that.
However, the removenode operation also failed.
ERROR 2024-09-24 15:21:28,364 [shard 0:strm] raft_topology - Removenode failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (raft topology: exec_global_command(stream_ranges) failed with std::runtime_error (failed status returned from 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109))). Request ID: af34f102-7977-11ef-db22-982520c2c047
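The error points at 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109 as the node that returned the failed status, so we grepped its logs around that time for stream/repair errors, roughly like this (namespace and pod name are illustrative placeholders):
kubectl -n scylla logs <scylla-pod-for-10.7.243.109> -c scylla --since=2h | grep -iE 'stream|repair|error'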
We couldn't find any meaningful errors before this message, so I'm attaching ~100k lines of logs from 2024-09-24 14:15-15:22 UTC here: extract-2024-09-30T14_10_55.525Z.csv.zip
Node removal fails after a retry on 27th of Sep
Retrying removenode didn't work:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool removenode 3ec289d5-5910-4759-93bc-6e26ab5cda9f
error executing POST request to http://localhost:10000/storage_service/remove_node with parameters {"host_id": "3ec289d5-5910-4759-93bc-6e26ab5cda9f"}: remote replied with status code 500 Internal Server Error:
std::runtime_error (removenode: node 3ec289d5-5910-4759-93bc-6e26ab5cda9f is in 'removing' state. Wait for it to be in 'normal' state)
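To see what state the topology layer thinks the node is in, we checked system.topology (the node_state column name is our reading of the 6.0 schema, so treat it as an assumption):
cqlsh> select host_id, node_state from system.topology ;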
We tried doing a rolling restart of the cluster and retrying, similarly to what we did in #2068, but that did not help this time. The error message was the same as before, just with a different timestamp and request ID:
ERROR 2024-09-27 18:36:15,901 [shard 0:strm] raft_topology - Removenode failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (raft topology: exec_global_command(stream_ranges) failed with std::runtime_error (failed status returned from 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109))). Request ID: 28a3f924-7bef-11ef-17f5-dfc99f47da8e
Additional info
During this time we were surprised a few times by our Scylla disks filling up with snapshots, getting dangerously close to 80% disk usage. For example:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# du -hs /var/lib/scylla/data/production/activities-5de370f07b2011ed86b49f900e84b8e9/snapshots
2.9T /var/lib/scylla/data/production/activities-5de370f07b2011ed86b49f900e84b8e9/snapshots
We cleared the snapshots when that happened using the nodetool clearsnapshot command.
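For reference, this is roughly how we inspect and clean them up (production is the keyspace from the du output above; the tag is a placeholder):
nodetool listsnapshots                       # snapshot tags and true on-disk size per table
nodetool clearsnapshot production            # drop all snapshots of the production keyspace
nodetool clearsnapshot -t <tag> production   # or only a specific snapshot tag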