Description
Opened on Sep 30, 2024
Context
We had a 7-node Scylla cluster in GCP, n2d-highmem-32 with 6TB local SSDs, running Scylla 6.0.3, Scylla Operator 1.13.0, Scylla Manager 3.3.3.
(It is the same cluster that was the protagonist of #2068)
Before the current issue started, we:
- upgraded from ScyllaDB 5.4.9 to 6.0.3,
- enabled the Raft-based consistent topology updates feature (as documented here; see the sketch below),
...and wanted to upgrade to the latest 6.1.1 (as we were trying to fix scylladb/scylladb#19793).
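For reference, the enablement followed the documented procedure, which roughly boils down to triggering the upgrade through the REST API on one node and then checking that every node reports it as done (the localhost address and port are the defaults; treat the exact endpoint as our recollection of the docs rather than gospel):
# trigger the upgrade to raft-managed (consistent) topology, on one node only:
curl -X POST "http://localhost:10000/storage_service/raft_topology/upgrade"
# then, on every node, verify that the upgrade state is reported as "done":
curl "http://localhost:10000/storage_service/raft_topology/upgrade"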
What happened
Pod 3 loses its data on 16th of Sep
On Sep 16th, 19:14 UTC, pod 3 (host ID dea17e3f-198a-4ab8-b246-ff29e103941a) lost its data.
The local volume provisioner on that node logged:
2024-09-16T19:14:27.217439362Z I0916 19:14:27.217317 1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:14:30.717550046Z I0916 19:14:30.717390 1 deleter.go:195] Start cleanup for pv local-pv-a3d05a4c
2024-09-16T19:14:30.718020906Z I0916 19:14:30.717933 1 deleter.go:266] Deleting PV file volume "local-pv-a3d05a4c" contents at hostpath "/mnt/raid-disks/disk0", mountpath "/mnt/raid-disks/disk0"
2024-09-16T19:15:42.190376535Z I0916 19:15:42.190192 1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:16:50.731826090Z I0916 19:16:50.731625 1 deleter.go:165] Deleting pv local-pv-a3d05a4c after successful cleanup
2024-09-16T19:16:50.745132170Z I0916 19:16:50.744977 1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:16:50.757703739Z I0916 19:16:50.757577 1 cache.go:73] Deleted pv "local-pv-a3d05a4c" from cache
2024-09-16T19:17:00.746148946Z I0916 19:17:00.745992 1 discovery.go:384] Found new volume at host path "/mnt/raid-disks/disk0" with capacity 6438149685248, creating Local PV "local-pv-a3d05a4c", required volumeMode "Filesystem"
2024-09-16T19:17:00.752645076Z I0916 19:17:00.752537 1 discovery.go:418] Created PV "local-pv-a3d05a4c" for volume at "/mnt/raid-disks/disk0"
2024-09-16T19:17:00.752837656Z I0916 19:17:00.752744 1 cache.go:55] Added pv "local-pv-a3d05a4c" to cache
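For anyone retracing this, the PV/PVC state around such an event can be checked with plain kubectl; the scylla namespace and the provisioner label below are illustrative and depend on how the local-volume provisioner is deployed:
kubectl get pv local-pv-a3d05a4c -o yaml        # status, claimRef and node affinity of the PV
kubectl -n scylla get pvc                       # which data PVCs are still bound
kubectl -n kube-system logs -l app=local-volume-provisioner --since=2h   # provisioner activity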
The nodetool status output at that time looked like nothing was wrong:
root@gke-main-scylla-6-25fcbc5b-8hgv:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.39 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.62 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.70 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.51 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
UN 10.7.248.124 2.54 TB 256 ? dea17e3f-198a-4ab8-b246-ff29e103941a us-west1-b
UN 10.7.249.238 2.31 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.82 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The logs from 2024-09-16 19:10-19:34 UTC: extract-2024-09-30T14_19_57.781Z.csv.zip
But as the disk usage on the nodes was constantly growing, we assumed that the node would automatically get recreated, so we left it like that for ~2 days.
Pod 3 replacement fails on 18th of Sep
Then we noticed that it was failing to start with:
ERROR 2024-09-18 07:52:08,434 [shard 0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID dea17e3f-198a-4ab8-b246-ff29e103941a not found)
...and that nodetool status was showing this:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.39 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.61 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.69 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.51 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 2.54 TB 256 ? dff2a772-f3e3-4e64-a380-7deaa1bf96df us-west1-b
UN 10.7.249.238 2.32 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.82 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The old host ID from the error message was nowhere to be found:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.20 | Scylla 6.0.3-0.20240808.a56f7ce21ad4 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;
server_id | group_id
--------------------------------------+--------------------------------------
5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
dff2a772-f3e3-4e64-a380-7deaa1bf96df | 904c8960-2c68-11ee-979c-be9922839fd2
(7 rows)
cqlsh> select host_id, up from system.cluster_status;
host_id | up
--------------------------------------+-------
60daa392-6362-423d-93b2-1ff747903287 | True
787555a6-89d6-4b33-941c-940415380062 | True
dff2a772-f3e3-4e64-a380-7deaa1bf96df | False
5342afaf-c19c-4be2-ada1-929698a4c398 | True
813f49f9-e397-4d70-8300-79fa91817f11 | True
5cc72b36-6fcf-4790-a540-930e544d59d2 | True
880977bf-7cbb-4e0f-be82-ded853da57aa | True
(7 rows)
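For context on the startup error above: as far as we understand, when a member is replaced, the new Scylla instance boots with a replace request pointing at the old host ID (the replace_node_first_boot option set by the operator), and startup fails if that host ID is no longer a member of group 0. A minimal sketch of the relevant bit of configuration, purely illustrative:
# scylla.yaml on the replacing node (or the equivalent command-line flag);
# the value here is the host ID from the error message above:
replace_node_first_boot: dea17e3f-198a-4ab8-b246-ff29e103941a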
Pod 3 replacement fails again on 19th of Sep
We tried deleting the old node that had the local SSD issue, creating a new one in its place, and letting the cluster do the node replacement again, but it failed with a similar error to the one above:
ERROR 2024-09-19 14:06:53,099 [shard 0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID dff2a772-f3e3-4e64-a380-7deaa1bf96df not found)
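"Deleting the old node and creating a new one" above means recreating the underlying GKE VM and letting the operator reschedule the pod onto it; an illustrative outline (node name is hypothetical, not the exact commands we ran):
kubectl drain gke-main-scylla-6-25fcbc5b-xxxx --ignore-daemonsets --delete-emptydir-data
kubectl delete node gke-main-scylla-6-25fcbc5b-xxxx
# the managed instance group then recreates the VM, the operator reschedules the Scylla pod,
# and the new instance boots with a replace request for the down host ID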
Our cluster looked like this then:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.42 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.65 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.72 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.54 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 2.54 TB 256 ? 3ec289d5-5910-4759-93bc-6e26ab5cda9f us-west1-b
UN 10.7.249.238 2.34 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.85 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
root@gke-main-scylla-6-25fcbc5b-bq2w:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.20 | Scylla 6.0.3-0.20240808.a56f7ce21ad4 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;
server_id | group_id
--------------------------------------+--------------------------------------
3ec289d5-5910-4759-93bc-6e26ab5cda9f | 904c8960-2c68-11ee-979c-be9922839fd2
5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
(7 rows)
cqlsh> select host_id, up from system.cluster_status;
host_id | up
--------------------------------------+-------
60daa392-6362-423d-93b2-1ff747903287 | True
787555a6-89d6-4b33-941c-940415380062 | True
3ec289d5-5910-4759-93bc-6e26ab5cda9f | False
5342afaf-c19c-4be2-ada1-929698a4c398 | True
813f49f9-e397-4d70-8300-79fa91817f11 | True
5cc72b36-6fcf-4790-a540-930e544d59d2 | True
880977bf-7cbb-4e0f-be82-ded853da57aa | True
(7 rows)
Node removal fails on 24th of Sep
At this point we decided to try removing the down node with ID 3ec289d5-5910-4759-93bc-6e26ab5cda9f from the cluster, so that we could continue our original task of upgrading Scylla to 6.1.1, planning to come back to replacing the missing node after that.
However, the removenode operation also failed.
ERROR 2024-09-24 15:21:28,364 [shard 0:strm] raft_topology - Removenode failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (raft topology: exec_global_command(stream_ranges) failed with std::runtime_error (failed status returned from 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109))). Request ID: af34f102-7977-11ef-db22-982520c2c047
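The error points at 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109 as the node that returned the failed status, so we grepped its logs around that time for stream/repair errors, roughly like this (namespace and pod name are illustrative placeholders):
kubectl -n scylla logs <scylla-pod-for-10.7.243.109> -c scylla --since=2h | grep -iE 'stream|repair|error'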
We couldn't find any meaningful errors before this message, so I'm attaching ~100k lines of logs from 2024-09-24 14:15-15:22 UTC here: extract-2024-09-30T14_10_55.525Z.csv.zip
Node removal fails after a retry on 27th of Sep
Retrying removenode didn't work:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool removenode 3ec289d5-5910-4759-93bc-6e26ab5cda9f
error executing POST request to http://localhost:10000/storage_service/remove_node with parameters {"host_id": "3ec289d5-5910-4759-93bc-6e26ab5cda9f"}: remote replied with status code 500 Internal Server Error:
std::runtime_error (removenode: node 3ec289d5-5910-4759-93bc-6e26ab5cda9f is in 'removing' state. Wait for it to be in 'normal' state)
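To see what state the topology layer thinks the node is in, we checked system.topology (the node_state column name is our reading of the 6.0 schema, so treat it as an assumption):
cqlsh> select host_id, node_state from system.topology ;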
We tried doing a rolling restart of the cluster and retrying, similarly to what we did in #2068, but that did not help this time. The error message was the same as before, just with a different timestamp and request ID:
ERROR 2024-09-27 18:36:15,901 [shard 0:strm] raft_topology - Removenode failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (raft topology: exec_global_command(stream_ranges) failed with std::runtime_error (failed status returned from 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109))). Request ID: 28a3f924-7bef-11ef-17f5-dfc99f47da8e
Additional info
During this time we were surprised a few times by our Scylla disks filling up with snapshots, getting dangerously close to 80% disk usage. For example:
root@gke-main-scylla-6-25fcbc5b-bq2w:/# du -hs /var/lib/scylla/data/production/activities-5de370f07b2011ed86b49f900e84b8e9/snapshots
2.9T /var/lib/scylla/data/production/activities-5de370f07b2011ed86b49f900e84b8e9/snapshots
We cleared the snapshots when that happened using the nodetool clearsnapshot command.
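For reference, this is roughly how we inspect and clean them up (production is the keyspace from the du output above; the tag is a placeholder):
nodetool listsnapshots                       # snapshot tags and true on-disk size per table
nodetool clearsnapshot production            # drop all snapshots of the production keyspace
nodetool clearsnapshot -t <tag> production   # or only a specific snapshot tag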