Description
With ScyllaDB Operator, we are observing a very high frequency of failed bootstraps of multi-datacenter clusters, with the following error message. In the example below it comes from node europe-west4-a-0 (c1af46c6-4f5a-4265-916e-10058fef1d0d, 10.43.0.15):
2025-08-07T16:03:20.308728619Z INFO 2025-08-07 16:03:20,308 [shard 0:main] init - join cluster
2025-08-07T16:03:20.308752839Z INFO 2025-08-07 16:03:20,308 [shard 0:strm] storage_service - entering STARTING mode
2025-08-07T16:03:20.332395948Z INFO 2025-08-07 16:03:20,332 [shard 0:strm] storage_service - Found group 0 with ID 48bf3021-73a6-11f0-afbf-4cd9f5c8bdd5, with leader of ID 95f6ce46-7a2d-4654-820d-f174e1aff199 and IP 10.35.64.63
2025-08-07T16:03:20.339846987Z INFO 2025-08-07 16:03:20,339 [shard 0:strm] storage_service - Will join existing cluster in raft topology operations mode
2025-08-07T16:03:20.339862577Z INFO 2025-08-07 16:03:20,339 [shard 0:strm] storage_service - Loading persisted peers into the gossiper
2025-08-07T16:03:20.339874507Z INFO 2025-08-07 16:03:20,339 [shard 0:strm] storage_service - initial_contact_nodes={10.35.64.63, 10.27.225.14}, loaded_endpoints=[], loaded_peer_features=0
2025-08-07T16:03:20.339942477Z INFO 2025-08-07 16:03:20,339 [shard 0:strm] storage_service - Performing gossip shadow round
2025-08-07T16:03:20.339949187Z INFO 2025-08-07 16:03:20,339 [shard 0:strm] gossip - Gossip shadow round started with nodes={10.35.64.63, 10.27.225.14}
2025-08-07T16:03:20.355357267Z INFO 2025-08-07 16:03:20,355 [shard 0:strm] gossip - Gossip shadow round finished with nodes_talked={10.27.225.14, 10.35.64.63}
...
2025-08-07T16:03:20.357548926Z INFO 2025-08-07 16:03:20,357 [shard 0:strm] gossip - failure_detector_loop: Started main loop
2025-08-07T16:03:20.357575616Z INFO 2025-08-07 16:03:20,357 [shard 0:strm] raft_group0 - setup_group0: joining group 0...
2025-08-07T16:03:20.357632916Z INFO 2025-08-07 16:03:20,357 [shard 0:strm] raft_group0 - server c1af46c6-4f5a-4265-916e-10058fef1d0d found no local group 0. Discovering...
2025-08-07T16:03:20.365309406Z INFO 2025-08-07 16:03:20,365 [shard 0:strm] raft_group0 - server c1af46c6-4f5a-4265-916e-10058fef1d0d found group 0 with group id 48bf3021-73a6-11f0-afbf-4cd9f5c8bdd5, leader 95f6ce46-7a2d-4654-820d-f174e1aff199
2025-08-07T16:03:20.365323576Z INFO 2025-08-07 16:03:20,365 [shard 0:strm] raft_topology - join: sending the join request to 10.35.64.63
2025-08-07T16:03:20.412128844Z INFO 2025-08-07 16:03:20,412 [shard 0:strm] raft_topology - join: request to join placed, waiting for the response from the topology coordinator
2025-08-07T16:03:20.415015544Z INFO 2025-08-07 16:03:20,414 [shard 0:strm] raft_group0 - Server c1af46c6-4f5a-4265-916e-10058fef1d0d is starting group 0 with id 48bf3021-73a6-11f0-afbf-4cd9f5c8bdd5
2025-08-07T16:03:20.416326514Z INFO 2025-08-07 16:03:20,416 [shard 0:strm] raft_group0 - Detected snapshot with index=0, id=1e2ac6f8-906a-4338-981c-5dc1858ba13a, triggering new snapshot
2025-08-07T16:03:20.416340464Z WARN 2025-08-07 16:03:20,416 [shard 0:strm] raft_group0 - Could not create new snapshot, there are no entries applied
...
2025-08-07T16:03:22.433275447Z ERROR 2025-08-07 16:03:22,433 [shard 0:main] init - Startup failed: std::runtime_error (the topology coordinator rejected request to join the cluster: request canceled because some required nodes are dead)
2025-08-07T16:03:22.495580624Z 2025-08-07 16:03:22,495 WARN exited: scylla (exit status 1; not expected)
2025-08-07T16:03:22.495598024Z 2025-08-07 16:03:22,495 WARN exited: scylla (exit status 1; not expected)
The process then restarts and tries to join the cluster again, but hangs:
2025-08-07T16:03:24.879468091Z INFO 2025-08-07 16:03:24,879 [shard 0:main] init - join cluster
2025-08-07T16:03:24.879471920Z INFO 2025-08-07 16:03:24,879 [shard 0:strm] storage_service - entering STARTING mode
...
(no further related logs)
Meanwhile, the logs from europe-west1-a-0 (95f6ce46-7a2d-4654-820d-f174e1aff199, 10.35.64.63) show:
2025-08-07T16:03:20.369190138Z INFO 2025-08-07 16:03:20,369 [shard 0: gms] raft_topology - received request to join from host_id: c1af46c6-4f5a-4265-916e-10058fef1d0d
...
2025-08-07T16:03:20.408394557Z INFO 2025-08-07 16:03:20,408 [shard 0: gms] raft_topology - placed join request for c1af46c6-4f5a-4265-916e-10058fef1d0d
...
2025-08-07T16:03:21.407623878Z INFO 2025-08-07 16:03:21,407 [shard 0: gms] gossip - InetAddress c1af46c6-4f5a-4265-916e-10058fef1d0d/10.43.0.15 is now UP, status = UNKNOWN
2025-08-07T16:03:21.427971408Z INFO 2025-08-07 16:03:21,427 [shard 0: gms] gossip - Removed endpoint c1af46c6-4f5a-4265-916e-10058fef1d0d
2025-08-07T16:03:21.428001868Z INFO 2025-08-07 16:03:21,427 [shard 0: gms] gossip - InetAddress c1af46c6-4f5a-4265-916e-10058fef1d0d/c1af46c6-4f5a-4265-916e-10058fef1d0d is now DOWN, status = UNKNOWN
2025-08-07T16:03:21.428050228Z INFO 2025-08-07 16:03:21,427 [shard 0: gms] gossip - Finished to force remove node c1af46c6-4f5a-4265-916e-10058fef1d0d
We've occasionally observed this in 2025.1.2 (least frequently) and 2025.1.5, but it became much more prevalent in 2025.2.1, with roughly 80-90% of multi-datacenter clusters created in our tests failing on bootstrap with this error.
I'd like to figure out the cause of this scenario, so I'd appreciate any help in debugging it.
Is it expected that the node can't join the cluster after encountering this error?
Is there something we can do to prevent this?
In case more verbose logs are required to debug this, please let me know which services should have the log level raised.
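To speed that up on our side, a possible way to raise verbosity for the loggers that appear in the excerpts above (`raft_topology`, `raft_group0`, `gossip`, `storage_service`) would be the following; this is only a sketch, assuming `nodetool` is reachable inside the pod and that the operator passes extra flags through to the `scylla` command line:

```shell
# At runtime on a live node (does not survive a restart):
nodetool setlogginglevel raft_topology debug
nodetool setlogginglevel raft_group0 debug
nodetool setlogginglevel gossip debug
nodetool setlogginglevel storage_service debug

# Or persistently, via scylla startup flags:
#   --logger-log-level raft_topology=debug \
#   --logger-log-level raft_group0=debug \
#   --logger-log-level gossip=debug
```

Please confirm whether these are the right loggers, or point out any others that would help.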
Logs and supplementary info (nodetool status, gossipinfo where applicable) can be found here:
europe-west1-a-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west1/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2qnap/pods/basic-vswwm-europe-west1-europe-west1-a-0/
europe-west1-b-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west1/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2qnap/pods/basic-vswwm-europe-west1-europe-west1-b-0/
europe-west1-c-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west1/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2qnap/pods/basic-vswwm-europe-west1-europe-west1-c-0/
europe-west3-a-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west3/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2a92t/pods/basic-vswwm-europe-west3-europe-west3-a-0/
europe-west3-b-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west3/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2a92t/pods/basic-vswwm-europe-west3-europe-west3-b-0/
europe-west3-c-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west3/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2a92t/pods/basic-vswwm-europe-west3-europe-west3-c-0/