
Frequent failures in multi-datacenter bootstrap: the topology coordinator rejected request to join the cluster: request canceled because some required nodes are dead #25410

@rzetelskik

Description

With ScyllaDB Operator, we are observing a very high frequency of failed bootstraps of multi-datacenter clusters, all with the following error message. In the example below, the error comes from node europe-west4-a-0 (c1af46c6-4f5a-4265-916e-10058fef1d0d, 10.43.0.15).

2025-08-07T16:03:20.308728619Z INFO  2025-08-07 16:03:20,308 [shard 0:main] init - join cluster
2025-08-07T16:03:20.308752839Z INFO  2025-08-07 16:03:20,308 [shard 0:strm] storage_service - entering STARTING mode
2025-08-07T16:03:20.332395948Z INFO  2025-08-07 16:03:20,332 [shard 0:strm] storage_service - Found group 0 with ID 48bf3021-73a6-11f0-afbf-4cd9f5c8bdd5, with leader of ID 95f6ce46-7a2d-4654-820d-f174e1aff199 and IP 10.35.64.63
2025-08-07T16:03:20.339846987Z INFO  2025-08-07 16:03:20,339 [shard 0:strm] storage_service - Will join existing cluster in raft topology operations mode
2025-08-07T16:03:20.339862577Z INFO  2025-08-07 16:03:20,339 [shard 0:strm] storage_service - Loading persisted peers into the gossiper
2025-08-07T16:03:20.339874507Z INFO  2025-08-07 16:03:20,339 [shard 0:strm] storage_service - initial_contact_nodes={10.35.64.63, 10.27.225.14}, loaded_endpoints=[], loaded_peer_features=0
2025-08-07T16:03:20.339942477Z INFO  2025-08-07 16:03:20,339 [shard 0:strm] storage_service - Performing gossip shadow round
2025-08-07T16:03:20.339949187Z INFO  2025-08-07 16:03:20,339 [shard 0:strm] gossip - Gossip shadow round started with nodes={10.35.64.63, 10.27.225.14}
2025-08-07T16:03:20.355357267Z INFO  2025-08-07 16:03:20,355 [shard 0:strm] gossip - Gossip shadow round finished with nodes_talked={10.27.225.14, 10.35.64.63}
...
2025-08-07T16:03:20.357548926Z INFO  2025-08-07 16:03:20,357 [shard 0:strm] gossip - failure_detector_loop: Started main loop
2025-08-07T16:03:20.357575616Z INFO  2025-08-07 16:03:20,357 [shard 0:strm] raft_group0 - setup_group0: joining group 0...
2025-08-07T16:03:20.357632916Z INFO  2025-08-07 16:03:20,357 [shard 0:strm] raft_group0 - server c1af46c6-4f5a-4265-916e-10058fef1d0d found no local group 0. Discovering...
2025-08-07T16:03:20.365309406Z INFO  2025-08-07 16:03:20,365 [shard 0:strm] raft_group0 - server c1af46c6-4f5a-4265-916e-10058fef1d0d found group 0 with group id 48bf3021-73a6-11f0-afbf-4cd9f5c8bdd5, leader 95f6ce46-7a2d-4654-820d-f174e1aff199
2025-08-07T16:03:20.365323576Z INFO  2025-08-07 16:03:20,365 [shard 0:strm] raft_topology - join: sending the join request to 10.35.64.63
2025-08-07T16:03:20.412128844Z INFO  2025-08-07 16:03:20,412 [shard 0:strm] raft_topology - join: request to join placed, waiting for the response from the topology coordinator
2025-08-07T16:03:20.415015544Z INFO  2025-08-07 16:03:20,414 [shard 0:strm] raft_group0 - Server c1af46c6-4f5a-4265-916e-10058fef1d0d is starting group 0 with id 48bf3021-73a6-11f0-afbf-4cd9f5c8bdd5
2025-08-07T16:03:20.416326514Z INFO  2025-08-07 16:03:20,416 [shard 0:strm] raft_group0 - Detected snapshot with index=0, id=1e2ac6f8-906a-4338-981c-5dc1858ba13a, triggering new snapshot
2025-08-07T16:03:20.416340464Z WARN  2025-08-07 16:03:20,416 [shard 0:strm] raft_group0 - Could not create new snapshot, there are no entries applied
...
2025-08-07T16:03:22.433275447Z ERROR 2025-08-07 16:03:22,433 [shard 0:main] init - Startup failed: std::runtime_error (the topology coordinator rejected request to join the cluster: request canceled because some required nodes are dead)
2025-08-07T16:03:22.495580624Z 2025-08-07 16:03:22,495 WARN exited: scylla (exit status 1; not expected)
2025-08-07T16:03:22.495598024Z 2025-08-07 16:03:22,495 WARN exited: scylla (exit status 1; not expected)

The process then restarts and tries to join the cluster again, but this time it hangs:

2025-08-07T16:03:24.879468091Z INFO  2025-08-07 16:03:24,879 [shard 0:main] init - join cluster
2025-08-07T16:03:24.879471920Z INFO  2025-08-07 16:03:24,879 [shard 0:strm] storage_service - entering STARTING mode
...
(no further related logs)

Meanwhile, the logs from europe-west1-a-0 (95f6ce46-7a2d-4654-820d-f174e1aff199, 10.35.64.63):

2025-08-07T16:03:20.369190138Z INFO  2025-08-07 16:03:20,369 [shard 0: gms] raft_topology - received request to join from host_id: c1af46c6-4f5a-4265-916e-10058fef1d0d
...
2025-08-07T16:03:20.408394557Z INFO  2025-08-07 16:03:20,408 [shard 0: gms] raft_topology - placed join request for c1af46c6-4f5a-4265-916e-10058fef1d0d
...
2025-08-07T16:03:21.407623878Z INFO  2025-08-07 16:03:21,407 [shard 0: gms] gossip - InetAddress c1af46c6-4f5a-4265-916e-10058fef1d0d/10.43.0.15 is now UP, status = UNKNOWN
2025-08-07T16:03:21.427971408Z INFO  2025-08-07 16:03:21,427 [shard 0: gms] gossip - Removed endpoint c1af46c6-4f5a-4265-916e-10058fef1d0d
2025-08-07T16:03:21.428001868Z INFO  2025-08-07 16:03:21,427 [shard 0: gms] gossip - InetAddress c1af46c6-4f5a-4265-916e-10058fef1d0d/c1af46c6-4f5a-4265-916e-10058fef1d0d is now DOWN, status = UNKNOWN
2025-08-07T16:03:21.428050228Z INFO  2025-08-07 16:03:21,427 [shard 0: gms] gossip - Finished to force remove node c1af46c6-4f5a-4265-916e-10058fef1d0d

We observed this occasionally in 2025.1.2 (least frequently) and in 2025.1.5, but it became much more prevalent in 2025.2.1: roughly 80–90% of the multi-datacenter clusters created in our tests now fail bootstrap with this error.

I'd like to figure out the cause of this scenario, so I'd appreciate any help in debugging it.
Is it expected that the node can't join the cluster after encountering this error?
Is there something we can do to prevent this?

In case more verbose logs are required to debug this, please let me know which services should have the log level raised.
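For reference, once we know which loggers matter, here's a minimal sketch of how we'd raise their levels at runtime through the node's REST API. It assumes the default API port 10000 and the /system/logger endpoint that nodetool setlogginglevel wraps; the node address is a placeholder and the logger names are just the ones appearing in the messages above:

```python
import urllib.request

NODE = "10.43.0.15"  # placeholder: substitute the address of the node to tune
LOGGERS = ["raft_topology", "raft_group0", "gossip", "storage_service"]

for name in LOGGERS:
    # POST /system/logger/{name}?level=debug -- the same endpoint that
    # nodetool setlogginglevel uses; the REST API listens on port 10000
    # by default.
    url = f"http://{NODE}:10000/system/logger/{name}?level=debug"
    req = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(req) as resp:
        print(name, resp.status)
```

Since the failing node exits and restarts, setting the levels at startup (e.g. --logger-log-level raft_topology=debug on the scylla command line) is probably more practical than the runtime call, but the sketch shows the idea.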


Logs and supplementary info (nodetool status, gossipinfo where applicable) can be found here:
europe-west1-a-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west1/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2qnap/pods/basic-vswwm-europe-west1-europe-west1-a-0/
europe-west1-b-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west1/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2qnap/pods/basic-vswwm-europe-west1-europe-west1-b-0/
europe-west1-c-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west1/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2qnap/pods/basic-vswwm-europe-west1-europe-west1-c-0/

europe-west3-a-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west3/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2a92t/pods/basic-vswwm-europe-west3-europe-west3-a-0/
europe-west3-b-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west3/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2a92t/pods/basic-vswwm-europe-west3-europe-west3-b-0/
europe-west3-c-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west3/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-2a92t/pods/basic-vswwm-europe-west3-europe-west3-c-0/

europe-west4-a-0: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-multi-datacenter-parallel/1953480005082157056/artifacts/e2e/workers/europe-west4/namespaces/e2e-test-scylladbcluster-nqjbc-c9nsb-33q49/pods/basic-vswwm-europe-west4-europe-west4-a-0/
