Description
Is your feature request related to a problem? Please describe.
Currently there's no way to configure RabbitMQ by specifying the seed node that all other nodes have to join. All mechanisms assume that if they can't join other nodes, they should form a new cluster so that other nodes can join them. This works well in the vast majority of cases but still causes "mis-clustering" in some environments: for example, a 3-node cluster deployment to Kubernetes may start as 2 nodes clustered together and 1 node that formed a separate cluster. While we see this issue reported occasionally, we have never received full debug logs from all nodes to understand why this happens.
This is not a Kubernetes-specific issue, however. While each peer discovery backend has its own logic and is more or less prone to this issue, it also happened recently in our CI with classic peer discovery.
In many environments, including Kubernetes, all (or at least some) node names are known upfront. For example, a StatefulSet named foo on Kubernetes will always create pods with the names foo-0, foo-1 and so on. Therefore, rather than querying the Kubernetes API for the list of endpoints behind a Service, we could just configure RabbitMQ to use foo-0 as the seed node, since it has to exist, regardless of the number of nodes in the cluster.
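For illustration, assuming a 3-replica StatefulSet named foo and a headless Service named foo-nodes in the default namespace (the Service and namespace names here are made up), the full set of node names is known before any pod is even scheduled:

rabbit@foo-0.foo-nodes.default
rabbit@foo-1.foo-nodes.default
rabbit@foo-2.foo-nodes.default

Any of the other nodes could therefore be pointed at rabbit@foo-0.foo-nodes.default without ever asking the Kubernetes API.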
Another recent report:
https://discord.com/channels/1092487794984755311/1092487853654687774/1324478177237536859
Describe the solution you'd like
There are a few options and I'd like to get some feedback on them.
OPTION 1: New Backend
Add a new peer discovery backend that simply takes a single node name; only that node is allowed to form a new cluster, while all other nodes need to join it. This will require some changes to the generic parts of the peer discovery mechanism, since it currently always falls back to forming a new cluster after the configured number of attempts. In this case, we would like all nodes apart from the seed node to just keep retrying forever; they are not allowed to form a new cluster under any circumstances.
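As a rough sketch of what such a backend's configuration could look like (the backend name seed_node and its option below are purely hypothetical at this point):

cluster_formation.peer_discovery_backend = seed_node
cluster_formation.seed_node.node = rabbit@seed.node

The node whose name equals the configured value would bootstrap the cluster; every other node would keep retrying to join it indefinitely.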
The downsides:
- completely opt-in - you need to start using the new backend to get the benefits
- confusing - users expect the k8s peer discovery backend to be the best option on Kubernetes, for example
OPTION 2: New Behaviour in Classic Config Peer Discovery
Introduce a new behaviour within the classic peer discovery backend. The desired logic is very close to what the classic backend already does. However, it currently:
- expects the local node to be on the list of configured nodes
- falls back to forming a new cluster (see above)
The new behaviour could be: if the local node is not on the list, it's not allowed to form a new cluster and has to join one of the listed nodes. With such a change, the following configuration should accomplish the desired behaviour:
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@seed.node
The downsides are:
- confusing - the k8s peer discovery backend would no longer be the best option on Kubernetes (it would not be used by the Cluster Operator)
- potentially confusing - different behaviour depending on whether the local node is on the list or not; however, I was personally surprised it didn't already work like that. To me it feels pretty intuitive that a node not mentioned on the list has to join one of the nodes that are mentioned
- not clear how the mechanism would work if the list contained multiple seed nodes but not all expected nodes of the cluster; I'd rather only allow a single seed node
We could introduce a dedicated configuration option for this behaviour, for example:
cluster_formation.classic_config.seed_node = rabbit@seed.node # or perhaps just "node"
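Put together, every node (seed or not) could then share the exact same minimal configuration (again, seed_node is only a proposed option name):

cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.seed_node = rabbit@seed.node

rabbit@seed.node would recognise itself as the seed and form the cluster; all other nodes would keep retrying to join it rather than falling back to forming their own.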
OPTION 3: Change k8s Peer Discovery
Change the way k8s peer discovery works. Currently it queries the Kubernetes API to receive the list of endpoints for a given Service. There are a lot of configuration options related to how to connect to the Kubernetes API (in case the defaults don't work; fortunately, they do in most cases), all to perform a query we already know the answer to.
As mentioned above, a StatefulSet always uses consecutive 0-based suffixes. It is not possible to have a StatefulSet that does not have the ...-0 pod. If, for any reason, other nodes start successfully but pod ...-0 can't, it's totally acceptable in my opinion that the other nodes would just keep waiting for node ...-0 to start and then join it (note: peer discovery only happens on the initial cluster deployment, so the unavailability of pod ...-0 wouldn't affect any operations once the cluster is formed initially). There's little benefit in forming a cluster without it.
So instead of querying the Kubernetes API, the plugin could just take a single parameter, cluster_formation.k8s.statefulset_name, append -0 to that value and treat the result as the seed node. If the configuration option is not present, just replace the -ID suffix of the local node's name with -0.
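Under that proposal (the statefulset_name option comes from the paragraph above and does not exist today), the Kubernetes-specific configuration could shrink to:

cluster_formation.peer_discovery_backend = k8s
# hypothetical new option; the plugin would append "-0" and treat the result as the seed node
cluster_formation.k8s.statefulset_name = foo

If the option were omitted, a node named, say, rabbit@foo-2.foo-nodes.default would derive its seed node by replacing its own -2 suffix with -0, yielding rabbit@foo-0.foo-nodes.default.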
Benefits:
- we could make this a transparent change for Kubernetes users - the k8s plugin would just start working (even) more reliably
Drawbacks:
- users might want this behaviour despite not using Kubernetes; using the k8s peer discovery plugin outside of Kubernetes is counter-intuitive
Describe alternatives you've considered
We can just do nothing and hope that one day someone provides sufficient data to understand why the k8s plugin occasionally fails in some environments. It hasn't happened in our GKE cluster even once in the last few years, so there seems to be something environment-specific going on.
Additional context
No response