
Cluster doesn't recover if all rabbitmq-server pods deleted from cluster #609

@sheiks

Description


Describe the bug

The RabbitMQ cluster cannot recover if someone deletes all of its pods at once using the kubectl CLI.

To Reproduce

Steps to reproduce the behavior:

  1. Create a RabbitmqCluster with 3 replicas.
  2. Once the cluster is healthy, delete all 3 pods with the kubectl CLI: kubectl delete pods rabbitmq-server-0 rabbitmq-server-1 rabbitmq-server-2
  3. Verify the RabbitMQ pod status: kubectl get pods
  4. The first pod never becomes ready and keeps restarting (the full command sequence is sketched below the list):

NAME                READY   STATUS    RESTARTS   AGE
rabbitmq-server-0   0/1     Running   5          69m
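
For reference, the whole reproduction can be scripted roughly as follows. This is a sketch only: the cluster name "rabbitmq" and the label selector are assumed defaults, not taken from the report.

# Sketch of the reproduction; cluster name and label selector are assumed defaults.
kubectl get pods -l app.kubernetes.io/name=rabbitmq   # wait until all 3 pods report 1/1 Ready
kubectl delete pods rabbitmq-server-0 rabbitmq-server-1 rabbitmq-server-2
kubectl get pods                                      # rabbitmq-server-0 stays 0/1 and keeps restarting
kubectl logs -f rabbitmq-server-0                     # shows the Mnesia wait loop reproduced below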

kubectl logs -f rabbitmq-server-0

2021-02-17 14:29:41.001 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2021-02-17 14:30:11.002 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:30:11.002 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2021-02-17 14:30:41.002 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:30:41.003 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2021-02-17 14:31:11.004 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:31:11.004 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2021-02-17 14:31:41.005 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:31:41.005 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2021-02-17 14:32:11.006 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:32:11.006 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2021-02-17 14:32:41.007 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:32:41.007 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2021-02-17 14:33:11.008 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:33:11.008 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2021-02-17 14:33:41.009 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-server-2.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-1.rabbitmq-nodes.rabbitmq-cl-op-poc','rabbit@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-cl-op-poc'],[rabbit_durable_queue]}
2021-02-17 14:33:41.009 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 1 retries left

The following values.yaml was used with https://github.com/rabbitmq/cluster-operator/tree/main/charts/rabbitmq:

labels:
  label1: foo
  label2: bar

annotations:
  annotation1: foo
  annotation2: bar

replicas: 3

imagePullSecrets:
  - name: foo

service:
  type: LoadBalancer

resources:
  requests:
    cpu: 100m
    memory: 1Gi
  limits:
    cpu: 100m
    memory: 1Gi

tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "rabbitmq"
    effect: "NoSchedule"

rabbitmq:
  additionalPlugins:
    - rabbitmq_shovel
    - rabbitmq_shovel_management
  additionalConfig: |
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
  envConfig: |
    PLUGINS_DIR=/opt/rabbitmq/plugins:/opt/rabbitmq/community-plugins
  advancedConfig: |
    [
      {ra, [
        {wal_data_dir, '/var/lib/rabbitmq/quorum-wal'}
      ]}
    ].

terminationGracePeriodSeconds: 42

skipPostDeploySteps: true

override:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: rabbitmq
              ports:
                - containerPort: 12345 # opens an additional port on the rabbitmq server container
                  name: additional-port
                  protocol: TCP
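
For completeness, a values file like this would be applied by installing the chart from a checkout of the cluster-operator repository. A minimal sketch, assuming a release name of my-rabbit and a local clone of the repository (both are illustrative, not taken from the report):

# Sketch of installing the chart with the values file above; release name and paths are assumptions.
git clone https://github.com/rabbitmq/cluster-operator.git
helm install my-rabbit ./cluster-operator/charts/rabbitmq -f values.yaml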

Expected behavior
We had seen this problem when we were using Bitnami images; the solution for it is documented here: https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#recovering-the-cluster-from-complete-shutdown

Maybe it would be good to document the same for the cluster-operator as well.
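
For reference, the manual recovery that the Bitnami document describes effectively amounts to force-booting the node that is stuck waiting for Mnesia tables so it starts without its peers, after which the other nodes can rejoin. A minimal sketch, assuming this cluster's pod names (not an official cluster-operator procedure):

# Sketch of a manual recovery; pod name taken from the report above.
kubectl exec rabbitmq-server-0 -- rabbitmqctl force_boot
# Once the current boot attempt gives up and the pod restarts, rabbitmq-server-0
# should start without waiting for its peers; the remaining pods can then come up
# and rejoin the cluster.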

Version and environment information

  • RabbitMQ: 3.8.11
  • RabbitMQ Cluster Operator: 1.1.0
  • Kubernetes: v1.17.8
  • Platform: VMware PKS

Additional context

https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#recovering-the-cluster-from-complete-shutdown
