Some classic queues don't have consumers after node restart

@acogoluegnes this may or may not be a perf-test issue. To be investigated further next week.

Given a bunch of classic queues with publishers and consumers
when one of the nodes in the cluster is restarted
some queues no longer have consumers

This was originally reported in the context of a Kubernetes Operator and a full rolling restart. However, it seems it can be reproduced locally with just a single node restart.

Steps:
```
$ make start-cluster PLUGINS="rabbitmq_management"

$ perf-test -x 10 -y 10 -r 10 -c 10 -A 10 -q 10 -qq -qp 'qq-%d' -qpf 0 -qpt 9 -H amqp://guest:guest@localhost:5672,amqp://guest:guest@localhost:5673,amqp://guest:guest@localhost:5674

$ perf-test -x 10 -y 10 -r 10 -c 10 -A 10 -q 10 -qp 'cq-%d' -qpf 0 -qpt 9 --auto-delete false -f persistent -H amqp://guest:guest@localhost:5672,amqp://guest:guest@localhost:5673,amqp://guest:guest@localhost:5674

$ ./sbin/rabbitmqctl -n rabbit-1 stop_app; ./sbin/rabbitmqctl -n rabbit-1 start_app
```

Before restarting the node, the state of the queues looks like this:
<img width="950" alt="before" src="https://user-images.githubusercontent.com/9566114/164544610-4ae0943c-15d0-4224-a9cd-13cee2035117.png">

After the restart, it looks like this:
<img width="946" alt="after" src="https://user-images.githubusercontent.com/9566114/164544634-11a3c1d5-ab4a-4705-8872-3ea946353838.png">

Notes:
* some of the queues don't have consumers; this may or may not be a perf-test reconnection issue
* it seems to always work well for QQs; sometimes it takes a moment to "heal" (a queue or two will take longer to get a consumer, but they do get one); meanwhile, some CQs never get their consumers back
* the issue doesn't occur in a single-node RabbitMQ (just replace `start-cluster` with `run-broker` and all is good)
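
To list the consumer-less queues without going through the management UI, something like the following could be used. This is only a sketch: `queues_without_consumers` is a hypothetical helper name, and it assumes the default tab-separated `name<TAB>consumers` output of `rabbitmqctl list_queues name consumers` with the headers suppressed.

```shell
# Hypothetical helper: print the names of queues that report zero consumers.
# Reads tab-separated "name<TAB>consumers" lines on stdin.
queues_without_consumers() {
  # Compare as a string so lines without a second field are not matched.
  awk -F '\t' '$2 == "0" { print $1 }'
}

# Assumed usage against the local cluster from the steps above:
# ./sbin/rabbitmqctl -n rabbit-1 list_queues --quiet --no-table-headers name consumers \
#   | queues_without_consumers
```

Running it a few times after the restart should show whether a queue is just slow to "heal" or never gets a consumer back.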

In my tests, perf-test logs this error (class-id=60, method-id=20 is `basic.consume`):
```
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
  protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND
  - queue 'cq-9' in vhost '/' process is stopped by supervisor, class-id=60, method-id=20)
```

The original issue report also contains this:
```
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
  protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND
  - failed to perform operation on queue 'perf-test-10q-rate100-5' in vhost 'perf-test-grid' due to timeout, class-id=60, method-id=20)
```
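
Both errors name the queue and vhost in the same `queue '...' in vhost '...'` form, so the affected queues can be pulled out of a perf-test log roughly like this. A minimal sketch, assuming the log is available on stdin and that other errors use the same wording; `affected_queues` is a hypothetical helper name.

```shell
# Hypothetical helper: list the distinct queue/vhost pairs mentioned in
# channel errors, based on the wording seen in the two errors above.
affected_queues() {
  grep -o "queue '[^']*' in vhost '[^']*'" | LC_ALL=C sort -u
}

# Assumed usage (perf-test.log is a placeholder file name):
# affected_queues < perf-test.log
```

Comparing this list with the queues that end up without consumers would help tell whether every "lost" consumer corresponds to a failed `basic.consume`.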

Originally reported on Slack: https://rabbitmq.slack.com/archives/CTMSV81HA/p1650538594525179
There are many additional details and potential issues. The steps above don't exactly match what was reported, but they are a simplified case of one of the issues and rule out mirroring, auto-delete and non-durable properties as the cause: they use durable, non-mirrored queues and still show the same behaviour of "losing consumers".
