Description
@acogoluegnes this may or may not be a perf-test issue. To be investigated further next week.
Given a bunch of classic queues with publishers and consumers
when one of the nodes in the cluster is restarted
some queues no longer have consumers
This was originally reported in the context of a Kubernetes Operator and a full rolling restart. However, it seems it can be reproduced locally with just a single node restart.
Steps:
$ make start-cluster PLUGINS="rabbitmq_management"
$ perf-test -x 10 -y 10 -r 10 -c 10 -A 10 -q 10 -qq -qp 'qq-%d' -qpf 0 -qpt 9 -H amqp://guest:guest@localhost:5672,amqp://guest:guest@localhost:5673,amqp://guest:guest@localhost:5674
$ perf-test -x 10 -y 10 -r 10 -c 10 -A 10 -q 10 -qp 'cq-%d' -qpf 0 -qpt 9 --auto-delete false -f persistent -H amqp://guest:guest@localhost:5672,amqp://guest:guest@localhost:5673,amqp://guest:guest@localhost:5674
$ ./sbin/rabbitmqctl -n rabbit-1 stop_app; ./sbin/rabbitmqctl -n rabbit-1 start_app
Before restarting the node, the state of the queues looks like this:
After the restart, it looks like this:
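The same before/after state can also be checked from the command line. A minimal sketch, assuming the default tab-separated output of `rabbitmqctl list_queues`; the helper name `queues_without_consumers` is made up for illustration:

```shell
# Print the names of queues that currently have zero consumers.
# Input (stdin): tab-separated "name<TAB>consumers" lines.
# Lines that do not have exactly two fields, and header lines, are skipped.
queues_without_consumers() {
  awk -F'\t' 'NF == 2 && $2 == "0" { print $1 }'
}

# Possible usage against the local cluster from the steps above (not run here):
#   ./sbin/rabbitmqctl -n rabbit-1 list_queues name consumers --quiet --no-table-headers \
#     | queues_without_consumers
```

Running it before and after the `stop_app`/`start_app` step should make it easy to diff which queues lost their consumers.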
Notes:
- some of the queues don't have consumers; this may or may not be a perf-test reconnection issue
- it seems to always work well for QQs; sometimes it takes a moment to "heal" (a queue or two will take longer to get a consumer, but they do get one); meanwhile, some CQs just never get a consumer back
- the issue doesn't occur in a single-node RabbitMQ (just replace start-cluster with run-broker and all is good)
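To tell the "takes a moment to heal" case apart from queues that never get a consumer back, one option is to poll for a while after the restart. This is only a sketch: `wait_for_consumers` is a hypothetical helper, the timeout value is arbitrary, and it assumes the listing command prints tab-separated `name<TAB>consumers` lines.

```shell
# wait_for_consumers TIMEOUT_SECONDS CMD [ARGS...]
# Runs CMD once per second until no queue reports 0 consumers (returns 0),
# or until TIMEOUT_SECONDS have passed (prints the remaining queues, returns 1).
wait_for_consumers() {
  timeout=$1
  shift
  elapsed=0
  while :; do
    # Collect queues whose consumer count is exactly 0.
    missing=$("$@" | awk -F'\t' 'NF == 2 && $2 == "0" { print $1 }')
    if [ -z "$missing" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      break
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  printf '%s\n' "$missing"
  return 1
}

# Possible usage (not run here; 60 seconds is an arbitrary choice):
#   wait_for_consumers 60 ./sbin/rabbitmqctl -n rabbit-1 list_queues name consumers --quiet --no-table-headers
```

Queues still printed after the timeout would be the ones that did not recover on their own.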
In my tests, perf-test logs this error:
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND
- queue 'cq-9' in vhost '/' process is stopped by supervisor, class-id=60, method-id=20)
The original issue report also contains this:
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND
- failed to perform operation on queue 'perf-test-10q-rate100-5' in vhost 'perf-test-grid' due to timeout, class-id=60, method-id=20)
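Both errors name the queue in the same `queue '...' in vhost` form, so the affected queues can be pulled out of the perf-test output and compared with the queues that ended up without consumers. A rough sketch; `affected_queues` and the log file name are made up for illustration:

```shell
# Extract the unique queue names mentioned in channel errors of the form
# "... queue '<name>' in vhost ..." read from stdin.
affected_queues() {
  sed -n "s/.*queue '\([^']*\)' in vhost.*/\1/p" | sort -u
}

# Possible usage, assuming perf-test output was saved to a file (hypothetical name):
#   affected_queues < perf-test.log
```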
Originally reported on Slack: https://rabbitmq.slack.com/archives/CTMSV81HA/p1650538594525179
There are many additional details and potential issues. The above doesn't match exactly what was reported but is a simplified case of one of the issues that rules out mirroring/auto-delete/non-durable properties as the cause. The above uses durable, non-mirrored queues and still has the same behaviour of "losing consumers".