Some classic queues don't have consumers after node restart #330

Closed
@mkuratczyk

@acogoluegnes this may or may not be a perf-test issue. To be investigated further next week.

Given a bunch of classic queues with publishers and consumers
when one of the nodes in the cluster is restarted
some queues no longer have consumers

This was originally reported in the context of the Kubernetes Operator and a full rolling restart. However, it seems it can be reproduced locally with just a single node restart.

Steps:

$ make start-cluster PLUGINS="rabbitmq_management"

$ perf-test -x 10 -y 10 -r 10 -c 10 -A 10 -q 10 -qq -qp 'qq-%d' -qpf 0 -qpt 9 -H amqp://guest:guest@localhost:5672,amqp://guest:guest@localhost:5673,amqp://guest:guest@localhost:5674

$ perf-test -x 10 -y 10 -r 10 -c 10 -A 10 -q 10 -qp 'cq-%d' -qpf 0 -qpt 9 --auto-delete false -f persistent -H amqp://guest:guest@localhost:5672,amqp://guest:guest@localhost:5673,amqp://guest:guest@localhost:5674

$ ./sbin/rabbitmqctl -n rabbit-1 stop_app; ./sbin/rabbitmqctl -n rabbit-1 start_app

Before restarting the node, the state of the queues looks like this:
(screenshot: before)

After the restart, it looks like this:
(screenshot: after)

Notes:

  • some of the queues don't have consumers; this may or may not be a perf-test reconnection issue
  • it seems to always work well for QQs; sometimes it takes a moment to "heal" (a queue or two will take longer to get a consumer, but they do get one); meanwhile, some CQs never get their consumers back
  • the issue doesn't occur in a single-node RabbitMQ (just replace start-cluster with run-broker and all is good)
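
To check from the command line which queues ended up without consumers (rather than relying on the management UI), something like the following could be used. This is only a sketch: the helper name queues_without_consumers is made up for illustration, and it assumes rabbitmqctl list_queues prints tab-separated columns in the order requested (name, type, consumers). Any header or banner line is skipped because its third column is not a number.

```shell
# Hypothetical helper: read "name<TAB>type<TAB>consumers" lines on stdin
# and print the names of queues that currently have zero consumers.
# Lines whose third column is not a plain number (e.g. a header row) are ignored.
queues_without_consumers() {
  awk -F'\t' '$3 ~ /^[0-9]+$/ && $3 == 0 { print $1 }'
}

# Intended usage after the restart (assumes the local cluster from the steps above):
#   ./sbin/rabbitmqctl -n rabbit-1 list_queues --quiet name type consumers | queues_without_consumers
```

Running it a few times after the restart should help distinguish queues that "heal" after a moment from those that stay without consumers.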

In my tests, perf-test logs this error:

Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
  protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND
  - queue 'cq-9' in vhost '/' process is stopped by supervisor, class-id=60, method-id=20)

The original issue report also contains this:

Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
  protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND
  - failed to perform operation on queue 'perf-test-10q-rate100-5' in vhost 'perf-test-grid' due to timeout, class-id=60, method-id=20)

Originally reported on Slack: https://rabbitmq.slack.com/archives/CTMSV81HA/p1650538594525179
The Slack thread contains many additional details and potential issues. The reproduction above doesn't exactly match what was reported, but it is a simplified case of one of the issues and rules out mirroring, auto-delete, and non-durable properties as the cause: it uses durable, non-mirrored queues and still shows the same behaviour of "losing consumers".
