Fix recovery when terms are accidentally empty #3099

@lhoguin (Contributor) commented Jun 10, 2021

This is a fix for an issue that occurs when a node is shut
down (via SIGTERM) while the queues, and more specifically
the queue index, are recovering. By that point
rabbit_recovery_terms has already started, and on startup
it calls dets:open_file/2, which creates an empty
recovery.dets file. When the node is restarted, it
considers the previous shutdown clean because the recovery
file exists, even though it is empty, and the queues
therefore lose all their state.

As a result, RabbitMQ reports 0 messages in all classic
queues.
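The failure mode can be modelled outside Erlang. The Python sketch below is an illustration under stated assumptions, not RabbitMQ code: a JSON file stands in for recovery.dets, the hypothetical open_terms helper mimics the open-or-create behaviour of dets:open_file/2, and the hypothetical old_is_clean models a check that looks only at whether the file exists.

```python
import json
import os
import tempfile


def open_terms(path):
    """Open the recovery-terms store, creating an empty one if missing.

    Hypothetical stand-in for dets:open_file/2, which creates the
    file when it does not already exist.
    """
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump({}, f)
    with open(path) as f:
        return json.load(f)


def old_is_clean(path):
    # Modelled pre-fix heuristic: the file merely existing means "clean".
    return os.path.exists(path)


def recover_message_count(path, queue):
    # On a "clean" start the count is trusted from the recovery terms;
    # a queue with no entry silently falls back to 0 messages.
    if old_is_clean(path):
        return open_terms(path).get(queue, {}).get("message_count", 0)
    return None  # dirty start: the index would be rebuilt from disk instead


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "recovery.json")
    # A dirty restart begins: the terms store is opened (and created)...
    open_terms(path)
    # ...then SIGTERM arrives before any queue writes its terms.
    print(old_is_clean(path))                     # True
    print(recover_message_count(path, "orders"))  # 0
```

The empty store is indistinguishable from a clean one under a file-existence check, which is exactly how every queue ends up reporting 0 messages.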

To avoid this issue, we now treat a shutdown as dirty when
the recovery file exists BUT the queue's own state is not
found in the recovery terms.
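A minimal Python sketch of the new rule, again not RabbitMQ code: a JSON file stands in for recovery.dets, and is_clean_shutdown and the per-queue keys are hypothetical names chosen for illustration. A shutdown counts as clean only if the file exists and it also contains this queue's entry.

```python
import json
import os
import tempfile


def is_clean_shutdown(path, queue):
    """Return (clean, terms) for one queue (hypothetical helper).

    Clean only if the recovery file exists AND holds this queue's
    state; an existing but empty file is treated as a dirty shutdown.
    """
    if not os.path.exists(path):
        return False, None
    with open(path) as f:
        terms = json.load(f)
    if queue not in terms:
        # File present but our state is missing, e.g. the file was
        # created by an interrupted recovery: fall back to dirty recovery.
        return False, None
    return True, terms[queue]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "recovery.json")
    print(is_clean_shutdown(path, "orders"))  # (False, None)
    with open(path, "w") as f:
        json.dump({}, f)  # empty store left behind by an interrupted start
    print(is_clean_shutdown(path, "orders"))  # (False, None)
    with open(path, "w") as f:
        json.dump({"orders": {"message_count": 800000}}, f)
    print(is_clean_shutdown(path, "orders"))  # (True, {'message_count': 800000})
```

In this sketch, the trade-off is that a missing entry sends the queue down the slower dirty-recovery path instead of silently reporting 0 messages.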

To reliably reproduce the issue this fixes:

  • Start a node

  • Publish many messages to it (800k is more than enough)

  • Wait a little, then kill the node by pressing Ctrl+C
    twice (to force a dirty recovery on the next start)

  • Start the node again

  • While it still shows "Starting broker", wait about
    5 seconds, then send a SIGTERM (killall beam.smp) to
    shut the node down "cleanly"

  • Start the node again

  • The management UI will show 0 messages in all classic
    queues

@michaelklishin michaelklishin merged commit 24291b7 into master Jun 11, 2021
@michaelklishin michaelklishin deleted the fix-recovery-with-accidental-empty-recovery-terms branch June 11, 2021 00:24