rm_stm: fix a race during partition shutdown #24936
Conversation
/ci-repeat 5
Currently the apply fiber can continue to run (and possibly add new producers to the _producers map) while the state machine is shutting down. This can manifest in weird crashes, as the cleanup destroys _producers without deregistering the producers properly.

First manifestation

Iterator invalidation in reset_producers() as it loops through _producers with scheduling points while the state machine apply adds new producers:

```cpp
future<> rm_stm::stop() {
    .....
    co_await _gate.close();
    co_await reset_producers(); // <---- interferes with state machine apply
    _metrics.clear();
    co_await raft::persisted_stm<>::stop();
    .....
```

Second manifestation

Crashes: every producer creation registers with an intrusive list in producer_state_manager using a safe link. Now, if a new producer is registered after reset_producers, the map is destroyed in the state machine destructor without unlinking from the producer_state_manager, and the safe_link fires an assert.

This bug has been there forever from what I can tell, and was perhaps made worse by recent changes that added more scheduling points in the surrounding code.
Force-pushed from 674d61b to 873b282
/ci-repeat 5
/ci-repeat 5
looks reasonable to me, thanks for detailed explanation
/backport v24.3.x
/backport v24.2.x
/backport v24.1.x
Failed to create a backport PR to v24.1.x branch. I tried:
Note: since BOOST_ASSERT calls assert(), which is a no-op in release mode (compiled with -DNDEBUG), the crash stacks can be even weirder: they may appear in producer_state_manager as it loops through the intrusive list after the producers were cleaned up incorrectly.
https://redpandadata.atlassian.net/browse/CORE-8883
https://redpandadata.atlassian.net/browse/CORE-8843
https://redpandadata.atlassian.net/browse/CORE-8841
https://redpandadata.atlassian.net/browse/CORE-8845
Backports Required
Release Notes
Bug Fixes