rm_stm: fix a race during partition shutdown #24936

Merged
merged 2 commits into redpanda-data:dev on Jan 27, 2025

Conversation

bharathv
Contributor

@bharathv bharathv commented Jan 25, 2025

Currently the apply fiber can continue to run (and possibly add new producers to the _producers map) while the state machine is shutting down. This can manifest in strange crashes, as the cleanup destroys _producers without deregistering the producers properly.

First manifestation

Iterator invalidation in reset_producers(): it loops through _producers with scheduling points while the state machine's apply adds new producers.

future<> rm_stm::stop() {
.....
    co_await _gate.close();
    co_await reset_producers();  <---- interferes with state machine apply 
    _metrics.clear();
    co_await raft::persisted_stm<>::stop();
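
To make the first hazard concrete, here is a minimal standalone sketch (my own illustration, not Redpanda code): a plain std::unordered_map stands in for _producers, and the mid-loop insertion models what the apply fiber can do while reset_producers() is suspended at a co_await.

#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> producers;
    for (int i = 0; i < 4; ++i) {
        producers.emplace(i, "producer");
    }

    bool late_registration_done = false;
    for (auto it = producers.begin(); it != producers.end(); ++it) {
        if (!late_registration_done) {
            // Models apply registering a producer while reset_producers()
            // is suspended at a co_await: if the emplace triggers a rehash,
            // `it` is invalidated and the next ++it is undefined behavior
            // (a crash, or code that silently appears to work).
            producers.emplace(100, "late producer");
            late_registration_done = true;
        }
    }
    return 0;
}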

Second manifestation

Crashes: every producer creation registers with an intrusive list in producer_state_manager using a safe_link hook. Now, if a new producer is registered after reset_producers(), the map is destroyed in the state machine's destructor without unlinking from producer_state_manager, and the safe_link fires an assert.

Note: since BOOST_ASSERT calls assert(), which is a no-op in release mode (with -DNDEBUG), the crash stacks can be even stranger: they can be in producer_state_manager as it tries to loop through the intrusive list (after the producers have been cleaned up incorrectly).
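
The safe_link behavior is easy to reproduce in isolation. A self-contained sketch (my own example, not the producer_state_manager code) showing the assert firing when a still-linked node is destroyed:

#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

// safe_link mode: the hook asserts it is unlinked when destroyed.
using safe_hook = bi::list_member_hook<bi::link_mode<bi::safe_link>>;

struct producer {
    safe_hook hook;
};

using producer_list
  = bi::list<producer, bi::member_hook<producer, safe_hook, &producer::hook>>;

int main() {
    producer_list registered; // stands in for producer_state_manager's list
    {
        producer p;
        registered.push_back(p); // register the producer
        // p is destroyed here while still linked: in a debug build the
        // hook's destructor asserts; with -DNDEBUG the assert is a no-op
        // and the list is left holding a dangling pointer, so a later
        // traversal (as in the stacks described above) walks freed memory.
    }
    return 0;
}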

This bug has been there forever from what I can tell; it has perhaps gotten worse with recent changes that added more scheduling points.
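
Given that description, the shape of the fix is to quiesce the apply fiber before tearing down producer state. A hedged sketch of one such ordering, using only the calls from the excerpt above (the merged commits may differ in detail):

future<> rm_stm::stop() {
.....
    co_await _gate.close();
    // Stop the base state machine first, assuming this joins the apply
    // fiber; after this point apply can no longer insert into _producers.
    co_await raft::persisted_stm<>::stop();
    co_await reset_producers(); // safe: _producers is now stable
    _metrics.clear();
.....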

https://redpandadata.atlassian.net/browse/CORE-8883
https://redpandadata.atlassian.net/browse/CORE-8843
https://redpandadata.atlassian.net/browse/CORE-8841
https://redpandadata.atlassian.net/browse/CORE-8845

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes a crash during partition shutdown. This can happen during partition moves (cross-core or cross-broker) or at broker shutdown.

@bharathv
Contributor Author

/ci-repeat 5
debug
skip-units
dt-repeat=50
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements
tests/rptest/tests/topic_creation_test.py::TopicRecreateTest.test_topic_recreation_while_producing
tests/rptest/tests/partition_force_reconfiguration_test.py::PartitionForceReconfigurationTest.test_node_wise_recovery

@bharathv bharathv changed the title wip: fix crashes rm_stm: fix a race at partition shutdown Jan 25, 2025
@bharathv bharathv changed the title rm_stm: fix a race at partition shutdown rm_stm: fix a race during partition shutdown Jan 25, 2025
@bharathv bharathv marked this pull request as ready for review January 25, 2025 05:25
@bharathv
Contributor Author

/ci-repeat 5
skip-units
dt-repeat=100
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements
tests/rptest/tests/topic_creation_test.py::TopicRecreateTest.test_topic_recreation_while_producing
tests/rptest/tests/partition_force_reconfiguration_test.py::PartitionForceReconfigurationTest.test_node_wise_recovery

@mmaslankaprv
Member

/ci-repeat 5
skip-units
dt-repeat=30
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements
tests/rptest/tests/topic_creation_test.py::TopicRecreateTest.test_topic_recreation_while_producing
tests/rptest/tests/partition_force_reconfiguration_test.py::PartitionForceReconfigurationTest.test_node_wise_recovery

Contributor

@bashtanov bashtanov left a comment

Looks reasonable to me, thanks for the detailed explanation.

@mmaslankaprv mmaslankaprv merged commit 1d1aefd into redpanda-data:dev Jan 27, 2025
21 checks passed
@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

/backport v24.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24936-v24.1.x-280 remotes/upstream/v24.1.x
git cherry-pick -x fb57ccd229 873b28214f

Workflow run logs.
