rm_stm: fix a race during partition shutdown #24936

Merged
merged 2 commits into redpanda-data:dev on Jan 27, 2025

Conversation

bharathv
Contributor

@bharathv bharathv commented Jan 25, 2025

Currently the apply fiber can continue to run (and possibly add new producers to the _producers map) while the state machine is shutting down. This can manifest in strange crashes, as the cleanup destroys _producers without deregistering the producers properly.

First manifestation

Iterator invalidation in reset_producers(): it loops through _producers with scheduling points while the state machine's apply adds new producers.

future<> rm_stm::stop() {
.....
    co_await _gate.close();
    co_await reset_producers();  <---- interferes with state machine apply 
    _metrics.clear();
    co_await raft::persisted_stm<>::stop();
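
To make the first hazard concrete, here is a minimal standalone sketch (my own illustration, not Redpanda code): a plain std::unordered_map stands in for _producers, and the mid-loop insertion models what the apply fiber can do while reset_producers() is suspended at a co_await.

#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> producers;
    for (int i = 0; i < 4; ++i) {
        producers.emplace(i, "producer");
    }

    bool late_registration_done = false;
    for (auto it = producers.begin(); it != producers.end(); ++it) {
        if (!late_registration_done) {
            // Models apply registering a producer while reset_producers()
            // is suspended at a co_await: if the emplace triggers a rehash,
            // `it` is invalidated and the next ++it is undefined behavior
            // (a crash, or code that silently appears to work).
            producers.emplace(100, "late producer");
            late_registration_done = true;
        }
    }
    return 0;
}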

Second manifestation

Crashes: every producer creation registers with an intrusive list in producer_state_manager using a safe_link hook. Now, if a new producer is registered after reset_producers(), the map is destroyed in the state machine's destructor without unlinking from producer_state_manager, and the safe_link fires an assert.

Note: since BOOST_ASSERT calls assert(), which is a no-op in release mode (with -DNDEBUG), the crash stacks can be even stranger: they can be in producer_state_manager as it tries to loop through the intrusive list (after the producers have been cleaned up incorrectly).
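
The safe_link behavior is easy to reproduce in isolation. A self-contained sketch (my own example, not the producer_state_manager code) showing the assert firing when a still-linked node is destroyed:

#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

// safe_link mode: the hook asserts it is unlinked when destroyed.
using safe_hook = bi::list_member_hook<bi::link_mode<bi::safe_link>>;

struct producer {
    safe_hook hook;
};

using producer_list
  = bi::list<producer, bi::member_hook<producer, safe_hook, &producer::hook>>;

int main() {
    producer_list registered; // stands in for producer_state_manager's list
    {
        producer p;
        registered.push_back(p); // register the producer
        // p is destroyed here while still linked: in a debug build the
        // hook's destructor asserts; with -DNDEBUG the assert is a no-op
        // and the list is left holding a dangling pointer, so a later
        // traversal (as in the stacks described above) walks freed memory.
    }
    return 0;
}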

This bug has been there forever from what I can tell; it has perhaps gotten worse with recent changes that added more scheduling points.
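
Given that description, the shape of the fix is to quiesce the apply fiber before tearing down producer state. A hedged sketch of one such ordering, using only the calls from the excerpt above (the merged commits may differ in detail):

future<> rm_stm::stop() {
.....
    co_await _gate.close();
    // Stop the base state machine first, assuming this joins the apply
    // fiber; after this point apply can no longer insert into _producers.
    co_await raft::persisted_stm<>::stop();
    co_await reset_producers(); // safe: _producers is now stable
    _metrics.clear();
.....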

https://redpandadata.atlassian.net/browse/CORE-8883
https://redpandadata.atlassian.net/browse/CORE-8843
https://redpandadata.atlassian.net/browse/CORE-8841
https://redpandadata.atlassian.net/browse/CORE-8845

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes a crash during partition shutdown. This can happen during partition moves (cross-core or cross-broker) or at broker shutdown.

@bharathv
Contributor Author

/ci-repeat 5
debug
skip-units
dt-repeat=50
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements
tests/rptest/tests/topic_creation_test.py::TopicRecreateTest.test_topic_recreation_while_producing
tests/rptest/tests/partition_force_reconfiguration_test.py::PartitionForceReconfigurationTest.test_node_wise_recovery

@bharathv bharathv changed the title wip: fix crashes rm_stm: fix a race at partition shutdown Jan 25, 2025
@bharathv bharathv changed the title rm_stm: fix a race at partition shutdown rm_stm: fix a race during partition shutdown Jan 25, 2025
@bharathv bharathv marked this pull request as ready for review January 25, 2025 05:25
@bharathv
Contributor Author

/ci-repeat 5
skip-units
dt-repeat=100
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements
tests/rptest/tests/topic_creation_test.py::TopicRecreateTest.test_topic_recreation_while_producing
tests/rptest/tests/partition_force_reconfiguration_test.py::PartitionForceReconfigurationTest.test_node_wise_recovery

@mmaslankaprv
Member

/ci-repeat 5
skip-units
dt-repeat=30
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements
tests/rptest/tests/topic_creation_test.py::TopicRecreateTest.test_topic_recreation_while_producing
tests/rptest/tests/partition_force_reconfiguration_test.py::PartitionForceReconfigurationTest.test_node_wise_recovery

Contributor

@bashtanov bashtanov left a comment

Looks reasonable to me, thanks for the detailed explanation.

@mmaslankaprv mmaslankaprv merged commit 1d1aefd into redpanda-data:dev Jan 27, 2025
21 checks passed
@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

/backport v24.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24936-v24.1.x-280 remotes/upstream/v24.1.x
git cherry-pick -x fb57ccd229 873b28214f

Workflow run logs.
