Skip to content

health_monitor_backend: dangling reference fix#29909

Merged
bharathv merged 1 commit into
redpanda-data:devfrom
joe-redpanda:joe/hmb_dangling_fix
Mar 23, 2026
Merged

health_monitor_backend: dangling reference fix#29909
bharathv merged 1 commit into
redpanda-data:devfrom
joe-redpanda:joe/hmb_dangling_fix

Conversation

@joe-redpanda
Copy link
Copy Markdown
Contributor

@joe-redpanda joe-redpanda commented Mar 20, 2026

The inner loop within fill_aggregate_with_offline_partitions would use an iterator pointing to potentially invalid memory. Although the first few statements in the inner lambda check whether the outer iterator was stable for this duration, invalid memory was already read in iterating against the inner iterator.

This change swaps 'for' for 'while', such that outer iterator stability can be checked BEFORE incrementing the inner iterator.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Bug Fixes

  • fix dangling reference in many partitions edge case

Copilot AI review requested due to automatic review settings March 20, 2026 22:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a potential dangling-iterator access in health_monitor_backend::fill_aggregate_with_offline_partitions by changing the inner iteration strategy so iterator stability can be checked before advancing the inner iterator.

Changes:

  • Replace ssx::async_for_each_counter over assignment_set with a manual iterator loop using ssx::async_while_counter.
  • Add an explicit it.check() call inside the inner loop body prior to dereference/increment of the inner iterator.

Comment thread src/v/cluster/health_monitor_backend.cc Outdated
@piyushredpanda piyushredpanda added this to the v26.1.1-rc5 milestone Mar 20, 2026
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

CI test results

test results on build#82154
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/82154#019d0d84-3d3e-4937-bd3b-8ea487949353 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0969, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.2636, p1=0.0469, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

The inner loop within fill_aggregate_with_offline_partitions would use
an iterator pointing to potentially invalid memory. Although the first
few statements in the inner lambda check whether the outer iterator was
stable for this duration, invalid memory was already read in iterating
against the inner iterator.

This change swaps 'for' for 'while', such that outer iterator stability
can be checked BEFORE incrementing the inner iterator.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@@ -1136,12 +1132,17 @@ ss::future<> health_monitor_backend::fill_aggregate_with_offline_partitions(
++it) {
const auto& topic = it->first;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to it.check() here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same.. my understanding is it is not needed because there is no scheduling point between cond() and f().. the logic of async_while() is something like..

....
if (cond()) {
  f()
 }
...
co_await maybe_yield()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nevermind, I see the stable iterator validates when incrementing

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this is a slightly different portion of code; I also scrutinized L1144 but I agree it seems safe

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the idea here is that it.check is also protecting the inner iterators over assignment set?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@bharathv bharathv merged commit 1a6ba75 into redpanda-data:dev Mar 23, 2026
23 checks passed
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.3.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.2.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.1.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants