Skip to content

Conversation

@eddyashton
Copy link
Member

We're logging the FAIL line in lru_shrink_to_fit a surprising amount. This is a symptom of a deeper bug - the estimated_store_cache_size is not accurately kept up-to-date with the in-memory stores, due to a bug in adjust_ranges. There is a path in adjust_ranges to handle the new range being a disjoint prefix of the previous range, and when that was hit a suffix of seqnos were discarded without being added to removed.

There's a unit test that captures a minimal repro of this - handle X was previously requesting seqno 100, ledger entry at 100 is provided, handle X is updated to request seqno 42. We drop the state for 100 as expected, but didn't previously remove the refs to decrement estimated_store_cache_size, so we had some orphaned counts in there that could inflate forever.

As a little bit of paranoia, there's then a randomised variant of the test, exposing the internal adjust_ranges function and calling it with various randomly generated arguments, confirming that the [removed, added] pair it returns (calculated by some awkward iterator logic internally) matches what we'd expect from set differences on the caller side.

While there, do a related bit of simplification: don't track an explicit any_diff bool. It should be (and now is) equivalent to !removed.empty() || !added.empty().

@eddyashton eddyashton requested a review from a team as a code owner December 4, 2025 16:03
Copilot AI review requested due to automatic review settings December 4, 2025 16:03
@eddyashton eddyashton added auto-backport Automatically backport this PR to LTS branch 6.x-todo PRs which should be backported to 6.x labels Dec 4, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug in the historical query LRU cache size calculations. The adjust_ranges method was not properly tracking removed sequence numbers when a request's range changed to a disjoint prefix, causing estimated_store_cache_size to become inflated with orphaned counts.

Key Changes:

  • Fixed adjust_ranges to collect removed seqnos before erasing them (lines 314-317)
  • Simplified any_diff tracking by deriving it from removed/added vectors instead of maintaining a separate boolean
  • Added get_estimated_store_cache_size() accessor for testing

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/node/historical_queries.h Core bug fix in adjust_ranges to track suffix removals, simplified any_diff logic, added size accessor method
src/node/test/historical_queries.cpp Added regression test for cache size estimation and randomized test for adjust_ranges correctness

@maxtropets maxtropets added the run-long-test Run Long Test job label Dec 4, 2025
@maxtropets maxtropets merged commit 4505d32 into microsoft:main Dec 4, 2025
28 checks passed
eddyashton added a commit to eddyashton/CCF that referenced this pull request Dec 5, 2025
Co-authored-by: Max <maxtropets@microsoft.com>
(cherry picked from commit 4505d32)
@eddyashton eddyashton added the backported This PR was successfully backported to LTS branch label Dec 5, 2025
eddyashton added a commit that referenced this pull request Dec 5, 2025
…ations (#7511) (#7512)

Co-authored-by: Max <maxtropets@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.x-todo PRs which should be backported to 6.x auto-backport Automatically backport this PR to LTS branch backported This PR was successfully backported to LTS branch run-long-test Run Long Test job

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants