
Add fallback pruning for stalled finalization#179

Open
pablodeymo wants to merge 14 commits into main from
periodic-fallback-pruning

Conversation

@pablodeymo
Collaborator

@pablodeymo pablodeymo commented Mar 3, 2026

Motivation

PR #170 adds retention-based pruning for blocks and states, but it only triggers when finalization
advances (finalized.slot > old_finalized_slot in update_checkpoints). If finalization stalls —
due to network issues, low participation, or other consensus failures — the States and
BlockHeaders/BlockBodies/BlockSignatures tables grow unbounded because the pruning condition
is never met.

Description

Adds an else branch to update_checkpoints in the storage module that calls prune_old_states
and prune_old_blocks on every head update when finalization doesn't advance. The prune methods
are already no-ops when within retention limits (STATES_TO_KEEP=900, BLOCKS_TO_KEEP=1800), so
this adds negligible overhead in the common case.

Adapted from Zeam's approach of running fallback pruning
when finalization is behind.
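As a rough sketch of the described control flow (an in-memory model, not the crate's actual storage types — only update_checkpoints, prune_old_states, and the STATES_TO_KEEP constant come from the PR; the struct and its fields are illustrative):

```rust
use std::collections::BTreeMap;

const STATES_TO_KEEP: usize = 900;

// In-memory stand-in for the States table: slot -> state payload (elided).
struct Store {
    finalized_slot: u64,
    states: BTreeMap<u64, ()>,
}

impl Store {
    // No-op while the table is within the retention limit.
    fn prune_old_states(&mut self) -> usize {
        let excess = self.states.len().saturating_sub(STATES_TO_KEEP);
        let doomed: Vec<u64> = self.states.keys().copied().take(excess).collect();
        for slot in &doomed {
            self.states.remove(slot);
        }
        doomed.len()
    }

    fn update_checkpoints(&mut self, new_finalized_slot: u64) {
        if new_finalized_slot > self.finalized_slot {
            // Existing path: finalization advanced.
            self.finalized_slot = new_finalized_slot;
            self.prune_old_states();
        } else {
            // Fallback path added by this PR: finalization stalled, prune anyway.
            self.prune_old_states();
        }
    }
}
```

Because prune_old_states short-circuits within the retention window, the else branch costs only an integer comparison per head update in the healthy case.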

Key design decisions

  • Pruning lives entirely in the storage module — the consensus layer (on_tick, on_block)
    is unaware of pruning logic
  • Runs on every head update instead of a periodic interval — simpler, more testable, and the
    prune methods short-circuit when there's nothing to prune
  • Protected roots: [latest_finalized.root, latest_justified.root], matching the existing
    finalization-gated path
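The protected-roots rule can be illustrated with a minimal sketch (the Root alias and the prune_keeping helper are illustrative, not the crate's API):

```rust
use std::collections::BTreeMap;

type Root = [u8; 32];

// Prune the oldest entries beyond `keep`, but never a protected root
// (e.g. the latest finalized and justified checkpoints).
fn prune_keeping(
    blocks: &mut BTreeMap<u64, Root>, // slot -> block root
    keep: usize,
    protected: &[Root],
) -> usize {
    let excess = blocks.len().saturating_sub(keep);
    let doomed: Vec<u64> = blocks
        .iter()
        .filter(|&(_, root)| !protected.contains(root))
        .map(|(slot, _)| *slot)
        .take(excess)
        .collect();
    for slot in &doomed {
        blocks.remove(slot);
    }
    doomed.len()
}
```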

Changes

File                          Change
crates/storage/src/store.rs   Add fallback pruning else branch in update_checkpoints; add 2 tests
crates/storage/src/lib.rs     Export StorageReadView, StorageWriteBatch, Table

How to test

cargo test -p ethlambda-storage -- fallback_pruning
cargo test --workspace --release
cargo clippy --workspace --all-targets -- -D warnings
cargo fmt --all -- --check

MegaRedHand and others added 4 commits March 2, 2026 11:17
  When finalization stalls (e.g., network issues, low participation), the
  States and Block tables grow unbounded because retention-based pruning
  only triggers on finalization advancement. This adds a periodic check
  every 7200 slots (~8 hours) that calls prune_old_states and
  prune_old_blocks directly when finalization is far behind, adapted from
  Zeam's approach. The check is cheap (two integer comparisons per slot-0
  tick) and the actual pruning reuses existing idempotent methods.
@github-actions

github-actions bot commented Mar 3, 2026

🤖 Kimi Code Review

Review Summary

The PR introduces a fallback pruning mechanism for when finalization is stalled, which is a good defensive measure against unbounded storage growth. However, there are several issues that need attention:

Critical Issues

  1. Race condition in on_tick (blockchain/src/store.rs:318-329): The check slot.saturating_sub(store.latest_finalized().slot) > PRUNING_FALLBACK_INTERVAL_SLOTS uses the finalized slot from the store, but this could change between the check and the actual pruning operation. This could lead to pruning when finalization has actually advanced.

  2. Missing synchronization in periodic_prune (storage/src/store.rs:430-440): The method reads latest_finalized() and latest_justified() separately, which could lead to inconsistent state if these values change between reads. Should read both atomically.

Security & Correctness Issues

  1. Incorrect pruning logic (storage/src/store.rs:430): The protected roots only include latest_finalized().root and latest_justified().root, but should also include all their ancestor blocks/states within the retention window. The current implementation might prune necessary historical data.

  2. Test setup issue (storage/src/store.rs:1458-1467): In the test periodic_prune_removes_old_states_and_blocks, the finalized checkpoint is set to slot 0 but the justified checkpoint is set to slot 1. This is an invalid state since justified should always be ≥ finalized.

Performance & Best Practices

  1. Non-redundant guard (blockchain/src/store.rs:319): Note that slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS) is true at slot 0 (zero is a multiple of every value), so the explicit slot > 0 check is load-bearing, not redundant.

  2. Logging level inconsistency: The warning log in on_tick uses warn! but the actual pruning in periodic_prune uses info!. For consistency and operator awareness, both should probably be at the same level.

  3. Magic number exposure: PRUNING_FALLBACK_INTERVAL_SLOTS is exported from storage but only used in blockchain. Consider keeping it private to storage and exposing a higher-level configuration method instead.

Suggested Fixes

// In blockchain/src/store.rs, fix the race condition:
let current_finalized_slot = store.latest_finalized().slot;
if slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(current_finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    // ... rest of code
}

// In storage/src/store.rs, fix atomic reads:
pub fn periodic_prune(&mut self) {
    let (finalized_checkpoint, justified_checkpoint) = {
        let finalized = self.latest_finalized();
        let justified = self.latest_justified();
        (finalized, justified)
    };
    let protected_roots = [finalized_checkpoint.root, justified_checkpoint.root];
    // ... rest of code
}

Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

github-actions bot commented Mar 3, 2026

🤖 Codex Code Review

  1. High: periodic_prune can create dangling LiveChain entries by deleting non-finalized headers
  • periodic_prune only calls prune_old_states + prune_old_blocks (store.rs:430).
  • prune_old_blocks deletes from BlockHeaders/BlockBodies/BlockSignatures but not LiveChain (store.rs:654).
  • During finalization stalls, LiveChain is explicitly still the source of fork-choice candidates, so this can leave roots selectable by fork choice but missing headers/state metadata.
  • There are multiple expect(...) paths that assume header availability for key roots, e.g. store.rs:1101, store.rs:1108, blockchain/store.rs:652, blockchain/store.rs:656, blockchain/store.rs:668.
  • Suggested fix: when pruning blocks in stalled-finality mode, either prune corresponding LiveChain entries too, or avoid pruning any root still present in LiveChain.
  2. Medium: off-by-one in fallback trigger delays first prune by one full interval
  • Condition uses slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS (blockchain/store.rs:321).
  • With finalized slot 0, first prune runs at slot 14400 (16h), not 7200 (8h) as documented.
  • If intended behavior is “prune every 7200 slots once lag reaches 7200”, this should be >=.
  3. Medium (performance): full-table scan/sort on tick path
  • Each fallback run does full BlockHeaders scans + sort for both states and blocks (store.rs:569, store.rs:620).
  • Running this synchronously inside on_tick interval 0 (blockchain/store.rs:317) can cause latency spikes under large DBs.
  • Consider incremental/index-based pruning or bounded batch deletion per tick.
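The off-by-one in point 2 can be checked directly (the helper below re-implements the quoted trigger condition for illustration; it is not code from the PR):

```rust
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200;

// The quoted trigger, parameterized on strict (`>`) vs inclusive (`>=`) lag.
fn fires(slot: u64, finalized_slot: u64, inclusive: bool) -> bool {
    let lag = slot.saturating_sub(finalized_slot);
    slot > 0
        && slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0
        && if inclusive {
            lag >= PRUNING_FALLBACK_INTERVAL_SLOTS
        } else {
            lag > PRUNING_FALLBACK_INTERVAL_SLOTS
        }
}

// First slot at which the fallback fires with the finalized slot stuck at 0.
fn first_firing_slot(inclusive: bool) -> u64 {
    (1u64..)
        .map(|k| k * PRUNING_FALLBACK_INTERVAL_SLOTS)
        .find(|&s| fires(s, 0, inclusive))
        .unwrap()
}
```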

Could not run tests in this environment due to rustup permission errors (/home/runner/.rustup/tmp not writable), so findings are from static review.
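Point 3's "bounded batch deletion per tick" could look roughly like this (an in-memory sketch; MAX_DELETES_PER_TICK is an assumed tuning knob, not something in the PR):

```rust
use std::collections::BTreeMap;

const BLOCKS_TO_KEEP: usize = 1800;
const MAX_DELETES_PER_TICK: usize = 64; // assumed per-call cap

// Delete at most MAX_DELETES_PER_TICK of the oldest excess entries, so
// each tick does a small, predictable amount of work. BTreeMap iterates
// keys in ascending slot order, so no full-table sort is needed.
fn prune_bounded(blocks: &mut BTreeMap<u64, Vec<u8>>) -> usize {
    let excess = blocks.len().saturating_sub(BLOCKS_TO_KEEP);
    let batch = excess.min(MAX_DELETES_PER_TICK);
    let doomed: Vec<u64> = blocks.keys().copied().take(batch).collect();
    for slot in &doomed {
        blocks.remove(slot);
    }
    doomed.len()
}
```

Repeated calls converge on the retention limit without any single tick paying for the whole backlog.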


Automated review by OpenAI Codex · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR addresses a gap in PR #170's retention-based pruning: because prune_old_states / prune_old_blocks were only called when finalized.slot advanced, a prolonged consensus stall (network partition, low participation) allowed the States and BlockHeaders/BlockBodies/BlockSignatures tables to grow without bound. The fix wires a periodic fallback check into the existing on_tick interval-0 path and delegates to a new Store::periodic_prune() method that reuses the already-idempotent pruning helpers, protecting both the current finalized and justified roots.

Key points:

  • The three-condition guard (slot > 0, multiple of 7200, stall gap > 7200) is conservative by design: when finalization is healthy the check is a no-op (two comparisons + one metadata read per slot).
  • periodic_prune deliberately matches update_checkpoints's protected-roots set [finalized.root, justified.root], so no live checkpoint can be accidentally removed.
  • Two new unit tests cover the pruning and no-op paths at the Store level; the on_tick trigger conditions are straightforward.
  • One test comment contains misleading phrasing about block insertion counts that should be corrected for clarity.

Confidence Score: 4/5

  • Safe to merge; the periodic pruning implementation is conservative, reuses idempotent helpers, and the guard conditions prevent false positives when finalization is healthy.
  • The PR correctly implements fallback pruning for stalled finalization with conservative guard conditions. The three-part check ensures the code is a no-op when finalization is healthy. Protected-roots handling mirrors the existing update_checkpoints logic, so no valid checkpoint can be accidentally removed. The only finding is a misleading test comment that should be clarified for maintainability.
  • crates/storage/src/store.rs — test comment clarification

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["on_tick — interval 0"] --> B{slot > 0?}
    B -- No --> G[Skip pruning]
    B -- Yes --> C{slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0?}
    C -- No --> G
    C -- Yes --> D{"slot - finalized_slot\n> PRUNING_FALLBACK_INTERVAL_SLOTS?"}
    D -- No\nFinalization healthy --> G
    D -- Yes\nFinalization stalled --> E["warn! log stall event"]
    E --> F["store.periodic_prune()"]
    F --> F1["protected = [finalized.root, justified.root]"]
    F1 --> F2["prune_old_states(protected)\nprune_old_blocks(protected)"]
    F2 --> F3{anything pruned?}
    F3 -- Yes --> F4["info! log pruned counts"]
    F3 -- No --> G

Last reviewed commit: d6c7958

Contributor

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 1 comment


@github-actions

github-actions bot commented Mar 3, 2026

🤖 Claude Code Review

Here is my review of PR #179:


Review: Add periodic fallback pruning for stalled finalization

The motivation is clear and the implementation is straightforward. A few points to consider:


Notable: periodic_prune is narrower than update_checkpoints

periodic_prune (store.rs:430) only calls prune_old_states + prune_old_blocks. The update_checkpoints path also calls prune_gossip_signatures, prune_attestation_data_by_root, and prune_aggregated_payload_table for both aggregated tables.

GossipSignatures are deleted post-aggregation (per delete_gossip_signatures at blockchain/src/store.rs:206), so they don't accumulate across slots under normal stall conditions. However, if aggregation fails for a given validator (e.g., insufficient co-signers) the signature stays in the table indefinitely during a stall — prune_gossip_signatures(finalized_slot) cannot help because the finalized slot doesn't advance.

This is an existing pre-stall risk being widened by this PR. Whether it's in-scope here is a judgement call, but it's worth documenting as a known gap.


Minor: latest_finalized() is read 3 times in the same tick

In blockchain/src/store.rs interval-0 handler, latest_finalized() is called:

  • Line 321: slot.saturating_sub(store.latest_finalized().slot) — condition check
  • Line 326: finalized_slot = store.latest_finalized().slot — inside warn!
  • Inside periodic_prune() at store.rs:431: self.latest_finalized().root

That's 3 DB reads for the same checkpoint value that doesn't change between calls. The simplest fix is to read the checkpoint once before the if:

let finalized = store.latest_finalized();
if slot > 0
    && slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(finalized.slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    warn!(%slot, finalized_slot = finalized.slot, "Finalization stalled, ...");
    store.periodic_prune();
}

periodic_prune would still do its own reads internally, but eliminating the duplicate in the caller is a clean win.


Minor: No test for the trigger condition

The two new tests exercise periodic_prune() directly. The 3-part guard in blockchain/src/store.rs (slot > 0, multiple-of check, lag check) has no test coverage. A small on_tick test that verifies:

  • pruning fires when slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0 and slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS
  • pruning does NOT fire when finalization is healthy (lag ≤ threshold)

...would close that gap and act as a regression guard if the condition logic is ever refactored.
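A sketch of such a regression test (should_run_fallback_prune is an assumed extraction of the 3-part guard for testability, not a function in the PR):

```rust
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200;

// The 3-part guard, extracted so both the firing and non-firing
// branches are easy to exercise in isolation.
fn should_run_fallback_prune(slot: u64, finalized_slot: u64) -> bool {
    slot > 0
        && slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0
        && slot.saturating_sub(finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
}
```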


Positive notes

  • The trigger condition is logically correct. saturating_sub safely handles any edge case where finalized_slot >= slot.
  • is_multiple_of is stable since Rust 1.86.0; the project targets 1.92.0. ✓
  • The correct ordering (prune_old_states before prune_old_blocks) is preserved — states depend on block headers for slot lookup.
  • The fallback interval constant (7200 slots ≈ 8 hours) is well-chosen and consistent with Zeam's approach.
  • The condition truly is a no-op when finalization is healthy: two integer comparisons per slot-0 tick.
  • Test math for periodic_prune_removes_old_states_and_blocks is correct: 5 excess entries minus 2 protected (finalized + justified) = 3 deleted → STATES_TO_KEEP + 2 remaining.
  • warn! log level is appropriate for a stalled finalization event.

Automated review by Claude (Anthropic) · custom prompt

pablodeymo and others added 3 commits March 3, 2026 12:52
…s 3 redundant DB reads for the same checkpoint value that doesn't change between calls in the interval-0 handler.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Two tests that verify the 3-part guard in on_tick interval 0:
  - on_tick_triggers_periodic_pruning_when_finalization_stalled: pruning
    fires when slot is a multiple of the interval AND finalization lag
    exceeds the threshold
  - on_tick_skips_periodic_pruning_when_finalization_healthy: pruning does
    NOT fire when finalization is recent (lag within threshold)

  Also exports Table, StorageWriteBatch, and StorageReadView from the
  storage crate to enable direct backend access in blockchain tests.
Base automatically changed from prune-states to main March 3, 2026 16:16
Comment on lines +319 to +329
let finalized = store.latest_finalized();
if slot > 0
    && slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(finalized.slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    warn!(
        %slot,
        finalized_slot = finalized.slot,
        "Finalization stalled, running periodic fallback pruning"
    );
    store.periodic_prune();
Collaborator

The reason will be displayed to describe this comment to others. Learn more.

We should put this in the storage module instead of here.

Collaborator Author

The reason will be displayed to describe this comment to others. Learn more.

Moved the guard condition logic into periodic_prune in the storage module. It now takes slot: u64 and handles the 3-part check (slot > 0, multiple of interval, finalization lag) + the warn! log internally. The blockchain on_tick handler just calls store.periodic_prune(slot).

in crates/storage/src/store.rs (constants and test sections coexist,
periodic pruning tests updated to use test_store_with_backend helper)
  periodic_prune now takes a slot parameter and handles the 3-part trigger
  condition (slot > 0, multiple of interval, finalization lag) internally.
  The blockchain on_tick handler just calls store.periodic_prune(slot).
  Check cheap arithmetic conditions (slot == 0, is_multiple_of) before
  reading latest_finalized from the database. This avoids a wasted DB
  read on 7199 out of 7200 calls.
  Fallback pruning now lives entirely in the storage module's
  update_checkpoints method as an else branch: when finalization doesn't
  advance, it checks the head slot and prunes if finalization is stalled.
  This keeps the consensus layer completely unaware of pruning logic.

  Removes periodic_prune as a standalone method, the on_tick call, and
  the blockchain-level pruning tests (now covered by storage tests).
…7200 slots

  Remove the interval-based guard and PRUNING_FALLBACK_INTERVAL_SLOTS constant.
  The else branch in update_checkpoints now unconditionally calls prune_old_states
  and prune_old_blocks when finalization doesn't advance. The prune methods are
  already no-ops when within retention limits, so this is safe and more testable.
@pablodeymo pablodeymo changed the title Add periodic fallback pruning for stalled finalization Add fallback pruning for stalled finalization Mar 4, 2026
