Add fallback pruning for stalled finalization #179
Conversation
When finalization stalls (e.g., network issues, low participation), the States and Block tables grow unbounded because retention-based pruning only triggers on finalization advancement. This adds a periodic check every 7200 slots (~8 hours) that calls prune_old_states and prune_old_blocks directly when finalization is far behind, adapted from Zeam's approach. The check is cheap (two integer comparisons per slot-0 tick) and the actual pruning reuses existing idempotent methods.
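A minimal sketch of that guard as a pure function, assuming the interval constant and lag threshold described above; the standalone function name is ours, not from the PR:

```rust
// Illustrative sketch of the fallback-pruning guard. The constant value
// (~8 hours of slots) comes from the PR description; the function is a
// stand-in for the check done in the slot-0 tick handler.
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200;

/// True when `slot` is a non-zero multiple of the interval and finalization
/// lags by more than one full interval.
fn should_fallback_prune(slot: u64, finalized_slot: u64) -> bool {
    slot > 0
        && slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0
        && slot.saturating_sub(finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
}

fn main() {
    assert!(should_fallback_prune(14_400, 0)); // stalled: lag of two intervals
    assert!(!should_fallback_prune(14_400, 13_000)); // healthy: lag within threshold
    assert!(!should_fallback_prune(7_201, 0)); // off-interval tick
    assert!(!should_fallback_prune(0, 0)); // genesis
}
```

`saturating_sub` keeps the lag computation safe if the finalized slot ever exceeds the current slot.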
🤖 Kimi Code Review

Review Summary

The PR introduces a fallback pruning mechanism for when finalization is stalled, which is a good defensive measure against unbounded storage growth. However, there are several issues that need attention:

Critical Issues
Security & Correctness Issues
Performance & Best Practices
Suggested Fixes

```rust
// In blockchain/src/store.rs, fix the race condition:
let current_finalized_slot = store.latest_finalized().slot;
if slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(current_finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    // ... rest of code
}
```
```rust
// In storage/src/store.rs, fix atomic reads:
pub fn periodic_prune(&mut self) {
    let (finalized_checkpoint, justified_checkpoint) = {
        let finalized = self.latest_finalized();
        let justified = self.latest_justified();
        (finalized, justified)
    };
    let protected_roots = [finalized_checkpoint.root, justified_checkpoint.root];
    // ... rest of code
}
```

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review
Could not run tests in this environment due to rustup permission errors.

Automated review by OpenAI Codex · custom prompt
Greptile Summary

This PR addresses a gap in PR #170's retention-based pruning: that pruning only runs when finalization advances, so a stall leaves the tables growing unbounded.

Key points:
Confidence Score: 4/5
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["on_tick — interval 0"] --> B{slot > 0?}
    B -- No --> G[Skip pruning]
    B -- Yes --> C{slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0?}
    C -- No --> G
    C -- Yes --> D{"slot - finalized_slot\n> PRUNING_FALLBACK_INTERVAL_SLOTS?"}
    D -- No\nFinalization healthy --> G
    D -- Yes\nFinalization stalled --> E["warn! log stall event"]
    E --> F["store.periodic_prune()"]
    F --> F1["protected = [finalized.root, justified.root]"]
    F1 --> F2["prune_old_states(protected)\nprune_old_blocks(protected)"]
    F2 --> F3{anything pruned?}
    F3 -- Yes --> F4["info! log pruned counts"]
    F3 -- No --> G
```
Last reviewed commit: d6c7958
🤖 Claude Code Review

Here is my review of PR #179:

Review: Add periodic fallback pruning for stalled finalization

The motivation is clear and the implementation is straightforward. A few points to consider:

Notable:
…s 3 redundant DB reads for the same checkpoint value that doesn't change between calls in the interval-0 handler.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Two tests that verify the 3-part guard in on_tick interval 0:
- on_tick_triggers_periodic_pruning_when_finalization_stalled: pruning
fires when slot is a multiple of the interval AND finalization lag
exceeds the threshold
- on_tick_skips_periodic_pruning_when_finalization_healthy: pruning does
NOT fire when finalization is recent (lag within threshold)
Also exports Table, StorageWriteBatch, and StorageReadView from the
storage crate to enable direct backend access in blockchain tests.
crates/blockchain/src/store.rs
Outdated
```rust
let finalized = store.latest_finalized();
if slot > 0
    && slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(finalized.slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    warn!(
        %slot,
        finalized_slot = finalized.slot,
        "Finalization stalled, running periodic fallback pruning"
    );
    store.periodic_prune();
```
We should put this in the storage module instead of here.
Moved the guard condition logic into periodic_prune in the storage module. It now takes slot: u64 and handles the 3-part check (slot > 0, multiple of interval, finalization lag) + the warn! log internally. The blockchain on_tick handler just calls store.periodic_prune(slot).
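Under that refactor, the call shape looks roughly like this; `Store` here is a minimal stand-in for the real storage type, and the `warn!` logging is reduced to a comment:

```rust
// Minimal stand-in for the storage-side refactor described above: the
// 3-part guard lives inside periodic_prune, so on_tick is a one-liner.
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200;

struct Store {
    finalized_slot: u64,
    prune_runs: u32, // counts fallback prunes (stand-in for the real pruning work)
}

impl Store {
    fn periodic_prune(&mut self, slot: u64) {
        // 3-part check: slot > 0, multiple of the interval, finalization lag.
        if slot == 0
            || slot % PRUNING_FALLBACK_INTERVAL_SLOTS != 0
            || slot.saturating_sub(self.finalized_slot) <= PRUNING_FALLBACK_INTERVAL_SLOTS
        {
            return;
        }
        // warn!("Finalization stalled, ...") would be logged here.
        self.prune_runs += 1; // prune_old_states / prune_old_blocks in the real code
    }
}

fn main() {
    let mut store = Store { finalized_slot: 0, prune_runs: 0 };
    store.periodic_prune(7_199); // off-interval: skipped
    store.periodic_prune(14_400); // stalled by two intervals: prunes
    assert_eq!(store.prune_runs, 1);
}
```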
In crates/storage/src/store.rs (constants and test sections coexist; periodic pruning tests updated to use the test_store_with_backend helper).
periodic_prune now takes a slot parameter and handles the 3-part trigger condition (slot > 0, multiple of interval, finalization lag) internally. The blockchain on_tick handler just calls store.periodic_prune(slot).
Check cheap arithmetic conditions (slot == 0, is_multiple_of) before reading latest_finalized from the database. This avoids a wasted DB read on 7199 out of 7200 calls.
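The saving comes from short-circuit evaluation: when the cheap arithmetic guards fail, the expensive read never runs. A rough simulation of one full interval of ticks, with the DB read reduced to a counter:

```rust
// Count how many of the first `n` slot ticks would reach the (simulated)
// latest_finalized DB read when the cheap arithmetic guards run first.
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200;

fn db_reads_in_first_n_ticks(n: u64) -> u64 {
    (1..=n)
        .filter(|slot| slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0)
        .count() as u64
}

fn main() {
    // Over one full interval, the read runs once and is skipped on the
    // other 7199 ticks, matching the figure in the comment above.
    assert_eq!(db_reads_in_first_n_ticks(PRUNING_FALLBACK_INTERVAL_SLOTS), 1);
    assert_eq!(db_reads_in_first_n_ticks(PRUNING_FALLBACK_INTERVAL_SLOTS - 1), 0);
}
```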
Fallback pruning now lives entirely in the storage module's update_checkpoints method as an else branch: when finalization doesn't advance, it checks the head slot and prunes if finalization is stalled. This keeps the consensus layer completely unaware of pruning logic. Removes periodic_prune as a standalone method, the on_tick call, and the blockchain-level pruning tests (now covered by storage tests).
…7200 slots

Remove the interval-based guard and the PRUNING_FALLBACK_INTERVAL_SLOTS constant. The else branch in update_checkpoints now unconditionally calls prune_old_states and prune_old_blocks when finalization doesn't advance. The prune methods are already no-ops when within retention limits, so this is safe and more testable.
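A sketch of that final shape, with the checkpoint type and prune calls reduced to simplified stand-ins for the real store:

```rust
// Sketch of the final design: the else branch of update_checkpoints always
// attempts fallback pruning when finalization did not advance. Safe because
// the real prune methods are no-ops within retention limits.
#[derive(Clone, Copy)]
struct Checkpoint {
    slot: u64,
}

struct Store {
    finalized: Checkpoint,
    fallback_prunes: u32, // stand-in for prune_old_states + prune_old_blocks
}

impl Store {
    fn update_checkpoints(&mut self, new_finalized: Checkpoint) {
        if new_finalized.slot > self.finalized.slot {
            // Finalization advanced: normal retention-gated pruning path.
            self.finalized = new_finalized;
        } else {
            // Finalization stalled: unconditional fallback pruning.
            self.fallback_prunes += 1;
        }
    }
}

fn main() {
    let mut store = Store { finalized: Checkpoint { slot: 100 }, fallback_prunes: 0 };
    store.update_checkpoints(Checkpoint { slot: 200 }); // advances: no fallback
    store.update_checkpoints(Checkpoint { slot: 200 }); // stalled: fallback runs
    assert_eq!(store.fallback_prunes, 1);
    assert_eq!(store.finalized.slot, 200);
}
```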
Motivation
PR #170 adds retention-based pruning for blocks and states, but it only triggers when finalization advances (`finalized.slot > old_finalized_slot` in `update_checkpoints`). If finalization stalls — due to network issues, low participation, or other consensus failures — the `States` and `BlockHeaders`/`BlockBodies`/`BlockSignatures` tables grow unbounded because the pruning condition is never met.
Description
Adds an `else` branch to `update_checkpoints` in the storage module that calls `prune_old_states` and `prune_old_blocks` on every head update when finalization doesn't advance. The prune methods are already no-ops when within retention limits (`STATES_TO_KEEP` = 900, `BLOCKS_TO_KEEP` = 1800), so this adds negligible overhead in the common case.
Adapted from Zeam's approach of running fallback pruning when finalization is behind.
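The "no-op within retention limits" behavior can be sketched like this; the `Vec` of slots stands in for the real `States` table, the limit comes from the description above, and protected-root handling is omitted:

```rust
// Sketch of the retention short-circuit: keep the newest STATES_TO_KEEP
// entries, do nothing when already within the limit. The Vec is a stand-in
// for the real States table; protected roots are not modeled here.
const STATES_TO_KEEP: usize = 900;

fn prune_old_states(slots: &mut Vec<u64>) -> usize {
    if slots.len() <= STATES_TO_KEEP {
        return 0; // within retention: cheap no-op on every healthy head update
    }
    slots.sort_unstable();
    let excess = slots.len() - STATES_TO_KEEP;
    slots.drain(..excess); // drop the oldest entries
    excess
}

fn main() {
    let mut small: Vec<u64> = (0..100).collect();
    assert_eq!(prune_old_states(&mut small), 0); // no-op case

    let mut big: Vec<u64> = (0..1_000).collect();
    assert_eq!(prune_old_states(&mut big), 100); // prunes down to the limit
    assert_eq!(big.len(), STATES_TO_KEEP);
    assert_eq!(big[0], 100); // oldest 100 slots removed
}
```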
Key design decisions
- Consensus layer (`on_tick`, `on_block`) is unaware of pruning logic
- Prune methods short-circuit when there's nothing to prune
- Protected roots are `[latest_finalized.root, latest_justified.root]`, matching the existing finalization-gated path
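The protected-root rule from the last bullet amounts to a filter over prune candidates; `Root` here is a stand-in type for the real hash root:

```rust
// Sketch: candidates matching a protected root (latest finalized or latest
// justified) are never pruned, mirroring the finalization-gated path.
type Root = [u8; 32];

fn prunable(candidates: &[Root], protected: &[Root; 2]) -> Vec<Root> {
    candidates
        .iter()
        .copied()
        .filter(|root| !protected.contains(root))
        .collect()
}

fn main() {
    let finalized_root = [1u8; 32];
    let justified_root = [2u8; 32];
    let stale_root = [3u8; 32];

    let kept = prunable(
        &[finalized_root, justified_root, stale_root],
        &[finalized_root, justified_root],
    );
    // Only the unprotected root is eligible for pruning.
    assert_eq!(kept, vec![stale_root]);
}
```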
Changes

- `crates/storage/src/store.rs`: `else` branch in `update_checkpoints`, add 2 tests
- `crates/storage/src/lib.rs`: export `StorageReadView`, `StorageWriteBatch`, `Table`

How to test