
perf(l1): optimize snap sync insertion and healing write paths#6159

Open
pablodeymo wants to merge 9 commits into main from
perf/snap-sync-insertion-optimization

Conversation


@pablodeymo pablodeymo commented Feb 9, 2026

Motivation

During snap sync, the node downloads the entire Ethereum state (accounts, storage slots, bytecodes) and rebuilds the state trie locally. This process involves writing tens of millions of trie nodes to RocksDB. Profiling revealed several bottlenecks in the write path that caused unnecessary memory allocations, redundant iterations, and thread scheduling overhead. These optimizations target the hot paths to reduce wall-clock time for both initial state insertion and the subsequent healing phase.


Bottleneck Summary

| # | Location | Issue | Impact |
|---|----------|-------|--------|
| 1 | BackendTrieDB::put_batch | Creates a single-element Vec per key (~20k allocs per flush) | HIGH |
| 2 | insert_accounts (rocksdb) | Two full iterations over temp RocksDB (code hashes + trie building) | HIGH |
| 3 | put_batch_no_alloc default | The "no_alloc" path actually allocates: encodes all nodes, collects into a Vec, then hits #1 | HIGH |
| 4 | Healing batch writes | BTreeMap::insert overwrites real node data with empty markers; block_on causes a double thread hop | MEDIUM |
| 5 | Healing error handling | .expect() panics instead of returning errors; join_all() swallows task failures | MEDIUM |
| 6 | trie_from_sorted_accounts_wrap | Hardcoded 12 threads regardless of machine | LOW |

Changes Implemented

Phase 0: Instrumentation

Added tracing::debug! timing to key code paths for ongoing performance monitoring. No correctness changes.

Step 0.1 — Time BackendTrieDB::put_batch and put_batch_no_alloc

  • File: crates/storage/trie.rs
  • Logs total items, total time, and commit time per call.

Step 0.2 — Time flush_nodes_to_write

  • File: crates/common/trie/trie_sorted.rs
  • Logs node count and elapsed time per flush.

Step 0.3 — Time healing batch writes

  • Files: crates/networking/p2p/sync/healing/state.rs, crates/networking/p2p/sync/healing/storage.rs
  • Logs encoding time vs DB write time inside spawn_blocking.

Step 0.4 — Time insert_accounts

  • File: crates/networking/p2p/sync/snap_sync.rs
  • Logs wall time for the trie build pass.

Phase 1: High-Impact Fixes

Step 1.1 — Fix BackendTrieDB::put_batch per-key allocation

  • Files modified: crates/storage/trie.rs
  • Problem: Each key called tx.put_batch(table, vec![(key, value)]), allocating a new Vec for every single key-value pair. With ~20,000 nodes per flush, this meant ~20,000 unnecessary heap allocations per flush cycle.
  • Fix: Changed to call tx.put(table, &key, &value) directly, which calls batch.put_cf on the RocksDB WriteBatch without any intermediate allocation. The StorageWriteBatch trait already had a put method, and RocksDBWriteTx already overrides it to call self.batch.put_cf directly.
  • Expected impact: Eliminates ~20k Vec allocations per flush. With multiple flushes per sync, this removes millions of allocations over a full sync.
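The shape of this change can be sketched with toy types (a stand-in WriteBatch rather than the actual ethrex trait and RocksDB transaction): the old path wrapped every pair in a throwaway single-element Vec, while the new path writes each pair directly.

```rust
// Toy WriteBatch standing in for the RocksDB write batch.
#[derive(Default)]
struct WriteBatch {
    ops: Vec<(Vec<u8>, Vec<u8>)>,
}

impl WriteBatch {
    // mirrors the direct batch.put_cf call: no wrapper allocation
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.ops.push((key.to_vec(), value.to_vec()));
    }
}

// Before: one heap-allocated Vec per key-value pair (~20k per flush)
fn put_batch_old(batch: &mut WriteBatch, pairs: &[(Vec<u8>, Vec<u8>)]) {
    for (key, value) in pairs {
        let single = vec![(key.clone(), value.clone())]; // throwaway Vec
        for (k, v) in single {
            batch.put(&k, &v);
        }
    }
}

// After: write each pair directly, no intermediate Vec
fn put_batch_new(batch: &mut WriteBatch, pairs: &[(Vec<u8>, Vec<u8>)]) {
    for (key, value) in pairs {
        batch.put(key, value);
    }
}
```

Both versions produce the same batch contents; only the allocation pattern differs.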

Step 1.2 — Override put_batch_no_alloc on BackendTrieDB

  • Files modified: crates/storage/trie.rs
  • Problem: The default implementation in the TrieDB trait (crates/common/trie/db.rs) encodes all nodes via .map(|node| (path.clone(), node.encode_to_vec())).collect(), building a full Vec<(Nibbles, Vec<u8>)> in memory, then passes it to put_batch which previously allocated another Vec per key.
  • Fix: Added an override on BackendTrieDB that opens a single write transaction, iterates over the &[(Nibbles, Node)] slice, encodes each node and writes it directly via tx.put() in one pass, then commits once. No intermediate collection.
  • Expected impact: Eliminates the full intermediate Vec allocation (which at 20k nodes * ~100 bytes = ~2MB per flush) and removes the encode-then-collect-then-write pipeline in favor of encode-and-write-immediately.
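A minimal sketch of the override versus the default trait impl, again with toy types. In the real code the "sink" is a single RocksDB write transaction committed once at the end; here a Vec stands in for it.

```rust
// Toy node type standing in for the trie Node.
struct Node(u64);

impl Node {
    fn encode_to_vec(&self) -> Vec<u8> {
        self.0.to_be_bytes().to_vec()
    }
}

// Default impl shape: encode everything, collect a full intermediate Vec,
// then hand it to put_batch (~2MB extra per 20k-node flush).
fn put_batch_default(nodes: &[(Vec<u8>, Node)], tx: &mut Vec<(Vec<u8>, Vec<u8>)>) {
    let encoded: Vec<(Vec<u8>, Vec<u8>)> = nodes
        .iter()
        .map(|(path, node)| (path.clone(), node.encode_to_vec()))
        .collect();
    tx.extend(encoded);
}

// Override shape: encode and write each node in one pass; commit once after.
fn put_batch_no_alloc(nodes: &[(Vec<u8>, Node)], tx: &mut Vec<(Vec<u8>, Vec<u8>)>) {
    for (path, node) in nodes {
        tx.push((path.clone(), node.encode_to_vec())); // tx.put() in the real code
    }
}
```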

Step 1.3 — Merge two-pass iteration in insert_accounts

  • Files modified: crates/networking/p2p/sync/snap_sync.rs
  • Problem: The insert_accounts function iterated over all accounts in the temporary RocksDB twice: once to collect code hashes, and again to build the trie. Both passes decoded AccountState from RLP. For mainnet with ~250M+ accounts, this doubled the I/O and CPU cost.
  • Fix: Removed the first pass entirely. Code hashes are now added directly into the CodeHashCollector's internal HashSet during the .inspect() callback of the trie build pass, which already decodes AccountState. The HashSet deduplicates, so memory is bounded by unique contract accounts (~5M on mainnet = ~160MB). After trie_from_sorted_accounts_wrap returns, remaining code hashes are flushed via flush_if_needed().
  • Trade-off: Code hash flushing now happens after trie building rather than interleaved (can't call async flush_if_needed() inside the synchronous iterator). This slightly delays bytecode download start but eliminates a full scan of the temp DB (~250M+ reads).
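The merged pass can be illustrated in isolation (Account and EMPTY_HASH are stand-ins, and .count() stands in for the trie build): code hashes are deduplicated into a HashSet inside the same .inspect() callback that the build pass already runs.

```rust
use std::collections::HashSet;

const EMPTY_HASH: u64 = 0; // stand-in for EMPTY_KECCACK_HASH

struct Account {
    code_hash: u64,
}

fn build_trie_single_pass<I: Iterator<Item = Account>>(accounts: I) -> (usize, HashSet<u64>) {
    let mut code_hashes = HashSet::new();
    let accounts_processed = accounts
        .inspect(|account| {
            // piggyback on the decode the trie build performs anyway
            if account.code_hash != EMPTY_HASH {
                code_hashes.insert(account.code_hash); // dedup bounds memory
            }
        })
        .count(); // stand-in for trie_from_sorted_accounts_wrap
    (accounts_processed, code_hashes)
}
```

Duplicate code hashes (many accounts sharing one contract) collapse into a single set entry, which is why memory stays bounded by unique contracts rather than total accounts.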

Phase 2: Medium-Impact Fixes

Step 2.1 — Remove block_on bridging in storage healing

  • Files modified: crates/storage/store.rs, crates/networking/p2p/sync/healing/storage.rs
  • Problem: Storage healing writes followed this path: spawn_blocking closure → spawned_rt::tasks::block_on → async write_storage_trie_nodes_batch → tokio::task::spawn_blocking (internally). This created a double thread hop: the outer spawn_blocking moves work to a blocking thread, then block_on parks that thread waiting for the async function, which itself spawns another blocking task. Two context switches for no benefit.
  • Fix: Added a synchronous method write_storage_trie_nodes_batch_sync to Store that shares the same inner logic (extracted to write_storage_trie_nodes_inner) but runs entirely synchronously. The healing code now calls this sync version directly from its spawn_blocking closure.
  • Expected impact: Eliminates one thread hop per healing batch write. With thousands of batches during healing, this removes thousands of unnecessary thread context switches.
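The refactor shape, with stand-in bodies and error types (the function names follow the description above; the real versions take trie-node updates and return StoreError): the shared inner function does the actual writes, and the sync wrapper is what the spawn_blocking closure now calls directly, with no block_on and no second spawn_blocking.

```rust
// Shared inner logic: in the real code this opens a write transaction,
// issues txn.put/txn.delete per node, and commits once.
fn write_storage_trie_nodes_inner(updates: &[(Vec<u8>, Vec<u8>)]) -> Result<usize, String> {
    Ok(updates.len())
}

// Sync entry point: runs entirely on the caller's (already blocking) thread.
fn write_storage_trie_nodes_batch_sync(updates: &[(Vec<u8>, Vec<u8>)]) -> Result<usize, String> {
    write_storage_trie_nodes_inner(updates)
}
```

The async version keeps the same inner call but wraps it in spawn_blocking for async callers, so both paths share one write implementation.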

Step 2.2 — Use entry API to fix healing data loss bug

  • Files modified: crates/networking/p2p/sync/healing/state.rs, crates/networking/p2p/sync/healing/storage.rs
  • Problem: In the healing batch writes, insert(path.slice(0, i), vec![]) was used to mark parent nodes for deletion. However, BTreeMap::insert always overwrites. If a real node at path P was processed before a longer path P' (where P is an ancestor prefix of P'), the parent marker loop would overwrite the real node data at P with an empty vec![], silently deleting a valid node. This would cause extra healing cycles to re-download those nodes.
  • Fix: Changed to entry(path.slice(0, i)).or_insert(vec![]), which only inserts the empty placeholder if no entry exists yet. This preserves any real node data that was already stored for that path. In storage healing, replaced the intermediate Vec of tuples with a BTreeMap to enable the same deduplication.
  • Impact: Correctness fix — prevents silent data loss during healing.
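The bug and the fix can be reproduced in isolation: marking parent paths with an empty "delete" placeholder must not clobber real node data already buffered for the same path.

```rust
use std::collections::BTreeMap;

// Buggy variant: insert() overwrites unconditionally.
fn mark_parents_overwrite(map: &mut BTreeMap<Vec<u8>, Vec<u8>>, path: &[u8]) {
    for i in 0..path.len() {
        map.insert(path[..i].to_vec(), vec![]); // BUG: wipes existing node data
    }
}

// Fixed variant: entry().or_insert() only fills empty slots.
fn mark_parents_entry(map: &mut BTreeMap<Vec<u8>, Vec<u8>>, path: &[u8]) {
    for i in 0..path.len() {
        map.entry(path[..i].to_vec()).or_insert(vec![]); // keeps existing data
    }
}
```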

Step 2.3 — Proper error propagation in healing background tasks

  • Files modified: crates/networking/p2p/sync/healing/mod.rs, crates/networking/p2p/sync/healing/state.rs, crates/networking/p2p/sync/healing/storage.rs
  • Problem (a): The spawn_blocking closures used .expect() for DB errors, which panics instead of returning a recoverable error. A transient DB failure would crash the entire sync process.
  • Problem (b): join_all() was used to drain remaining tasks at loop exit. If a task panicked (from .expect()), join_all() propagates the panic rather than returning a Result. No way to handle the error gracefully.
  • Problem (c): State healing and storage healing had inconsistent patterns for checking JoinSet results — state.rs used .expect("...").? while storage.rs used pattern matching.
  • Fix:
    • Closures now return Result<(), SyncError> with ? instead of .expect(). Both StoreError and TrieError convert to SyncError via existing From impls.
    • Extracted shared helpers in healing/mod.rs: wait_for_pending_task() (waits for one in-flight task before spawning the next) and drain_pending_tasks() (drains all remaining tasks at loop exit). Both propagate task errors and JoinError as Result.
    • Added if !to_write.is_empty() guard to skip spawning no-op tasks when the buffer is empty (happens when is_done/is_stale triggers a flush with nothing to write).
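The error-propagation semantics of drain_pending_tasks can be modeled with plain Results (TaskOutcome is an illustrative alias: the outer Result stands in for tokio's JoinError from a panic or cancellation, the inner one for the task's own Result<(), SyncError>). Every outcome is checked, whereas join_all would re-raise a panic instead of returning a Result.

```rust
// Outer Result models JoinError; inner Result models the task's own error.
type TaskOutcome = Result<Result<(), String>, String>;

fn drain_pending_tasks(outcomes: Vec<TaskOutcome>) -> Result<(), String> {
    for joined in outcomes {
        joined??; // propagate JoinError first, then the task's own error
    }
    Ok(())
}
```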

Phase 3: Lower-Impact Fixes

Step 3.1 — Use available_parallelism in trie_from_sorted_accounts_wrap

  • Files modified: crates/common/trie/trie_sorted.rs
  • Problem: ThreadPool::new(12, s) hardcoded 12 threads for the trie building thread pool, regardless of the machine's actual CPU count.
  • Fix: Changed to std::thread::available_parallelism().map(|n| n.get()).unwrap_or(8). On machines with fewer cores (e.g., 4-core CI runners), this avoids oversubscription. On machines with more cores (e.g., 32-core servers), this takes full advantage of available parallelism.
  • Expected impact: Better resource utilization across different hardware.
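As a standalone helper, the sizing logic described above looks like this (in the change itself the expression sits inline at the ThreadPool::new call site):

```rust
use std::thread;

fn trie_build_thread_count() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(8) // conservative fallback when the query fails
}
```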

Files Modified

| File | Changes |
|------|---------|
| crates/storage/trie.rs | Fixed put_batch per-key allocation; added put_batch_no_alloc override; added timing |
| crates/storage/store.rs | Added write_storage_trie_nodes_batch_sync; extracted shared write_storage_trie_nodes_inner |
| crates/common/trie/trie_sorted.rs | Added flush timing; dynamic thread count via available_parallelism |
| crates/networking/p2p/sync/snap_sync.rs | Merged two-pass iteration; code hashes collected into HashSet directly; added timing |
| crates/networking/p2p/sync/healing/mod.rs | Added wait_for_pending_task and drain_pending_tasks shared helpers |
| crates/networking/p2p/sync/healing/state.rs | entry API fix; Result return from closures; drain_pending_tasks; empty batch guard; timing |
| crates/networking/p2p/sync/healing/storage.rs | entry API fix; sync DB write; Result return from closures; drain_pending_tasks; empty batch guard; timing |

How to Test

Automated Tests (all passing)

  • cargo test -p ethrex-trie — 55/55 pass
  • cargo test -p ethrex-p2p --features rocksdb — 38/38 pass
  • cargo test -p ethrex-test snap_server — 12/12 pass
  • cargo clippy -p ethrex-trie -p ethrex-storage -p ethrex-p2p --features rocksdb — 0 warnings

Recommended Manual Validation

To fully validate, the changes require a live snap sync run:

  1. Run snap sync against mainnet or a testnet (e.g., Holesky)
  2. Compare state root at the pivot block — must match the expected root
  3. Compare healing completion — healing_done = true with no extra cycles
  4. Compare timing from instrumentation logs (before vs after each phase)
  5. Monitor memory usage — should decrease noticeably due to reduced allocations

Risk Assessment

| Change | Risk | Mitigation |
|--------|------|------------|
| put_batch → put | Very Low | Same underlying batch.put_cf call, just without the Vec wrapper |
| put_batch_no_alloc override | Low | Same encode + write logic, just no intermediate collection |
| Two-pass merge | Low | Code hashes still collected into HashSet, flushed after trie build |
| Sync DB write | Very Low | Same logic, just without async/blocking bridge overhead |
| BTreeMap entry API | Very Low | Correctness fix — strictly safer than previous insert behavior |
| Result propagation in closures | Very Low | Same error conditions, now returned instead of panicking |
| drain_pending_tasks | Very Low | Replaces join_all() with explicit error checking per task |
| Dynamic thread count | Very Low | Standard library API; falls back to 8 on error |

- Fix BackendTrieDB::put_batch to call tx.put() directly instead of
  tx.put_batch(vec![...]) per key, eliminating ~20k Vec allocations per flush
- Override put_batch_no_alloc on BackendTrieDB to encode and write nodes
  in a single pass without intermediate Vec collection
- Merge two-pass iteration in insert_accounts into a single pass, collecting
  code hashes during the trie-building .inspect() callback
- Add write_storage_trie_nodes_batch_sync to Store to eliminate the
  spawn_blocking -> block_on -> spawn_blocking double thread hop in
  storage healing
- Pipeline healing writes by allowing 2 tasks in flight (state and storage)
  to overlap encoding with DB commit
- Use BTreeMap entry API in both state and storage healing to prevent
  empty placeholders from overwriting real node data
- Replace hardcoded 12-thread pool with std::thread::available_parallelism
- Add tracing::debug timing instrumentation to key write paths
@pablodeymo pablodeymo requested a review from a team as a code owner February 9, 2026 15:52
Copilot AI review requested due to automatic review settings February 9, 2026 15:52

github-actions bot commented Feb 9, 2026

🤖 Kimi Code Review

Review Summary

This PR contains several performance optimizations and logging improvements. While most changes are reasonable, there are a few issues that need attention.

Critical Issues

1. Race Condition in State Healing (state.rs:286-287)

The change from single-task to dual-task execution introduces a potential race condition. The comment about avoiding out-of-order deletes is being ignored. When multiple tasks run concurrently, there's no guarantee that trie node deletions and insertions will be applied in the correct order, which could corrupt the state trie.

Recommendation: Revert to single-task execution or implement proper ordering guarantees.

2. Storage Healing Race Condition (storage.rs:215-216)

Same issue as above - allowing 2 concurrent tasks could lead to storage trie corruption.

Performance Issues

3. Inefficient Encoding in put_batch_no_alloc (trie.rs:134-144)

The put_batch_no_alloc method is allocating memory for each encoded node individually, defeating the purpose of "no_alloc". It's also creating a new transaction per call, which is inefficient.

Recommendation:

// Use a pre-allocated buffer or implement a proper zero-copy approach
// Consider batching multiple nodes into a single transaction

Code Quality Issues

4. Redundant Work in Storage Healing (storage.rs:220-230)

The code is rebuilding the BTreeMap for each account, but the input nodes is already a Vec. This adds unnecessary overhead.

Recommendation:

// Skip the BTreeMap creation if nodes are already sorted
let account_nodes: Vec<_> = nodes.into_iter().collect();

5. Magic Numbers

  • SIZE_TO_WRITE_DB constant is used but not visible in the diff
  • Hard-coded values like 100_000 (storage.rs:212) should be constants

6. Error Handling Inconsistency

  • flush_nodes_to_write uses put_batch_no_alloc but other places use put_batch
  • Some functions use expect() while others use proper error handling

Minor Issues

7. Tracing Level

Consider using trace! instead of debug! for performance metrics in production builds.

8. Code Duplication

The encoding logic in state.rs and storage.rs is nearly identical and could be extracted into a shared function.

Positive Changes

Thread count based on available parallelism - Good use of system resources
Synchronous storage write method - Avoids double thread hop
Performance logging - Useful for debugging performance issues
Single-pass account processing - More efficient than the previous two-pass approach


Automated review by Kimi (Moonshot AI) · custom prompt


github-actions bot commented Feb 9, 2026

🤖 Codex Code Review

Findings

  • High: Possible out‑of‑order deletes reintroduced by allowing two DB tasks in flight in storage healing. The previous comment explicitly avoided this; now db_joinset permits 2 concurrent batches, and storage writes treat empty node data as delete. If two batches touch the same address_hash/path, an older task can delete nodes written by a newer one. Either prove disjointness per batch (e.g., partition by account hash), reintroduce a single in‑flight task, or add sequencing per account. crates/networking/p2p/sync/healing/storage.rs:216 crates/storage/store.rs:1158
  • High: Memory regression in snap sync: insert_accounts now collects all non‑empty code hashes into a Vec and flushes after the trie build. On mainnet this can be millions of hashes, defeating the previous streaming flush_if_needed() and risking large RAM spikes or OOM. Restore streaming or periodically flush while iterating. crates/networking/p2p/sync/snap_sync.rs:961 crates/networking/p2p/sync/snap_sync.rs:990

Notes / Suggestions

  • Medium: If state healing truly requires ordered DB writes, consider whether the same out‑of‑order risk applies there too (parent placeholders vs real nodes). If batches overlap keyspace, a 2‑task window can cause surprising results. crates/networking/p2p/sync/healing/state.rs:287
  • Low: BackendTrieDB::put_batch now calls tx.put per item, which is fine but may reduce batching efficiency if put_batch was optimized in the backend. Consider a bulk API or at least group by table to limit overhead. crates/storage/trie.rs:105

If you want, I can scan for overlaps in healing batch key ranges to confirm whether the concurrency change is safe.


Automated review by OpenAI Codex · custom prompt


github-actions bot commented Feb 9, 2026

🤖 Claude Code Review


PR Review: Optimize snap sync insertion and healing write paths

Summary

This PR addresses performance bottlenecks in snap sync by optimizing hot paths in trie node writes and healing operations. The optimizations are well-researched, clearly documented, and demonstrate good understanding of the codebase. Overall, this is high-quality work with thoughtful performance improvements.


Critical Issues

1. Storage healing: Potential data race with concurrent writes ⚠️

File: crates/networking/p2p/sync/healing/storage.rs:215-217

while db_joinset.len() >= 2 {
    db_joinset.join_next().await;
}

Issue: The PR description states "each batch's deletes and inserts are self-contained within its WriteBatch" as justification for allowing 2 concurrent tasks. However, I need clarification on whether different subtrees can share parent nodes at the boundary.

Concern: If batch N processes subtree A and batch N+1 processes adjacent subtree B, and they share parent nodes near the root, the delete-by-empty-value pattern could race:

  • Batch N: marks parent P as empty (delete)
  • Batch N+1: writes parent P with real data
  • If these commit in the wrong order, the delete could overwrite the valid data

Question: How does the healing algorithm guarantee that parent node paths written by different batches don't overlap? If they do overlap, this could cause trie corruption.

Recommendation: Add a comment explaining the safety guarantee, or if unsure, revert to single-threaded writes (change >= 2 back to !is_empty()).


2. Missing error handling in storage healing

File: crates/networking/p2p/sync/healing/storage.rs:216

db_joinset.join_next().await;

Issue: Unlike state healing (line 290 which uses .expect("joinset is not empty")?), this doesn't handle the Result from join_next.

Fix:

db_joinset
    .join_next()
    .await
    .expect("joinset is not empty")?;

Correctness Issues

3. BTreeMap entry API: Good catch, but verify delete semantics

Files: crates/networking/p2p/sync/healing/state.rs:299, storage.rs:227

encoded_to_write.entry(path.slice(0, i)).or_insert(vec![]);

Good: This correctly prevents overwriting valid node data with empty placeholders.

Question: The comment says "mark parent nodes for deletion" - but how does an empty vec![] cause deletion? Looking at write_storage_trie_nodes_inner (store.rs:1155):

if node_data.is_empty() {
    txn.delete(STORAGE_TRIE_NODES, key.as_ref())?;
}

Potential issue: If a parent node legitimately doesn't exist in the DB yet (new branch), deleting it is a no-op. But if it exists with stale data, this correctly removes it. However, the entry().or_insert() pattern means if we later insert real data for that path, it will be preserved. This seems correct, but deserves a comment explaining the delete-by-empty pattern and why or_insert is critical.


Performance & Code Quality

4. put_batch optimization: Excellent

File: crates/storage/trie.rs:113

tx.put(table, &prefixed_key, &value)

The change from put_batch(vec![(key, value)]) to put() is correct. The default trait implementation in StorageWriteBatch (api/mod.rs:77-79) shows that put() calls put_batch() anyway, so calling put_batch with a single-element Vec was redundant.

Note: This relies on RocksDBWriteTx having an override for put() that calls batch.put_cf directly (backend/rocksdb.rs:320-327), which it does. ✅


5. put_batch_no_alloc: Significant improvement

File: crates/storage/trie.rs:129-151

The custom implementation correctly:

  • Opens a single transaction
  • Encodes nodes inline without collecting to Vec
  • Writes immediately per node
  • Single commit at the end

This eliminates the intermediate Vec allocation from the default trait impl (db.rs:17-24). Good optimization.


6. Two-pass merge: Verify no behavioral change

File: crates/networking/p2p/sync/snap_sync.rs:961-992

Before: Code hashes collected → flushed → trie built
After: Trie built while collecting code hashes → hashes flushed

Concern: The PR description says "Code hash flushing now happens after trie building rather than interleaved." This delays bytecode downloads. Have you verified that:

  1. This doesn't cause downstream operations to wait on missing bytecodes?
  2. The code hash collector buffer is large enough to hold all ~250M+ hashes before flush?

Looking at line 978 (flush_if_needed().await? was called per account), now it's only called once after the entire trie is built. If the collector has a fixed-size buffer, this could OOM.

Recommendation: Verify the code hash collector capacity or keep periodic flushing:

if code_hashes.len() >= 10000 {
    for hash in code_hashes.drain(..) {
        code_hash_collector.add(hash);
        code_hash_collector.flush_if_needed().await?;
    }
}

7. Thread pool sizing: Good improvement

File: crates/common/trie/trie_sorted.rs:361-363

let thread_count = std::thread::available_parallelism()
    .map(|n| n.get())
    .unwrap_or(8);

This is the right approach. Fallback to 8 is reasonable.


8. Sync DB write: Clean refactor

File: crates/storage/store.rs:1141-1173

Extracting the shared logic into write_storage_trie_nodes_inner and providing both async and sync versions is a clean solution. This avoids the spawn_blocking → block_on → spawn_blocking double-hop.


Minor Issues

9. Import organization

File: crates/storage/trie.rs:8

use std::time::Instant;

Minor: This import is only used for debug logging. Consider gating it behind a feature flag or using #[cfg(debug_assertions)] if the tracing statements are debug-only.


10. Timing instrumentation verbosity

All the timing logs are at debug! level, which is appropriate. However, consider:

  • Using tracing::instrument for cleaner span-based timing
  • Or using trace! instead of debug! for this level of detail

This is purely stylistic and not a blocking issue.


11. Missing documentation

File: crates/storage/trie.rs:129

The new put_batch_no_alloc override lacks a doc comment explaining why it exists and when it's used vs the default implementation.

Suggested addition:

/// Optimized implementation that avoids intermediate Vec allocation.
/// Encodes and writes nodes directly in a single pass.
fn put_batch_no_alloc(&self, key_values: &[(Nibbles, Node)]) -> Result<(), TrieError> {

Testing Concerns

12. Insufficient automated test coverage

The PR description states:

cargo test -p ethrex-trie — 55/55 pass

However, these unit tests likely don't cover:

  • The concurrent write scenario (2 tasks in flight)
  • The two-pass merge with delayed code hash flushing
  • The entry API fix preventing overwrites

Recommendation: The PR correctly identifies this needs manual validation via live snap sync. Consider adding integration tests that:

  1. Simulate concurrent healing writes to overlapping subtrees
  2. Verify trie root correctness after healing with 2 concurrent tasks
  3. Test code hash collection with very large account sets

Security Considerations

13. Atomic writes are preserved

All optimizations maintain the atomic write guarantee of RocksDB's WriteBatch. Each batch commits as a single transaction, which is critical for blockchain state consistency.


14. No obvious attack vectors introduced

The optimizations don't introduce new external inputs or change validation logic, so attack surface remains unchanged.


Summary & Recommendation

Strengths:

  • Well-researched performance optimizations targeting real bottlenecks
  • Clear documentation and rationale
  • Maintains atomicity guarantees
  • Good code organization and refactoring

Must-fix before merge:

  1. ⚠️ Critical: Clarify/verify the safety of concurrent writes in storage healing (#1)
  2. Add error handling to storage healing join_next (#2)
  3. Verify code hash collector capacity for single-flush approach (#6)

Nice-to-have:

Overall verdict: This is solid performance engineering work, but needs verification of the concurrent write safety before merging. The risk assessment accurately identifies pipelined writes as medium risk - I recommend testing thoroughly on a full mainnet sync to validate trie root correctness.


Automated review by Claude (Anthropic) · custom prompt


greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR optimizes snap sync’s hottest write paths by removing per-key allocations in trie batch writes, adding a streaming put_batch_no_alloc for BackendTrieDB, merging the temp-RocksDB scan in insert_accounts into a single pass, and reducing healing overhead by avoiding block_on inside spawn_blocking and allowing up to 2 DB write tasks in flight. It also adds debug timing instrumentation around major flush/write steps and sizes the trie-building thread pool using available_parallelism().

Key integration points: the trie builder (crates/common/trie/trie_sorted.rs) now relies on the faster TrieDB::put_batch_no_alloc implementation in crates/storage/trie.rs, while healing uses Store’s new synchronous storage-trie batch write helper to keep blocking work on blocking threads.

Issues to fix before merge: (1) insert_accounts now buffers all code hashes in-memory until trie build completes, which can cause unbounded memory growth on large snapshots; and (2) state healing’s JoinSet draining logic does not surface task panics, risking silent DB write failure while continuing the heal loop.

Confidence Score: 2/5

  • This PR has material merge blockers due to unbounded memory growth and insufficient error surfacing in background DB write tasks.
  • The performance optimizations are straightforward, but insert_accounts now accumulates potentially huge code-hash before flushing, which can OOM on large datasets, and state healing’s JoinSet handling does not propagate task panics, risking silent data loss or incomplete healing.
  • crates/networking/p2p/sync/snap_sync.rs and crates/networking/p2p/sync/healing/state.rs

Important Files Changed

Filename Overview
crates/common/trie/trie_sorted.rs Adds debug timing around flush_nodes_to_write and uses available_parallelism() to size the thread pool dynamically; no functional changes observed.
crates/networking/p2p/sync/healing/state.rs Pipelines DB writes (2 in-flight) and avoids overwriting encoded nodes via BTreeMap::entry; however, current JoinSet draining logic does not surface task panics, risking silent DB write failures.
crates/networking/p2p/sync/healing/storage.rs Pipelines storage healing DB writes and switches to sync Store::write_storage_trie_nodes_batch_sync in blocking tasks; uses BTreeMap::entry to avoid overwrites; changes look correct.
crates/networking/p2p/sync/snap_sync.rs Merges the temp-DB scan into the trie build pass and adds timing logs, but now buffers all code hashes in-memory until after trie build, which can cause large memory growth/OOM on big snapshots.
crates/storage/store.rs Extracts shared storage trie write logic and adds a sync batch writer to avoid async/blocking thread hops; behavior remains equivalent (commit at end).
crates/storage/trie.rs Optimizes put_batch to avoid per-key Vec allocation and adds a streaming put_batch_no_alloc override; also adds debug timing. Main behavior is preserved.

Sequence Diagram

sequenceDiagram
    participant Snap as snap_sync::insert_accounts
    participant TmpDB as RocksDB(temp accounts)
    participant Trie as trie_sorted::trie_from_sorted_accounts_wrap
    participant TrieDB as BackendTrieDB(TrieDB)
    participant HealS as healing::state::heal_state_trie
    participant HealSt as healing::storage::heal_storage_trie
    participant Store as storage::Store

    Snap->>TmpDB: ingest_external_file(SSTs)
    Snap->>TmpDB: full_iterator(Start)
    Snap->>Trie: build trie from sorted iterator
    Trie->>TrieDB: flush_nodes_to_write()
    TrieDB->>TrieDB: put_batch_no_alloc(nodes)
    TrieDB->>TrieDB: begin_write
    loop for each node
        TrieDB->>TrieDB: encode_to_vec
        TrieDB->>TrieDB: tx.put(cf, key, value)
    end
    TrieDB->>TrieDB: tx.commit
    Snap->>Snap: collect code_hashes during inspect()
    Snap->>Store: code_hash_collector.add/flush (after trie build)

    HealS->>HealS: collect nodes_to_write
    HealS->>HealS: spawn_blocking (encode + db write)
    HealS->>Store: open_direct_state_trie
    Store->>TrieDB: db.put_batch(encoded)

    HealSt->>HealSt: drain nodes_to_write per account
    HealSt->>HealSt: spawn_blocking (encode + db write)
    HealSt->>Store: write_storage_trie_nodes_batch_sync(updates)
    Store->>Store: begin_write
    Store->>Store: txn.put/delete (per node)
    Store->>Store: txn.commit



github-actions bot commented Feb 9, 2026

Lines of code report

Total lines added: 102
Total lines removed: 0
Total lines changed: 102

Detailed view
+------------------------------------------------------+-------+------+
| File                                                 | Lines | Diff |
+------------------------------------------------------+-------+------+
| ethrex/crates/common/trie/trie_sorted.rs             | 440   | +10  |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/mod.rs     | 34    | +27  |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/state.rs   | 392   | +5   |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/storage.rs | 626   | +11  |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                       | 2441  | +13  |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/trie.rs                        | 184   | +36  |
+------------------------------------------------------+-------+------+


@greptile-apps greptile-apps bot left a comment


6 files reviewed, 2 comments


Comment on lines 960 to 974
@@ -976,6 +969,9 @@ async fn insert_accounts(
.account_tries_inserted
.fetch_add(1, Ordering::Relaxed);
let account_state = AccountState::decode(v).expect("We should have accounts here");
if account_state.code_hash != *EMPTY_KECCACK_HASH {
code_hashes.push(account_state.code_hash);
}


Unbounded code hash buffer
insert_accounts now collects all non-empty code_hash values into code_hashes: Vec<H256> and only flushes after trie_from_sorted_accounts_wrap finishes. On large snapshots this can grow to millions of hashes and OOM before the trie build completes. This used to be bounded by flush_if_needed() during the scan. Consider flushing incrementally during the iterator pass (e.g., push+flush when the collector is full) or using a bounded buffer.


Contributor Author


Fixed in 08ba6f92c. Removed the intermediate Vec<H256> entirely — code hashes are now added directly into CodeHashCollector's internal HashSet during the .inspect() callback. The HashSet deduplicates, so memory is bounded by unique contract accounts (~5M on mainnet at 32 bytes each = ~160MB) instead of growing unboundedly with duplicates.

flush_if_needed() is called once after the trie build completes. We can't call it during the iterator because trie_from_sorted_accounts_wrap is synchronous and flush_if_needed() is async.

Comment on lines 285 to 292
let store = store.clone();
// NOTE: we keep only a single task in the background to avoid out of order deletes
if !db_joinset.is_empty() {
// Allow up to 2 tasks in flight to overlap encoding with DB commit
while db_joinset.len() >= 2 {
db_joinset
.join_next()
.await
.expect("we just checked joinset is not empty")?;
.expect("joinset is not empty")?;
}

JoinSet errors dropped
This loop uses join_next().await.expect(...)?; but the spawned closure returns (), so task panics/errors are effectively ignored (and ? here can only apply to the Option, not to the task result). That means DB write panics can be silently swallowed and healing can proceed as if writes succeeded. You likely want to match join_next().await and handle Err(JoinError) explicitly (and propagate/abort on panic).


Contributor


I think the bot's suggestion is wrong, but I would check the task's types so that it returns errors.


You're correct — my comment was wrong.

The spawned closure does return Result<(), SyncError> (you can see Ok(()) at the end of the closure). The JoinSet type is JoinSet<Result<(), SyncError>>, so join_next().await returns:

  • Some(Ok(Ok(()))) — task succeeded, closure returned Ok
  • Some(Ok(Err(e))) — task succeeded, closure returned Err
  • Some(Err(JoinError)) — task panicked/cancelled

The PR properly handles all three cases via the wait_for_pending_task() helper added in healing/mod.rs (lines 22-28 of the diff), which matches on the result and propagates errors correctly. This is actually a fix — before the PR, the closures used .expect() which would panic instead of returning errors.

The error handling is correct as-is. My apologies for the noise!

Contributor

Copilot AI left a comment


Pull request overview

Optimizes snap-sync and healing write hot paths to reduce allocations, redundant iteration, and thread scheduling overhead while rebuilding / repairing the Ethereum state trie backed by RocksDB.

Changes:

  • Eliminates per-key Vec allocations in trie batch writes and adds a non-allocating put_batch_no_alloc fast path for BackendTrieDB.
  • Collapses insert_accounts from a two-pass temp-DB scan into a single pass during trie building, with added timing instrumentation.
  • Improves healing write throughput by allowing up to 2 in-flight DB write tasks, deduping parent-delete markers via BTreeMap::entry, and removing async block_on bridging for storage healing writes.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • crates/storage/trie.rs: Removes per-key batch allocations and adds put_batch_no_alloc override; adds timing logs.
  • crates/storage/store.rs: Adds sync storage-trie batch write path and shared inner implementation to avoid extra thread hops.
  • crates/common/trie/trie_sorted.rs: Adds flush timing and uses available_parallelism() for thread pool sizing.
  • crates/networking/p2p/sync/snap_sync.rs: Single-pass account insertion (collect code hashes during trie build) + timing logs.
  • crates/networking/p2p/sync/healing/state.rs: Allows 2 in-flight write tasks, fixes parent-delete overwrites via entry, adds timing logs.
  • crates/networking/p2p/sync/healing/storage.rs: Allows 2 in-flight write tasks, fixes parent-delete overwrites via entry, switches to sync DB write, adds timing logs.


Comment on lines 960 to 962
let start = std::time::Instant::now();
let mut code_hashes: Vec<H256> = Vec::new();
let iter = db.full_iterator(rocksdb::IteratorMode::Start);

Copilot AI Feb 9, 2026


code_hashes collects every non-empty account code hash into a Vec during the trie-build pass. On large networks this can grow very large (and includes duplicates), increasing peak RSS and risking OOM compared to the previous streaming flush.

Consider avoiding the intermediate Vec by either (a) adding hashes directly into CodeHashCollector during .inspect() (no await), and flushing/spawning file dumps in a bounded way after the trie build, or (b) sending hashes to a background writer via a channel so the buffer can be flushed incrementally while the trie is being built.

if !db_joinset.is_empty() {
// Allow up to 2 tasks in flight to overlap encoding with DB commit
while db_joinset.len() >= 2 {
db_joinset.join_next().await;

Copilot AI Feb 9, 2026


The join_next().await result is ignored here. If a background DB write task panics or returns an error, this will be silently dropped and healing may continue/finish without persisting data.

Handle the returned Option<Result<_, JoinError>> (and propagate/log failures) when draining the JoinSet, similar to how state healing uses expect(...) + ? on join_next().

Suggested change
db_joinset.join_next().await;
match db_joinset.join_next().await {
Some(Ok(_)) => {
// successfully completed background DB write task
}
Some(Err(e)) => {
error!(?e, "background DB write task failed during storage healing");
}
None => {
// no more tasks to join
break;
}
}

Contributor Author


Fixed in 08ba6f92c. The join_next() result is now checked and JoinError is propagated as SyncError::JoinHandle(e), which will surface task panics instead of silently dropping them.

@pablodeymo
Contributor Author

Re: 8. Code Duplication

The shared part between state.rs and storage.rs is only the inner 4-line encoding loop (for i in 0..path.len() { entry().or_insert }; insert(path, encode)). The surrounding structures are different:

  • State healing: flat Vec<(Nibbles, Node)> → single BTreeMap → db.put_batch()
  • Storage healing: HashMap<H256, Vec<(Nibbles, Node)>> → per-account BTreeMap → store.write_storage_trie_nodes_batch_sync()

Extracting a shared function for a 4-line loop that operates on different input/output types would add a function boundary and import without meaningful deduplication. Keeping them inline is simpler and more readable.

The original single-task constraint exists to prevent out-of-order deletes:
batch N generates empty markers for ancestor paths of its committed nodes,
and batch N+1 may write real node data at those same ancestor paths. If
RocksDB serializes batch N's commit after batch N+1's, the empty marker
overwrites the valid branch node, corrupting the trie.
@pablodeymo
Contributor Author

Re: 1. Storage healing: Potential data race with concurrent writes

Good catch — this is a real race condition. Here's the concrete scenario:

  1. Batch N commits leaf [0,1,2,3] → generates empty markers for ancestors [0,1,2], [0,1], [0]
  2. Between flushes, leaf [0,1,2,7] arrives. Its commit triggers commit_node for parent [0,1,2] (pending_children_count reaches 0), which goes into batch N+1 with real branch data
  3. Both batches are in flight. RocksDB's write lock doesn't guarantee FIFO ordering — if batch N's WriteBatch commits after N+1, the empty marker at [0,1,2] overwrites the valid branch node

The original comment ("avoid out of order deletes") was there for exactly this reason.

Reverted to single-task writes in aeea7f0ce, with an expanded comment explaining the safety invariant.

…rror handling

- Remove intermediate Vec<H256> for code hashes in insert_accounts. Instead,
  add hashes directly into CodeHashCollector's HashSet during the .inspect()
  callback. The HashSet deduplicates, bounding memory to unique contract
  accounts (~5M on mainnet = ~160MB) instead of growing with duplicates.
  flush_if_needed() is called once after the sync trie build completes.

- Handle JoinSet errors in storage healing: propagate JoinError from
  db_joinset.join_next() instead of silently dropping task panics.
@pablodeymo pablodeymo changed the title Optimize snap sync insertion and healing write paths perf(l1): optimize snap sync insertion and healing write paths Feb 9, 2026
@github-actions github-actions bot added L1 Ethereum client performance Block execution throughput and performance in general labels Feb 9, 2026
@pablodeymo
Contributor Author

@greptile-apps review again

@greptile-apps

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR focuses on reducing snap-sync/healing write-path overhead by removing avoidable allocations and iterations, and adding lightweight timing instrumentation.

Key changes:

  • BackendTrieDB::put_batch no longer allocates a 1-element Vec per key; it writes each KV directly into the underlying write batch and commits once.
  • BackendTrieDB now overrides put_batch_no_alloc to encode-and-write in a single pass without building an intermediate Vec.
  • insert_accounts removes an extra full scan of the temp RocksDB; code hashes are collected during the trie-build scan and flushed after.
  • Healing paths improve batch encoding correctness (entry().or_insert to avoid overwriting real nodes with delete markers) and reduce scheduler overhead by adding a synchronous storage write method.

Issue to fix before merge:

  • Both healing modules use db_joinset.join_all().await on exit paths without checking the returned task results, which can swallow panics from DB write tasks and allow healing to report success despite missing writes.

Confidence Score: 4/5

  • Generally safe performance-focused changes, but healing completion paths can still ignore DB write task panics.
  • Most changes are localized perf/instrumentation and keep semantics, and prior review-thread issues were addressed; the remaining concrete issue is JoinSet::join_all() dropping JoinErrors in both healing modules, which can mask failed DB writes at the end of healing.
  • crates/networking/p2p/sync/healing/state.rs, crates/networking/p2p/sync/healing/storage.rs

Important Files Changed

  • crates/common/trie/trie_sorted.rs: Adds timing debug logs to flushes and uses available_parallelism() for thread pool size; no functional changes found.
  • crates/networking/p2p/sync/healing/state.rs: Fixes parent-path deletion markers using BTreeMap::entry, adds timing, but still drops JoinSet task errors on join_all() at completion/stale exits.
  • crates/networking/p2p/sync/healing/storage.rs: Switches storage batch encoding to dedup via BTreeMap and uses sync DB write; however db_joinset.join_all().await ignores possible JoinErrors from DB tasks.
  • crates/networking/p2p/sync/snap_sync.rs: Merges account scan passes and adds timing; code hash collection now happens during trie build with a single flush afterward (per prior thread fix).
  • crates/storage/store.rs: Extracts shared storage trie node write logic and adds a synchronous variant to avoid block_on inside spawn_blocking; no correctness issues found.
  • crates/storage/trie.rs: Optimizes BackendTrieDB::put_batch to avoid per-key Vec alloc and adds a non-allocating batch write override with timing logs; behavior preserved.

Sequence Diagram

sequenceDiagram
    participant HealState as heal_state_trie/heal_storage_trie
    participant JS as db_joinset
    participant Block as spawn_blocking
    participant Store as Store
    participant DB as TrieDB/BackendTrieDB

    loop Healing loop
        HealState->>HealState: accumulate nodes_to_write
        alt flush threshold OR done OR stale
            opt previous task exists
                HealState->>JS: join_next().await
                JS-->>HealState: Result<(), JoinError>
            end
            HealState->>Block: spawn_blocking(encode + write)
            Block->>Store: open trie / begin_write
            Store->>DB: put_batch / write_storage_trie_nodes_batch_sync
            DB-->>Store: commit
            Store-->>Block: Ok
            Block-->>JS: task completes
        end
    end

    alt exit (done/stale)
        HealState->>JS: join_all().await
        JS-->>HealState: "JoinErrors currently ignored"
    end


@greptile-apps greptile-apps bot left a comment


6 files reviewed, 2 comments


@greptile-apps

greptile-apps bot commented Feb 9, 2026

Additional Comments (2)

crates/networking/p2p/sync/healing/storage.rs
Errors from DB tasks ignored

db_joinset.join_all().await; (and the equivalent on the stale path) drops JoinErrors from the spawn_blocking DB write tasks, so a panic in the write closure can be silently ignored and healing can return Ok(true/false) as if data was committed. This should propagate/abort on JoinError (at least panic) similarly to the earlier join_next() handling.


crates/networking/p2p/sync/healing/state.rs
JoinSet results are dropped

db_joinset.join_all().await; discards JoinErrors from the spawn_blocking batch write tasks. If any DB write panics, state healing can still complete and return Ok(paths.is_empty()) with missing data. Consider iterating join_next().await and returning an error on Err(JoinError) (and/or logging panics) so write failures don’t get swallowed.


…in healing

state.rs used .expect().? while storage.rs used pattern matching — now both
use a shared helper that eliminates the .expect() panic path.
…awn_blocking closures

- Closures now return Result<(), SyncError> instead of panicking on DB errors
- Replace join_all() with drain_pending_tasks() that checks each task result
- Skip spawning no-op tasks when the write buffer is empty
@pablodeymo
Contributor Author

Addressed the remaining concern about JoinSet::join_all() dropping JoinErrors in both healing modules:

  • spawn_blocking closures now return Result<(), SyncError> instead of panicking with .expect() — DB errors propagate as recoverable errors
  • Replaced join_all() with drain_pending_tasks() that checks each task's Result and propagates the first error
  • Extracted shared helpers wait_for_pending_task() and drain_pending_tasks() in healing/mod.rs so both state and storage healing use the same pattern
  • Added if !to_write.is_empty() guard to avoid spawning no-op tasks

Commits: 2fb356a8e, 4c30718d8

@pablodeymo
Contributor Author

Re: Greptile's two additional comments about db_joinset.join_all().await in both storage.rs and state.rs:

Both are fixed in commits 2fb356a8e and 4c30718d8:

  • join_all() replaced with drain_pending_tasks() — iterates join_next().await in a loop, checks each task's Result, and returns the first error encountered (either a JoinError from a panic or a SyncError from the closure)
  • Closures return Result<(), SyncError> instead of panicking with .expect() — DB failures are now propagated as recoverable errors, not panics
  • Both state.rs and storage.rs use the same shared helpers from healing/mod.rs

@pablodeymo
Contributor Author

@greptile-apps review again
comments fixed

@greptile-apps

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR optimizes snap-sync and healing write hot paths by (1) removing per-key Vec allocations in BackendTrieDB::put_batch, (2) adding a put_batch_no_alloc override that encodes/writes nodes in one pass, (3) merging the prior two-pass temp-RocksDB scan in insert_accounts by collecting code hashes during trie build, and (4) reducing healing overhead by avoiding async block_on bridges, fixing parent-marker overwrites via BTreeMap::entry, and centralizing JoinSet error propagation helpers.

Overall the changes fit cleanly into the existing storage/trie abstractions and the healing pipeline (single in-flight DB write), and the added tracing helps validate perf wins in production runs. The main merge blockers are remaining expect() panics in insert_accounts (external inputs) and a silent invariant-masking None branch in wait_for_pending_task.

Confidence Score: 3/5

  • This PR is close to mergeable but still has a couple of crash-risk / invariant-handling issues to address.
  • Core performance refactors look consistent with existing abstractions and add error propagation in healing, but insert_accounts still contains new expect() panics on external data paths, and the JoinSet helper currently masks an invariant violation by treating None as success after checking non-empty.
  • crates/networking/p2p/sync/snap_sync.rs, crates/networking/p2p/sync/healing/mod.rs

Important Files Changed

  • crates/common/trie/trie_sorted.rs: Adds debug timing around flush_nodes_to_write and switches trie build ThreadPool size to available_parallelism().
  • crates/networking/p2p/sync/healing/mod.rs: Introduces JoinSet helpers wait_for_pending_task/drain_pending_tasks for propagating DB write task errors.
  • crates/networking/p2p/sync/healing/state.rs: Changes healing DB write batching to use BTreeMap entry markers, adds instrumentation, and moves JoinSet handling to shared helpers; still does sync decoding inside iterator with panics/ignores.
  • crates/networking/p2p/sync/healing/storage.rs: Fixes parent-marker overwrite via BTreeMap entry API, switches to sync Store write inside spawn_blocking, and adopts JoinSet helpers for error propagation.
  • crates/networking/p2p/sync/snap_sync.rs: Merges account scan passes by collecting code hashes during trie build; introduces debug timing but adds unconditional expect()s on RocksDB and RLP decode.
  • crates/storage/store.rs: Extracts storage trie node write inner loop and adds a synchronous write_storage_trie_nodes_batch_sync variant used by healing.
  • crates/storage/trie.rs: Optimizes BackendTrieDB put_batch to avoid per-key Vec allocations and adds put_batch_no_alloc override with debug timing; still clones Nibbles per node.

Sequence Diagram

sequenceDiagram
    participant Snap as snap_sync::insert_accounts
    participant TempDB as RocksDB temp accounts DB
    participant Trie as trie_from_sorted_accounts_wrap
    participant Collector as CodeHashCollector
    participant StateDB as BackendTrieDB
    participant Heal as healing::{state,storage}
    participant JS as JoinSet

    Snap->>TempDB: ingest_external_file(snapshots)
    Snap->>TempDB: full_iterator(Start)
    Snap->>Trie: build trie from sorted accounts
    Trie-->>Snap: iter.inspect(decode AccountState)
    Snap->>Collector: add(code_hash)
    Snap->>StateDB: put_batch_no_alloc(nodes)
    Snap->>Collector: flush_if_needed() (after trie build)

    Heal->>JS: spawn_blocking(DB write task)
    Heal->>JS: wait_for_pending_task(join_next)
    Heal->>StateDB: put_batch(encoded nodes)
    Heal->>JS: drain_pending_tasks() on done/stale


@greptile-apps greptile-apps bot left a comment


7 files reviewed, 2 comments


Comment on lines 972 to 974
.account_tries_inserted
.fetch_add(1, Ordering::Relaxed);
let account_state = AccountState::decode(v).expect("We should have accounts here");


Panics on bad input

insert_accounts now hard-panics on any temp RocksDB iterator error (.expect(...)) and on any malformed account value (AccountState::decode(v).expect(...)). These are external inputs (snapshot files / DB iterator) and can be corrupted; panicking will crash the sync process instead of returning SyncError like the rest of this function does.

Consider propagating these as SyncError::RocksDBError / SyncError::Rlp instead of using expect inside the iterator (map_err(...)?).


Contributor


This absolutely shouldn't happen. The only scenario is where the user has a faulty file system.

Comment on lines 15 to +22

// Re-export shared types for external use
#[allow(unused_imports)]
pub use types::{HealingQueueEntry, StateHealingQueue};

/// Waits for a pending task in the JoinSet to complete, propagating any error.
/// Used to ensure only a single background DB write is in flight at a time.
async fn wait_for_pending_task(


Incorrect JoinSet None case

In wait_for_pending_task, the None arm is treated as success (Some(Ok(Ok(()))) | None => {}), but None from join_next().await can only happen when the JoinSet is empty. Since this function only calls join_next() after checking !joinset.is_empty(), hitting None indicates a logic invariant violation and would silently drop a pending write.

It’s safer to handle None as unreachable!()/error (or remove it) so a broken invariant doesn’t get masked.


@pablodeymo
Contributor Author

Re: Greptile's two latest comments:

1. Panics on bad input in insert_accounts (.expect() at lines 969, 974)

These are pre-existing — line 969 already has a // TODO: remove unwrap comment. We didn't introduce them; we only added code_hash_collector.add() inside the existing .inspect() closure. Fixing them properly requires changing the trie_from_sorted_accounts_wrap API to accept fallible iterators (Result<(H256, Vec<u8>), _> instead of (H256, Vec<u8>)), which is a separate refactor beyond this PR's scope.

2. Incorrect JoinSet None case in wait_for_pending_task

Valid point. join_next() can only return None when the set is empty, and we check !is_empty() first, so None is unreachable. Changed from silently ignoring it to unreachable!("join_next returned None after is_empty check") so a broken invariant doesn't get masked.

Fixed in commit 0c10e154c.

@pablodeymo
Contributor Author

@greptile-apps review it again

@greptile-apps

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR optimizes snap sync’s hottest write paths by (1) removing per-key Vec allocations in BackendTrieDB::put_batch and adding a true streaming put_batch_no_alloc implementation, (2) merging the temp-RocksDB account scan into the trie-build pass while collecting code hashes during the same iteration, and (3) simplifying healing writes by avoiding a double thread-hop, preventing parent-marker overwrites via BTreeMap::entry().or_insert, and propagating errors out of background DB tasks using shared JoinSet helpers.

The changes fit into the codebase’s existing architecture by keeping trie insertion logic within the TrieDB abstraction, keeping RocksDB write batching inside Store transactions, and keeping healing’s concurrency model (at most one DB write in flight to preserve delete ordering) but making it explicit and error-aware.

Confidence Score: 3/5

  • Mostly safe performance refactor, but one remaining panic-on-input path should be fixed before merging.
  • The write-path and healing changes are localized and preserve existing transaction/commit semantics, and the JoinSet helpers improve correctness by propagating background task failures. However, insert_accounts still uses expect() on RocksDB iterator items and RLP decoding, which will deterministically crash snap sync on iterator errors or malformed snapshot data; this should be converted to SyncError propagation to match surrounding error handling.
  • crates/networking/p2p/sync/snap_sync.rs

Important Files Changed

  • crates/networking/p2p/sync/snap_sync.rs: Merged temp-DB scan into trie-build pass and added timings; still contains expect() panics on RocksDB iterator errors and RLP decode in the new iterator pipeline.
  • crates/storage/trie.rs: Optimized trie batch write path by removing per-key Vec allocation, added put_batch_no_alloc override, and added debug timing; change is localized and preserves commit semantics.
  • crates/storage/store.rs: Extracted shared storage-trie write inner loop and added synchronous wrapper for use inside spawn_blocking, reducing thread-hopping without altering write/delete behavior.
  • crates/networking/p2p/sync/healing/mod.rs: Introduced JoinSet helpers to wait/drain background DB tasks with proper error propagation; None case is now treated as unreachable after is_empty check.
  • crates/networking/p2p/sync/healing/state.rs: Fixed parent-marker overwrite bug using entry().or_insert, avoided spawning no-op DB tasks, added timing, and switched to helper-based JoinSet error handling.
  • crates/networking/p2p/sync/healing/storage.rs: Applied same parent-marker overwrite fix via BTreeMap entry, switched to synchronous store write inside spawn_blocking, and adopted shared JoinSet draining with error propagation.
  • crates/common/trie/trie_sorted.rs: Added debug timing for flush_nodes_to_write and replaced hardcoded thread count with available_parallelism() fallback; no functional trie logic changes.
  • CHANGELOG.md: Added perf changelog entry describing snap sync insertion/healing write-path optimizations.

Sequence Diagram

sequenceDiagram
    participant Snap as snap_sync::insert_accounts
    participant Rocks as Temp RocksDB iterator
    participant Trie as trie_sorted::trie_from_sorted_accounts_wrap
    participant DB as BackendTrieDB/Store
    participant HealS as healing::state
    participant HealSt as healing::storage
    participant JS as JoinSet<Result<(), SyncError>>

    Snap->>Rocks: full_iterator(Start)
    Snap->>Trie: trie_from_sorted_accounts_wrap(db, iter.map/inspect)
    loop for each (k,v)
        Rocks-->>Trie: (key,value)
        Trie-->>Snap: inspect(k,v)
        Snap->>Snap: AccountState::decode(v)
        Snap->>Snap: code_hash_collector.add(code_hash)
        Snap->>Snap: storage_accounts insert(storage_root)
    end
    Trie-->>Snap: compute_state_root
    Snap->>Snap: code_hash_collector.flush_if_needed().await

    par Healing state writes
        HealS->>JS: wait_for_pending_task(join_next)
        HealS->>JS: spawn_blocking(encode + db.put_batch)
        JS->>DB: open_direct_state_trie + put_batch + commit
        HealS->>JS: drain_pending_tasks()
    and Healing storage writes
        HealSt->>JS: wait_for_pending_task(join_next)
        HealSt->>JS: spawn_blocking(encode + write_storage_trie_nodes_batch_sync)
        JS->>DB: begin_write + put/delete + commit
        HealSt->>JS: drain_pending_tasks()
    end


@greptile-apps greptile-apps bot left a comment


8 files reviewed, 1 comment


Comment on lines 966 to 974
```diff
@@ -976,6 +972,9 @@ async fn insert_accounts(
     .account_tries_inserted
     .fetch_add(1, Ordering::Relaxed);
     let account_state = AccountState::decode(v).expect("We should have accounts here");
```

Panics on iterator/rlp decode

insert_accounts still has expect(...) calls inside the RocksDB iterator (.map(|k| k.expect(...)) and AccountState::decode(v).expect(...)). These are external inputs (temp RocksDB iterator + snapshot content), so a single iterator error or malformed RLP will crash the whole sync instead of returning a SyncError like the rest of the function does. This should be converted to map_err(...)? / Result propagation so snap sync can fail gracefully with a useful error.
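A minimal sketch of the conversion the review suggests, using a `String` stand-in for `SyncError` and a stub `decode` in place of `AccountState::decode` (both hypothetical here): each fallible step becomes a `?`-propagated `Result` instead of an `expect` that would abort the whole sync.

```rust
// Stub for AccountState::decode — returns the first byte, or an error on
// malformed input. Purely illustrative; not the real RLP decoder.
fn decode(v: &[u8]) -> Result<u32, String> {
    v.first().copied().map(u32::from).ok_or_else(|| "malformed RLP".to_string())
}

// Simplified shape of the suggested fix: iterator items are Results, and both
// the iterator error and the decode error propagate via `?` instead of panicking.
fn insert_accounts(items: Vec<Result<Vec<u8>, String>>) -> Result<Vec<u32>, String> {
    let mut out = Vec::new();
    for item in items {
        // Was: item.expect(...) — a single bad entry crashed the whole sync.
        let v = item.map_err(|e| format!("iterator error: {e}"))?;
        out.push(decode(&v)?);
    }
    Ok(out)
}

fn main() {
    assert!(insert_accounts(vec![Ok(vec![1u8]), Ok(vec![7u8])]).is_ok());
    assert!(insert_accounts(vec![Err("io error".to_string())]).is_err());
    assert!(insert_accounts(vec![Ok(vec![])]).is_err());
}
```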


@github-actions

github-actions bot commented Feb 9, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 63.109 ± 0.359 | 62.547 | 63.467 | 1.00 |
| head | 63.228 ± 0.269 | 62.743 | 63.754 | 1.00 ± 0.01 |

```diff
 }
 scope(|s| {
-    let pool = ThreadPool::new(12, s);
+    let thread_count = std::thread::available_parallelism()
```
Contributor

This should be at least 2. If there is only a single thread, it will wait on the write buffers getting freed.
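A small std-only sketch of the change being discussed: derive the worker count from the machine rather than hardcoding 12, and clamp to a minimum of 2 so one thread can drain write buffers while another fills them. `worker_count` is a hypothetical helper, not the name used in the PR.

```rust
use std::thread;

// Hypothetical helper: machine-derived thread count with a floor of 2.
fn worker_count() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(12) // fall back to the previous hardcoded value
        .max(2)        // never a single thread, per the review comment
}

fn main() {
    assert!(worker_count() >= 2);
}
```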

Comment on lines 285 to 292
```diff
     let store = store.clone();
-    // NOTE: we keep only a single task in the background to avoid out of order deletes
-    if !db_joinset.is_empty() {
+    // Allow up to 2 tasks in flight to overlap encoding with DB commit
+    while db_joinset.len() >= 2 {
         db_joinset
             .join_next()
             .await
-            .expect("we just checked joinset is not empty")?;
+            .expect("joinset is not empty")?;
     }
```
Contributor

I think the suggestion from the bot is wrong, but I would check the types of the task to make it return errors.
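The error-typed-task shape being asked for can be sketched with std threads (the PR itself uses a tokio `JoinSet`; `String` stands in for `SyncError`, and the batch contents are illustrative): at most 2 tasks stay in flight, each returns a `Result`, and joining propagates the error instead of swallowing it.

```rust
use std::thread::{self, JoinHandle};

// Std-thread analogue of the bounded-concurrency pattern in the diff above:
// keep at most 2 write tasks in flight so encoding overlaps the DB commit,
// and surface task errors to the caller.
fn run_batches(batches: Vec<Vec<u8>>) -> Result<(), String> {
    let mut in_flight: Vec<JoinHandle<Result<(), String>>> = Vec::new();
    for batch in batches {
        // Block until fewer than 2 tasks are pending, propagating their errors.
        while in_flight.len() >= 2 {
            in_flight
                .remove(0)
                .join()
                .map_err(|_| "task panicked".to_string())??;
        }
        in_flight.push(thread::spawn(move || {
            // Encode + commit would happen here; treat empty batches as errors.
            if batch.is_empty() {
                return Err("empty batch".to_string());
            }
            Ok(())
        }));
    }
    // Drain the remainder, returning the first error encountered.
    for handle in in_flight {
        handle.join().map_err(|_| "task panicked".to_string())??;
    }
    Ok(())
}

fn main() {
    assert!(run_batches(vec![vec![1u8], vec![2], vec![3]]).is_ok());
    assert!(run_batches(vec![vec![1u8], vec![]]).is_err());
}
```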

Comment on lines 972 to 974
```rust
    .account_tries_inserted
    .fetch_add(1, Ordering::Relaxed);
let account_state = AccountState::decode(v).expect("We should have accounts here");
```
Contributor

This absolutely shouldn't happen. The only scenario is where the user has a faulty file system.


```rust
/// Drains all pending tasks from the JoinSet, propagating the first error encountered.
async fn drain_pending_tasks(
    joinset: &mut JoinSet<Result<(), SyncError>>,
```
Contributor

When drain_pending_tasks encounters the first error, it returns immediately and remaining tasks stay in the JoinSet. On drop, those tasks get cancelled mid-execution. Since WriteBatch::commit is atomic at the RocksDB level, partial writes shouldn't happen — but any tasks that haven't called commit yet will lose their work silently.

Is that acceptable here? The caller (healing loop) presumably retries from scratch on error, so losing in-flight work is fine. Just want to confirm that's the intent, since the old join_all would have waited for all tasks to complete (though it didn't propagate errors).
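The alternative semantics mentioned here (old `join_all` behavior, but with error propagation) can be sketched with std threads as a stand-in for the tokio JoinSet: join every handle so no in-flight work is cancelled, remember the first error, and return it only after all tasks finish. `String` again stands in for `SyncError`.

```rust
use std::thread::JoinHandle;

// Join-all drain: waits for every task (no mid-execution cancellation),
// then reports the first error seen, if any.
fn drain_all(handles: Vec<JoinHandle<Result<(), String>>>) -> Result<(), String> {
    let mut first_err = None;
    for h in handles {
        let result = h
            .join()
            .map_err(|_| "task panicked".to_string())
            .and_then(|r| r); // flatten join error and task error
        if let Err(e) = result {
            if first_err.is_none() {
                first_err = Some(e);
            }
        }
    }
    first_err.map_or(Ok(()), Err)
}

fn main() {
    let ok = std::thread::spawn(|| Ok::<(), String>(()));
    let bad = std::thread::spawn(|| Err::<(), String>("boom".to_string()));
    assert!(drain_all(vec![ok, bad]).is_err());
    assert!(drain_all(vec![std::thread::spawn(|| Ok::<(), String>(()))]).is_ok());
}
```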

```rust
let account_count = to_write.len();
let mut encoded_to_write = vec![];
for (hashed_account, nodes) in to_write {
    let mut account_nodes = std::collections::BTreeMap::new();
```
Contributor

nit: std::collections::BTreeMap::new() — this could use a use std::collections::BTreeMap; import at the top of the closure or file, matching the pattern in state.rs where BTreeMap is already imported.
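The parent-marker overwrite fix described in the change summary (bottleneck #4) can be sketched like this: a plain `insert` lets a later empty parent marker clobber real node data, while `entry(...).or_insert_with` only fills the slot if it is vacant. Keys and values here are illustrative stand-ins for node hashes and encoded nodes.

```rust
use std::collections::BTreeMap;

// Hypothetical helper mirroring the fix: record a parent marker without
// overwriting node data that already arrived.
fn add_parent_marker(nodes: &mut BTreeMap<String, Vec<u8>>, key: &str) {
    nodes.entry(key.to_string()).or_insert_with(Vec::new);
}

fn main() {
    let mut nodes = BTreeMap::new();
    nodes.insert("0xab".to_string(), vec![1, 2, 3]); // real node data first
    add_parent_marker(&mut nodes, "0xab"); // marker must not clobber it
    assert_eq!(nodes["0xab"], vec![1, 2, 3]);
    add_parent_marker(&mut nodes, "0xcd"); // fresh key does get the empty marker
    assert!(nodes["0xcd"].is_empty());
}
```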

```rust
let start = std::time::Instant::now();
// We collect code hashes directly into the collector's HashSet during the trie
// build pass. The collector deduplicates, so memory is bounded by unique contract
// accounts (~5M on mainnet = ~160MB). We can't call flush_if_needed() here because
```
Contributor

The comment says ~5M entries = ~160MB, but HashSet<H256> has per-entry overhead beyond the raw 32-byte key — typically ~72 bytes/entry on 64-bit (key + hash + bucket metadata + load factor headroom). At 5M entries that's closer to 350-400MB peak during the synchronous trie build.

This might be fine for the target hardware running snap sync, but the comment understates actual memory usage. Worth either correcting the estimate or noting that it's the raw key size only.
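The arithmetic behind the two estimates, for reference: the ~72 B/entry figure is the reviewer's assumed effective cost (key + hash + bucket metadata + load-factor headroom), not a measured value.

```rust
// Back-of-envelope check: raw 32-byte keys vs an assumed ~72 B/entry HashSet cost.
fn estimate_bytes(entries: u64, per_entry: u64) -> u64 {
    entries * per_entry
}

fn main() {
    let raw = estimate_bytes(5_000_000, 32);  // key bytes only
    let real = estimate_bytes(5_000_000, 72); // assumed per-entry cost incl. overhead
    assert_eq!(raw, 160_000_000);  // ~160 MB — the figure in the code comment
    assert_eq!(real, 360_000_000); // ~360 MB — within the reviewer's 350-400 MB range
}
```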

@github-project-automation github-project-automation bot moved this to In Progress in ethrex_l1 Feb 11, 2026

Labels

L1 Ethereum client performance Block execution throughput and performance in general snapsync

Projects

Status: In Progress
Status: Todo

4 participants