perf(l1): optimize snap sync insertion and healing write paths#6159
pablodeymo wants to merge 9 commits into main
Conversation
- Fix `BackendTrieDB::put_batch` to call `tx.put()` directly instead of `tx.put_batch(vec![...])` per key, eliminating ~20k `Vec` allocations per flush
- Override `put_batch_no_alloc` on `BackendTrieDB` to encode and write nodes in a single pass without intermediate `Vec` collection
- Merge two-pass iteration in `insert_accounts` into a single pass, collecting code hashes during the trie-building `.inspect()` callback
- Add `write_storage_trie_nodes_batch_sync` to `Store` to eliminate the `spawn_blocking` -> `block_on` -> `spawn_blocking` double thread hop in storage healing
- Pipeline healing writes by allowing 2 tasks in flight (state and storage) to overlap encoding with DB commit
- Use `BTreeMap` entry API in both state and storage healing to prevent empty placeholders from overwriting real node data
- Replace hardcoded 12-thread pool with `std::thread::available_parallelism`
- Add `tracing::debug` timing instrumentation to key write paths
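The `put_batch` change above can be sketched with toy types (`MockTx`, `write_allocating`, and `write_direct` are illustrative stand-ins, not ethrex APIs): wrapping each key/value pair in its own `Vec` just to satisfy a batch signature costs one heap allocation per key, while writing pairs directly on the transaction produces identical output without the wrappers.

```rust
// Hypothetical sketch of the put_batch change. Types are stand-ins.
struct MockTx {
    writes: Vec<(Vec<u8>, Vec<u8>)>,
}

impl MockTx {
    fn new() -> Self {
        MockTx { writes: Vec::new() }
    }
    // Direct single-pair write: no intermediate collection needed.
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.writes.push((key, value));
    }
    // Batch API: forces the caller to allocate a Vec even for one pair.
    fn put_batch(&mut self, pairs: Vec<(Vec<u8>, Vec<u8>)>) {
        self.writes.extend(pairs);
    }
}

// Before: one wrapper Vec allocation per key just to satisfy the batch signature.
fn write_allocating(tx: &mut MockTx, nodes: &[(Vec<u8>, Vec<u8>)]) {
    for (k, v) in nodes {
        tx.put_batch(vec![(k.clone(), v.clone())]);
    }
}

// After: the same writes, one call per key, zero wrapper Vecs.
fn write_direct(tx: &mut MockTx, nodes: &[(Vec<u8>, Vec<u8>)]) {
    for (k, v) in nodes {
        tx.put(k.clone(), v.clone());
    }
}

fn main() {
    let nodes = vec![
        (b"k1".to_vec(), b"v1".to_vec()),
        (b"k2".to_vec(), b"v2".to_vec()),
    ];
    let mut a = MockTx::new();
    let mut b = MockTx::new();
    write_allocating(&mut a, &nodes);
    write_direct(&mut b, &nodes);
    // Identical output, fewer allocations on the direct path.
    assert_eq!(a.writes, b.writes);
}
```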
🤖 Kimi Code Review

**Review Summary**

This PR contains several performance optimizations and logging improvements. While most changes are reasonable, there are a few issues that need attention.

**Critical Issues**

1. **Race Condition in State Healing (state.rs:286-287)**: The change from single-task to dual-task execution introduces a potential race condition. The comment about avoiding out-of-order deletes is being ignored. When multiple tasks run concurrently, there's no guarantee that trie node deletions and insertions will be applied in the correct order, which could corrupt the state trie. Recommendation: revert to single-task execution or implement proper ordering guarantees.

2. **Storage Healing Race Condition (storage.rs:215-216)**: Same issue as above - allowing 2 concurrent tasks could lead to storage trie corruption.

**Performance Issues**

3. **Inefficient Encoding in put_batch_no_alloc (trie.rs:134-144)**: The … Recommendation:

    // Use a pre-allocated buffer or implement a proper zero-copy approach
    // Consider batching multiple nodes into a single transaction

**Code Quality Issues**

4. **Redundant Work in Storage Healing (storage.rs:220-230)**: The code is rebuilding the BTreeMap for each account, but the input … Recommendation:

    // Skip the BTreeMap creation if nodes are already sorted
    let account_nodes: Vec<_> = nodes.into_iter().collect();

5. **Magic Numbers**

6. **Error Handling Inconsistency**

**Minor Issues**

7. **Tracing Level**: Consider using …

8. **Code Duplication**: The encoding logic in state.rs and storage.rs is nearly identical and could be extracted into a shared function.

**Positive Changes**

✅ Thread count based on available parallelism - good use of system resources

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

**Findings**
Notes / Suggestions
If you want, I can scan for overlaps in healing batch key ranges to confirm whether the concurrency change is safe.

Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review

**PR Review: Optimize snap sync insertion and healing write paths**

**Summary**

This PR addresses performance bottlenecks in snap sync by optimizing hot paths in trie node writes and healing operations. The optimizations are well-researched, clearly documented, and demonstrate good understanding of the codebase. Overall, this is high-quality work with thoughtful performance improvements.

**Critical Issues**

1. **Storage healing: Potential data race with concurrent writes**
Greptile Overview

**Greptile Summary**

This PR optimizes snap sync’s hottest write paths by removing per-key allocations in trie batch writes, adding a streaming …

Key integration points: the trie builder (…

Issues to fix before merge: (1) …

**Confidence Score: 2/5**
| Filename | Overview |
|---|---|
| crates/common/trie/trie_sorted.rs | Adds debug timing around flush_nodes_to_write and uses available_parallelism() to size the thread pool dynamically; no functional changes observed. |
| crates/networking/p2p/sync/healing/state.rs | Pipelines DB writes (2 in-flight) and avoids overwriting encoded nodes via BTreeMap::entry; however, current JoinSet draining logic does not surface task panics, risking silent DB write failures. |
| crates/networking/p2p/sync/healing/storage.rs | Pipelines storage healing DB writes and switches to sync Store::write_storage_trie_nodes_batch_sync in blocking tasks; uses BTreeMap::entry to avoid overwrites; changes look correct. |
| crates/networking/p2p/sync/snap_sync.rs | Merges the temp-DB scan into the trie build pass and adds timing logs, but now buffers all code hashes in-memory until after trie build, which can cause large memory growth/OOM on big snapshots. |
| crates/storage/store.rs | Extracts shared storage trie write logic and adds a sync batch writer to avoid async/blocking thread hops; behavior remains equivalent (commit at end). |
| crates/storage/trie.rs | Optimizes put_batch to avoid per-key Vec allocation and adds a streaming put_batch_no_alloc override; also adds debug timing. Main behavior is preserved. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Snap as snap_sync::insert_accounts
    participant TmpDB as RocksDB(temp accounts)
    participant Trie as trie_sorted::trie_from_sorted_accounts_wrap
    participant TrieDB as BackendTrieDB(TrieDB)
    participant HealS as healing::state::heal_state_trie
    participant HealSt as healing::storage::heal_storage_trie
    participant Store as storage::Store
    Snap->>TmpDB: ingest_external_file(SSTs)
    Snap->>TmpDB: full_iterator(Start)
    Snap->>Trie: build trie from sorted iterator
    Trie->>TrieDB: flush_nodes_to_write()
    TrieDB->>TrieDB: put_batch_no_alloc(nodes)
    TrieDB->>TrieDB: begin_write
    loop for each node
        TrieDB->>TrieDB: encode_to_vec
        TrieDB->>TrieDB: tx.put(cf, key, value)
    end
    TrieDB->>TrieDB: tx.commit
    Snap->>Snap: collect code_hashes during inspect()
    Snap->>Store: code_hash_collector.add/flush (after trie build)
    HealS->>HealS: collect nodes_to_write
    HealS->>HealS: spawn_blocking (encode + db write)
    HealS->>Store: open_direct_state_trie
    Store->>TrieDB: db.put_batch(encoded)
    HealSt->>HealSt: drain nodes_to_write per account
    HealSt->>HealSt: spawn_blocking (encode + db write)
    HealSt->>Store: write_storage_trie_nodes_batch_sync(updates)
    Store->>Store: begin_write
    Store->>Store: txn.put/delete (per node)
    Store->>Store: txn.commit
```
```diff
@@ -976,6 +969,9 @@ async fn insert_accounts(
         .account_tries_inserted
         .fetch_add(1, Ordering::Relaxed);
     let account_state = AccountState::decode(v).expect("We should have accounts here");
+    if account_state.code_hash != *EMPTY_KECCACK_HASH {
+        code_hashes.push(account_state.code_hash);
+    }
```
Unbounded code hash buffer
insert_accounts now collects all non-empty code_hash values into code_hashes: Vec<H256> and only flushes after trie_from_sorted_accounts_wrap finishes. On large snapshots this can grow to millions of hashes and OOM before the trie build completes. This used to be bounded by flush_if_needed() during the scan. Consider flushing incrementally during the iterator pass (e.g., push+flush when the collector is full) or using a bounded buffer.
Fixed in 08ba6f92c. Removed the intermediate Vec<H256> entirely — code hashes are now added directly into CodeHashCollector's internal HashSet during the .inspect() callback. The HashSet deduplicates, so memory is bounded by unique contract accounts (~5M on mainnet at 32 bytes each = ~160MB) instead of growing unboundedly with duplicates.
flush_if_needed() is called once after the trie build completes. We can't call it during the iterator because trie_from_sorted_accounts_wrap is synchronous and flush_if_needed() is async.
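The dedup-bounds-memory argument can be illustrated with a plain `HashSet` (the `CodeHash` alias and the counts here are made up for the sketch; the real collector is ethrex's `CodeHashCollector`): a `Vec` grows with every account, while a set is bounded by the number of unique hashes.

```rust
use std::collections::HashSet;

// Simplified stand-in for a 32-byte code hash.
type CodeHash = [u8; 32];

fn main() {
    // Simulate many accounts sharing a handful of contract codes.
    let unique: Vec<CodeHash> = (0u8..10).map(|i| [i; 32]).collect();
    let mut vec_buffer: Vec<CodeHash> = Vec::new();
    let mut set_buffer: HashSet<CodeHash> = HashSet::new();
    for n in 0..1_000 {
        let h = unique[n % unique.len()];
        vec_buffer.push(h); // grows with every account seen
        set_buffer.insert(h); // bounded by unique hashes
    }
    assert_eq!(vec_buffer.len(), 1_000);
    assert_eq!(set_buffer.len(), 10);
}
```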
```diff
 let store = store.clone();
-// NOTE: we keep only a single task in the background to avoid out of order deletes
-if !db_joinset.is_empty() {
+// Allow up to 2 tasks in flight to overlap encoding with DB commit
+while db_joinset.len() >= 2 {
     db_joinset
         .join_next()
         .await
-        .expect("we just checked joinset is not empty")?;
+        .expect("joinset is not empty")?;
 }
```
JoinSet errors dropped
This loop uses join_next().await.expect(...)?; but the spawned closure returns (), so task panics/errors are effectively ignored (and ? here can only be applying to the Option, not the task result). That means DB write panics can be silently swallowed and healing can proceed as if writes succeeded. You likely want to match join_next().await and handle Err(JoinError) explicitly (and propagate/abort on panic).
I think the suggestion from the bot is wrong, but I would check the types of the task to make it return errors.
You're correct — my comment was wrong.
The spawned closure does return Result<(), SyncError> (you can see Ok(()) at the end of the closure). The JoinSet type is JoinSet<Result<(), SyncError>>, so join_next().await returns:

- `Some(Ok(Ok(())))` — task succeeded, closure returned Ok
- `Some(Ok(Err(e)))` — task succeeded, closure returned Err
- `Some(Err(JoinError))` — task panicked/cancelled
The PR properly handles all three cases via the wait_for_pending_task() helper added in healing/mod.rs (lines 22-28 of the diff), which matches on the result and propagates errors correctly. This is actually a fix — before the PR, the closures used .expect() which would panic instead of returning errors.
The error handling is correct as-is. My apologies for the noise!
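The three-way outcome above is not tokio-specific; a minimal `std::thread` analogy shows the same nesting, where `JoinHandle::join` plays the role of `join_next` (Err on panic) and the closure's own `Result` sits inside `Ok`. The `run` helper and its return strings are illustrative only.

```rust
use std::thread;

// join() returns Err on panic (analogous to JoinError); the closure's own
// Result<(), String> is nested inside the Ok case.
fn run(task: fn() -> Result<(), String>) -> &'static str {
    match thread::spawn(task).join() {
        Ok(Ok(())) => "task ok",    // closure returned Ok
        Ok(Err(_)) => "task error", // closure returned Err
        Err(_) => "task panicked",  // panic surfaced as a join error
    }
}

fn main() {
    assert_eq!(run(|| Ok(())), "task ok");
    assert_eq!(run(|| Err("db write failed".into())), "task error");
    assert_eq!(run(|| panic!("boom")), "task panicked");
}
```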
Pull request overview
Optimizes snap-sync and healing write hot paths to reduce allocations, redundant iteration, and thread scheduling overhead while rebuilding / repairing the Ethereum state trie backed by RocksDB.
Changes:
- Eliminates per-key `Vec` allocations in trie batch writes and adds a non-allocating `put_batch_no_alloc` fast path for `BackendTrieDB`.
- Collapses `insert_accounts` from a two-pass temp-DB scan into a single pass during trie building, with added timing instrumentation.
- Improves healing write throughput by allowing up to 2 in-flight DB write tasks, deduping parent-delete markers via `BTreeMap::entry`, and removing async `block_on` bridging for storage healing writes.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/storage/trie.rs | Removes per-key batch allocations and adds put_batch_no_alloc override; adds timing logs. |
| crates/storage/store.rs | Adds sync storage-trie batch write path and shared inner implementation to avoid extra thread hops. |
| crates/common/trie/trie_sorted.rs | Adds flush timing and uses available_parallelism() for thread pool sizing. |
| crates/networking/p2p/sync/snap_sync.rs | Single-pass account insertion (collect code hashes during trie build) + timing logs. |
| crates/networking/p2p/sync/healing/state.rs | Allows 2 in-flight write tasks, fixes parent-delete overwrites via entry, adds timing logs. |
| crates/networking/p2p/sync/healing/storage.rs | Allows 2 in-flight write tasks, fixes parent-delete overwrites via entry, switches to sync DB write, adds timing logs. |
```rust
let start = std::time::Instant::now();
let mut code_hashes: Vec<H256> = Vec::new();
let iter = db.full_iterator(rocksdb::IteratorMode::Start);
```
code_hashes collects every non-empty account code hash into a Vec during the trie-build pass. On large networks this can grow very large (and includes duplicates), increasing peak RSS and risking OOM compared to the previous streaming flush.
Consider avoiding the intermediate Vec by either (a) adding hashes directly into CodeHashCollector during .inspect() (no await), and flushing/spawning file dumps in a bounded way after the trie build, or (b) sending hashes to a background writer via a channel so the buffer can be flushed incrementally while the trie is being built.
```rust
if !db_joinset.is_empty() {
    // Allow up to 2 tasks in flight to overlap encoding with DB commit
    while db_joinset.len() >= 2 {
        db_joinset.join_next().await;
```
The join_next().await result is ignored here. If a background DB write task panics or returns an error, this will be silently dropped and healing may continue/finish without persisting data.
Handle the returned Option<Result<_, JoinError>> (and propagate/log failures) when draining the JoinSet, similar to how state healing uses expect(...) + ? on join_next().
```diff
-db_joinset.join_next().await;
+match db_joinset.join_next().await {
+    Some(Ok(_)) => {
+        // successfully completed background DB write task
+    }
+    Some(Err(e)) => {
+        error!(?e, "background DB write task failed during storage healing");
+    }
+    None => {
+        // no more tasks to join
+        break;
+    }
+}
```
Fixed in 08ba6f92c. The join_next() result is now checked and JoinError is propagated as SyncError::JoinHandle(e), which will surface task panics instead of silently dropping them.
Re: 8. Code Duplication: The shared part between …

Extracting a shared function for a 4-line loop that operates on different input/output types would add a function boundary and import without meaningful deduplication. Keeping them inline is simpler and more readable.
The original single-task constraint exists to prevent out-of-order deletes: batch N generates empty markers for ancestor paths of its committed nodes, and batch N+1 may write real node data at those same ancestor paths. If RocksDB serializes batch N's commit after batch N+1's, the empty marker overwrites the valid branch node, corrupting the trie.
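Within a single batch, the `BTreeMap` entry API mentioned in the PR description guards against the related overwrite hazard. This is a hypothetical sketch (simplified key/value types, not the real trie node encoding) where an empty parent-delete marker must not clobber real node bytes already queued at the same path:

```rust
use std::collections::BTreeMap;

// or_insert_with only writes the empty marker if nothing is queued yet,
// so real node data already at this path is preserved.
fn queue_marker(batch: &mut BTreeMap<Vec<u8>, Vec<u8>>, path: Vec<u8>) {
    batch.entry(path).or_insert_with(Vec::new);
}

fn main() {
    let mut batch = BTreeMap::new();
    batch.insert(b"ab".to_vec(), b"real node data".to_vec());
    queue_marker(&mut batch, b"ab".to_vec()); // no-op: real data wins
    queue_marker(&mut batch, b"cd".to_vec()); // inserts empty marker
    assert_eq!(batch[&b"ab".to_vec()], b"real node data".to_vec());
    assert!(batch[&b"cd".to_vec()].is_empty());
}
```

A plain `insert` at the marker site would have replaced the real node bytes with the empty placeholder, which is the bug the entry API avoids.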
Re: 1. Storage healing: Potential data race with concurrent writes

Good catch — this is a real race condition. Here's the concrete scenario: …

The original comment ("avoid out of order deletes") was there for exactly this reason. Reverted to single-task writes in …
…rror handling

- Remove intermediate Vec<H256> for code hashes in insert_accounts. Instead, add hashes directly into CodeHashCollector's HashSet during the .inspect() callback. The HashSet deduplicates, bounding memory to unique contract accounts (~5M on mainnet = ~160MB) instead of growing with duplicates. flush_if_needed() is called once after the sync trie build completes.
- Handle JoinSet errors in storage healing: propagate JoinError from db_joinset.join_next() instead of silently dropping task panics.
@greptile-apps review again
Greptile Overview

**Greptile Summary**

This PR focuses on reducing snap-sync/healing write-path overhead by removing avoidable allocations and iterations, and adding lightweight timing instrumentation. Key changes:
Issue to fix before merge:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/common/trie/trie_sorted.rs | Adds timing debug logs to flushes and uses available_parallelism() for thread pool size; no functional changes found. |
| crates/networking/p2p/sync/healing/state.rs | Fixes parent-path deletion markers using BTreeMap::entry, adds timing, but still drops JoinSet task errors on join_all() at completion/stale exits. |
| crates/networking/p2p/sync/healing/storage.rs | Switches storage batch encoding to dedup via BTreeMap and uses sync DB write; however db_joinset.join_all().await ignores possible JoinErrors from DB tasks. |
| crates/networking/p2p/sync/snap_sync.rs | Merges account scan passes and adds timing; code hash collection now happens during trie build with a single flush afterward (per prior thread fix). |
| crates/storage/store.rs | Extracts shared storage trie node write logic and adds a synchronous variant to avoid block_on inside spawn_blocking; no correctness issues found. |
| crates/storage/trie.rs | Optimizes BackendTrieDB::put_batch to avoid per-key Vec alloc and adds a non-allocating batch write override with timing logs; behavior preserved. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant HealState as heal_state_trie/heal_storage_trie
    participant JS as db_joinset
    participant Block as spawn_blocking
    participant Store as Store
    participant DB as TrieDB/BackendTrieDB
    loop Healing loop
        HealState->>HealState: accumulate nodes_to_write
        alt flush threshold OR done OR stale
            opt previous task exists
                HealState->>JS: join_next().await
                JS-->>HealState: Result<(), JoinError>
            end
            HealState->>Block: spawn_blocking(encode + write)
            Block->>Store: open trie / begin_write
            Store->>DB: put_batch / write_storage_trie_nodes_batch_sync
            DB-->>Store: commit
            Store-->>Block: Ok
            Block-->>JS: task completes
        end
    end
    alt exit (done/stale)
        HealState->>JS: join_all().await
        JS-->>HealState: "JoinErrors currently ignored"
    end
```
Additional Comments (2)
**Errors from DB tasks ignored** (crates/networking/p2p/sync/healing/storage.rs:250-252)

`db_joinset.join_all().await;` (and the equivalent on the stale path) drops `JoinError`s from the `spawn_blocking` DB write tasks, so a panic in the write closure can be silently ignored and healing can return `Ok(true/false)` as if data was committed. This should propagate/abort on `JoinError` (at least panic) similarly to the earlier `join_next()` handling.

**JoinSet results are dropped** (crates/networking/p2p/sync/healing/state.rs:323-326)

`db_joinset.join_all().await;` discards `JoinError`s from the `spawn_blocking` batch write tasks. If any DB write panics, state healing can still complete and return `Ok(paths.is_empty())` with missing data. Consider iterating `join_next().await` and returning an error on `Err(JoinError)` (and/or logging panics) so write failures don’t get swallowed.
…in healing state.rs used .expect().? while storage.rs used pattern matching — now both use a shared helper that eliminates the .expect() panic path.
…awn_blocking closures

- Closures now return Result<(), SyncError> instead of panicking on DB errors
- Replace join_all() with drain_pending_tasks() that checks each task result
- Skip spawning no-op tasks when the write buffer is empty
Addressed the remaining concern about
Commits: …
Re: Greptile's two additional comments about … Both are fixed in commits …
@greptile-apps review again
Greptile Overview

**Greptile Summary**

This PR optimizes snap-sync and healing write hot paths by (1) removing per-key Vec allocations in …

Overall the changes fit cleanly into the existing storage/trie abstractions and the healing pipeline (single in-flight DB write), and the added tracing helps validate perf wins in production runs. The main merge blockers are remaining …

**Confidence Score: 3/5**
| Filename | Overview |
|---|---|
| crates/common/trie/trie_sorted.rs | Adds debug timing around flush_nodes_to_write and switches trie build ThreadPool size to available_parallelism(). |
| crates/networking/p2p/sync/healing/mod.rs | Introduces JoinSet helpers wait_for_pending_task/drain_pending_tasks for propagating DB write task errors. |
| crates/networking/p2p/sync/healing/state.rs | Changes healing DB write batching to use BTreeMap entry markers, adds instrumentation, and moves JoinSet handling to shared helpers; still does sync decoding inside iterator with panics/ignores. |
| crates/networking/p2p/sync/healing/storage.rs | Fixes parent-marker overwrite via BTreeMap entry API, switches to sync Store write inside spawn_blocking, and adopts JoinSet helpers for error propagation. |
| crates/networking/p2p/sync/snap_sync.rs | Merges account scan passes by collecting code hashes during trie build; introduces debug timing but adds unconditional expect()s on RocksDB and RLP decode. |
| crates/storage/store.rs | Extracts storage trie node write inner loop and adds a synchronous write_storage_trie_nodes_batch_sync variant used by healing. |
| crates/storage/trie.rs | Optimizes BackendTrieDB put_batch to avoid per-key Vec allocations and adds put_batch_no_alloc override with debug timing; still clones Nibbles per node. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Snap as snap_sync::insert_accounts
    participant TempDB as RocksDB temp accounts DB
    participant Trie as trie_from_sorted_accounts_wrap
    participant Collector as CodeHashCollector
    participant StateDB as BackendTrieDB
    participant Heal as healing::{state,storage}
    participant JS as JoinSet
    Snap->>TempDB: ingest_external_file(snapshots)
    Snap->>TempDB: full_iterator(Start)
    Snap->>Trie: build trie from sorted accounts
    Trie-->>Snap: iter.inspect(decode AccountState)
    Snap->>Collector: add(code_hash)
    Snap->>StateDB: put_batch_no_alloc(nodes)
    Snap->>Collector: flush_if_needed() (after trie build)
    Heal->>JS: spawn_blocking(DB write task)
    Heal->>JS: wait_for_pending_task(join_next)
    Heal->>StateDB: put_batch(encoded nodes)
    Heal->>JS: drain_pending_tasks() on done/stale
```
```rust
    .account_tries_inserted
    .fetch_add(1, Ordering::Relaxed);
let account_state = AccountState::decode(v).expect("We should have accounts here");
```
Panics on bad input
insert_accounts now hard-panics on any temp RocksDB iterator error (.expect(...)) and on any malformed account value (AccountState::decode(v).expect(...)). These are external inputs (snapshot files / DB iterator) and can be corrupted; panicking will crash the sync process instead of returning SyncError like the rest of this function does.
Consider propagating these as SyncError::RocksDBError / SyncError::Rlp instead of using expect inside the iterator (map_err(...)?).
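A hedged sketch of the suggested expect-to-Result conversion (`SyncErr` and `decode_account` are illustrative stand-ins, not the real `SyncError` or RLP decoder): collecting an iterator of `Result`s into `Result<Vec<_>, _>` propagates the first failure instead of panicking mid-iteration.

```rust
#[derive(Debug)]
enum SyncErr {
    Decode(String),
}

// Toy decoder: treats an 8-byte big-endian buffer as a valid "account".
fn decode_account(raw: &[u8]) -> Result<u64, SyncErr> {
    if raw.len() == 8 {
        Ok(u64::from_be_bytes(raw.try_into().unwrap()))
    } else {
        Err(SyncErr::Decode("bad account encoding".into()))
    }
}

// collect::<Result<Vec<_>, _>> stops at the first error and returns it,
// instead of crashing the whole pipeline with an expect() panic.
fn decode_all(items: &[Vec<u8>]) -> Result<Vec<u64>, SyncErr> {
    items.iter().map(|v| decode_account(v)).collect()
}

fn main() {
    let good = vec![1u64.to_be_bytes().to_vec(), 2u64.to_be_bytes().to_vec()];
    assert_eq!(decode_all(&good).unwrap(), vec![1, 2]);
    let bad = vec![vec![0u8; 3]];
    assert!(decode_all(&bad).is_err());
}
```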
This absolutely shouldn't happen. The only scenario is where the user has a faulty file system.
```rust
// Re-export shared types for external use
#[allow(unused_imports)]
pub use types::{HealingQueueEntry, StateHealingQueue};

/// Waits for a pending task in the JoinSet to complete, propagating any error.
/// Used to ensure only a single background DB write is in flight at a time.
async fn wait_for_pending_task(
```
Incorrect JoinSet None case
In wait_for_pending_task, the None arm is treated as success (Some(Ok(Ok(()))) | None => {}), but None from join_next().await can only happen when the JoinSet is empty. Since this function only calls join_next() after checking !joinset.is_empty(), hitting None indicates a logic invariant violation and would silently drop a pending write.
It’s safer to handle None as unreachable!()/error (or remove it) so a broken invariant doesn’t get masked.
join_next() cannot return None after an is_empty() check, so treat it as an invariant violation rather than silently ignoring it.
Re: Greptile's two latest comments:

1. Panics on bad input in …: These are pre-existing — line 969 already has a …
2. Incorrect JoinSet …: Valid point. Fixed in commit …
@greptile-apps review it again
Greptile Overview

**Greptile Summary**

This PR optimizes snap sync’s hottest write paths by (1) removing per-key Vec allocations in …

The changes fit into the codebase’s existing architecture by keeping trie insertion logic within the …

**Confidence Score: 3/5**
| Filename | Overview |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Merged temp-DB scan into trie-build pass and added timings; still contains expect() panics on RocksDB iterator errors and RLP decode in the new iterator pipeline. |
| crates/storage/trie.rs | Optimized trie batch write path by removing per-key Vec allocation, added put_batch_no_alloc override, and added debug timing; change is localized and preserves commit semantics. |
| crates/storage/store.rs | Extracted shared storage-trie write inner loop and added synchronous wrapper for use inside spawn_blocking, reducing thread-hopping without altering write/delete behavior. |
| crates/networking/p2p/sync/healing/mod.rs | Introduced JoinSet helpers to wait/drain background DB tasks with proper error propagation; None case is now treated as unreachable after is_empty check. |
| crates/networking/p2p/sync/healing/state.rs | Fixed parent-marker overwrite bug using entry().or_insert, avoided spawning no-op DB tasks, added timing, and switched to helper-based JoinSet error handling. |
| crates/networking/p2p/sync/healing/storage.rs | Applied same parent-marker overwrite fix via BTreeMap entry, switched to synchronous store write inside spawn_blocking, and adopted shared JoinSet draining with error propagation. |
| crates/common/trie/trie_sorted.rs | Added debug timing for flush_nodes_to_write and replaced hardcoded thread count with available_parallelism() fallback; no functional trie logic changes. |
| CHANGELOG.md | Added perf changelog entry describing snap sync insertion/healing write-path optimizations. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Snap as snap_sync::insert_accounts
    participant Rocks as Temp RocksDB iterator
    participant Trie as trie_sorted::trie_from_sorted_accounts_wrap
    participant DB as BackendTrieDB/Store
    participant HealS as healing::state
    participant HealSt as healing::storage
    participant JS as JoinSet<Result<(), SyncError>>
    Snap->>Rocks: full_iterator(Start)
    Snap->>Trie: trie_from_sorted_accounts_wrap(db, iter.map/inspect)
    loop for each (k,v)
        Rocks-->>Trie: (key,value)
        Trie-->>Snap: inspect(k,v)
        Snap->>Snap: AccountState::decode(v)
        Snap->>Snap: code_hash_collector.add(code_hash)
        Snap->>Snap: storage_accounts insert(storage_root)
    end
    Trie-->>Snap: compute_state_root
    Snap->>Snap: code_hash_collector.flush_if_needed().await
    par Healing state writes
        HealS->>JS: wait_for_pending_task(join_next)
        HealS->>JS: spawn_blocking(encode + db.put_batch)
        JS->>DB: open_direct_state_trie + put_batch + commit
        HealS->>JS: drain_pending_tasks()
    and Healing storage writes
        HealSt->>JS: wait_for_pending_task(join_next)
        HealSt->>JS: spawn_blocking(encode + write_storage_trie_nodes_batch_sync)
        JS->>DB: begin_write + put/delete + commit
        HealSt->>JS: drain_pending_tasks()
    end
```
```diff
@@ -976,6 +972,9 @@ async fn insert_accounts(
         .account_tries_inserted
         .fetch_add(1, Ordering::Relaxed);
     let account_state = AccountState::decode(v).expect("We should have accounts here");
```
Panics on iterator/rlp decode
insert_accounts still has expect(...) calls inside the RocksDB iterator (.map(|k| k.expect(...)) and AccountState::decode(v).expect(...)). These are external inputs (temp RocksDB iterator + snapshot content), so a single iterator error or malformed RLP will crash the whole sync instead of returning a SyncError like the rest of the function does. This should be converted to map_err(...)? / Result propagation so snap sync can fail gracefully with a useful error.
Benchmark Block Execution Results: Comparison Against Main
```diff
 }
 scope(|s| {
-    let pool = ThreadPool::new(12, s);
+    let thread_count = std::thread::available_parallelism()
```
This should be at least more than 2. If there is only a single thread it will wait on the write buffers getting freed.
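A minimal sketch of the suggested floor (the fallback of 12 mirrors the old hardcoded pool size; the exact shape of the real ethrex code is an assumption):

```rust
use std::thread;

// Size the pool from available_parallelism with a floor of 2, per the review
// note: a single worker would stall waiting on write buffers getting freed.
fn pool_size() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(12) // fallback mirroring the old hardcoded pool size
        .max(2) // never drop below two workers
}

fn main() {
    assert!(pool_size() >= 2);
}
```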
```rust
/// Drains all pending tasks from the JoinSet, propagating the first error encountered.
async fn drain_pending_tasks(
    joinset: &mut JoinSet<Result<(), SyncError>>,
```
When drain_pending_tasks encounters the first error, it returns immediately and remaining tasks stay in the JoinSet. On drop, those tasks get cancelled mid-execution. Since WriteBatch::commit is atomic at the RocksDB level, partial writes shouldn't happen — but any tasks that haven't called commit yet will lose their work silently.
Is that acceptable here? The caller (healing loop) presumably retries from scratch on error, so losing in-flight work is fine. Just want to confirm that's the intent, since the old join_all would have waited for all tasks to complete (though it didn't propagate errors).
```rust
let account_count = to_write.len();
let mut encoded_to_write = vec![];
for (hashed_account, nodes) in to_write {
    let mut account_nodes = std::collections::BTreeMap::new();
```
nit: `std::collections::BTreeMap::new()` — this could use a `use std::collections::BTreeMap;` import at the top of the closure or file, matching the pattern in state.rs where `BTreeMap` is already imported.
```rust
let start = std::time::Instant::now();
// We collect code hashes directly into the collector's HashSet during the trie
// build pass. The collector deduplicates, so memory is bounded by unique contract
// accounts (~5M on mainnet = ~160MB). We can't call flush_if_needed() here because
```
The comment says ~5M entries = ~160MB, but HashSet<H256> has per-entry overhead beyond the raw 32-byte key — typically ~72 bytes/entry on 64-bit (key + hash + bucket metadata + load factor headroom). At 5M entries that's closer to 350-400MB peak during the synchronous trie build.
This might be fine for the target hardware running snap sync, but the comment understates actual memory usage. Worth either correcting the estimate or noting that it's the raw key size only.
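The two estimates come down to quick arithmetic. A sketch, where the 72 bytes/entry figure is the reviewer's assumed per-entry `HashSet<H256>` overhead, not a measured value:

```rust
// Integer-MiB estimate for a HashSet holding `entries` items at an assumed
// `bytes_per_entry` cost (raw key size, or key plus hash/bucket overhead).
fn hashset_mib(entries: u64, bytes_per_entry: u64) -> u64 {
    entries * bytes_per_entry / (1024 * 1024)
}
```

With 5M entries, raw 32-byte keys give about 152 MiB (the comment's "~160MB"), while the assumed 72 bytes/entry gives about 343 MiB, in line with the 350-400MB peak mentioned above.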
Motivation
During snap sync, the node downloads the entire Ethereum state (accounts, storage slots, bytecodes) and rebuilds the state trie locally. This process involves writing tens of millions of trie nodes to RocksDB. Profiling revealed several bottlenecks in the write path that caused unnecessary memory allocations, redundant iterations, and thread scheduling overhead. These optimizations target the hot paths to reduce wall-clock time for both initial state insertion and the subsequent healing phase.
Bottleneck Summary
| Hot path | Bottleneck |
| --- | --- |
| `BackendTrieDB::put_batch` | Allocates a `Vec` per key (~20k allocs per flush) |
| `insert_accounts` (rocksdb) | Iterates the temp DB twice, decoding every account in both passes |
| `put_batch_no_alloc` default impl | Collects all encoded nodes into an intermediate `Vec` |
| Healing parent markers | `insert` overwrites real node data with empty markers |
| Storage healing writes | `block_on` double thread hop |
| Healing background tasks | `.expect()` panics instead of returning errors; `join_all()` swallows task failures |
| `trie_from_sorted_accounts_wrap` | Hardcoded 12-thread pool |
Phase 0: Instrumentation
Added `tracing::debug!` timing to key code paths for ongoing performance monitoring. No correctness changes.

- Step 0.1 — Time `BackendTrieDB::put_batch` and `put_batch_no_alloc` (`crates/storage/trie.rs`)
- Step 0.2 — Time `flush_nodes_to_write` (`crates/common/trie/trie_sorted.rs`)
- Step 0.3 — Time healing batch writes (`crates/networking/p2p/sync/healing/state.rs`, `crates/networking/p2p/sync/healing/storage.rs`), including the `spawn_blocking` portion
- Step 0.4 — Time `insert_accounts` (`crates/networking/p2p/sync/snap_sync.rs`)

Phase 1: High-Impact Fixes
Step 1.1 — Fix `BackendTrieDB::put_batch` per-key allocation (`crates/storage/trie.rs`)

- Before: each key-value pair was written via `tx.put_batch(table, vec![(key, value)])`, allocating a new `Vec` for every single key-value pair. With ~20,000 nodes per flush, this meant ~20,000 unnecessary heap allocations per flush cycle.
- After: the loop calls `tx.put(table, &key, &value)` directly, which calls `batch.put_cf` on the RocksDB `WriteBatch` without any intermediate allocation. The `StorageWriteBatch` trait already had a `put` method, and `RocksDBWriteTx` already overrides it to call `self.batch.put_cf` directly.
- Impact: eliminates ~20k `Vec` allocations per flush. With multiple flushes per sync, this removes millions of allocations over a full sync.
Step 1.2 — Override `put_batch_no_alloc` on `BackendTrieDB` (`crates/storage/trie.rs`)

- Before: the default implementation in the `TrieDB` trait (`crates/common/trie/db.rs`) encodes all nodes via `.map(|node| (path.clone(), node.encode_to_vec())).collect()`, building a full `Vec<(Nibbles, Vec<u8>)>` in memory, then passes it to `put_batch`, which previously allocated another `Vec` per key.
- After: an override on `BackendTrieDB` opens a single write transaction, iterates over the `&[(Nibbles, Node)]` slice, encodes each node and writes it directly via `tx.put()` in one pass, then commits once. No intermediate collection.
- Impact: avoids the intermediate `Vec` allocation (which at 20k nodes * ~100 bytes = ~2MB per flush) and replaces the encode-then-collect-then-write pipeline with encode-and-write-immediately.
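The difference between the two write paths can be sketched with a mock batch standing in for RocksDB's `WriteBatch` (all types here are illustrative, not the real `RocksDBWriteTx`/`StorageWriteBatch`):

```rust
// Stand-in for a RocksDB WriteBatch: appends (key, value) pairs.
struct MockBatch {
    writes: Vec<(Vec<u8>, Vec<u8>)>,
}

impl MockBatch {
    fn new() -> Self {
        MockBatch { writes: Vec::new() }
    }
    // Equivalent of batch.put_cf: no intermediate allocation.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.writes.push((key.to_vec(), value.to_vec()));
    }
    // Old entry point: a wrapper Vec was allocated per key just to call this.
    fn put_batch(&mut self, pairs: Vec<(Vec<u8>, Vec<u8>)>) {
        for (k, v) in pairs {
            self.writes.push((k, v));
        }
    }
}

// Old path: one Vec allocation per key-value pair (~20k per flush).
fn flush_old(batch: &mut MockBatch, nodes: &[(Vec<u8>, Vec<u8>)]) {
    for (k, v) in nodes {
        batch.put_batch(vec![(k.clone(), v.clone())]);
    }
}

// New path: write each pair directly, no wrapper Vec.
fn flush_new(batch: &mut MockBatch, nodes: &[(Vec<u8>, Vec<u8>)]) {
    for (k, v) in nodes {
        batch.put(k, v);
    }
}
```

Both paths produce identical writes; only the per-key allocation disappears.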
Step 1.3 — Merge two-pass iteration in `insert_accounts` (`crates/networking/p2p/sync/snap_sync.rs`)

- Before: the `insert_accounts` function iterated over all accounts in the temporary RocksDB twice: once to collect code hashes, and again to build the trie. Both passes decoded `AccountState` from RLP. For mainnet with ~250M+ accounts, this doubled the I/O and CPU cost.
- After: code hashes are collected into the `CodeHashCollector`'s internal `HashSet` during the `.inspect()` callback of the trie build pass, which already decodes `AccountState`. The `HashSet` deduplicates, so memory is bounded by unique contract accounts (~5M on mainnet = ~160MB). After `trie_from_sorted_accounts_wrap` returns, remaining code hashes are flushed via `flush_if_needed()`.
- Trade-off: code hashes are held in memory for the duration of the trie build (we can't call `flush_if_needed()` inside the synchronous iterator). This slightly delays bytecode download start but eliminates a full scan of the temp DB (~250M+ reads).
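The single-pass shape can be illustrated with `Iterator::inspect`: the callback side-collects code hashes while the same iteration drives the trie build. Everything here is a toy stand-in — accounts are `(address_hash, code_hash)` pairs and the "trie build" is a fold, whereas the real code uses `AccountState` and `CodeHashCollector`:

```rust
use std::collections::HashSet;

// One pass over sorted accounts: inspect() side-collects code hashes while
// the same iteration feeds the trie build (a toy XOR fold here).
fn build_trie_single_pass(accounts: &[(u64, u64)]) -> (u64, HashSet<u64>) {
    let mut code_hashes = HashSet::new();
    let trie_root = accounts
        .iter()
        .inspect(|(_, code_hash)| {
            // The set deduplicates, so memory is bounded by unique hashes.
            code_hashes.insert(*code_hash);
        })
        .fold(0u64, |acc, (addr, _)| acc ^ addr);
    (trie_root, code_hashes)
}
```

One iteration, one decode per account, and the collector fills up as a side effect of the build pass.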
Phase 2: Medium-Impact Fixes

Step 2.1 — Remove `block_on` bridging in storage healing (`crates/storage/store.rs`, `crates/networking/p2p/sync/healing/storage.rs`)

- Before: the write path was `spawn_blocking` closure → `spawned_rt::tasks::block_on` → `async write_storage_trie_nodes_batch` → `tokio::task::spawn_blocking` (internally). This created a double thread hop: the outer `spawn_blocking` moves work to a blocking thread, then `block_on` parks that thread waiting for the async function, which itself spawns another blocking task. Two context switches for no benefit.
- After: added `write_storage_trie_nodes_batch_sync` to `Store`, which shares the same inner logic (extracted to `write_storage_trie_nodes_inner`) but runs entirely synchronously. The healing code now calls this sync version directly from its `spawn_blocking` closure.
Step 2.2 — Use the `entry` API to fix a healing data-loss bug (`crates/networking/p2p/sync/healing/state.rs`, `crates/networking/p2p/sync/healing/storage.rs`)

- Before: `insert(path.slice(0, i), vec![])` was used to mark parent nodes for deletion. However, `BTreeMap::insert` always overwrites. If a real node at path P was processed before a longer path P' (where P is an ancestor prefix of P'), the parent-marker loop would overwrite the real node data at P with an empty `vec![]`, silently deleting a valid node. This caused extra healing cycles to re-download those nodes.
- After: `entry(path.slice(0, i)).or_insert(vec![])` only inserts the empty placeholder if no entry exists yet, preserving any real node data already stored for that path. In storage healing, the intermediate `Vec` of tuples was replaced with a `BTreeMap` to enable the same deduplication.
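The fix in isolation, on a plain `BTreeMap` with paths shortened to byte vectors for the sketch (the real code keys on `Nibbles`):

```rust
use std::collections::BTreeMap;

// Mark every ancestor prefix of `path` for deletion, without clobbering
// real node data already stored at one of those prefixes.
fn mark_parents(nodes: &mut BTreeMap<Vec<u8>, Vec<u8>>, path: &[u8]) {
    for i in 0..path.len() {
        // Was: nodes.insert(path[..i].to_vec(), vec![]);  // overwrites real data!
        nodes.entry(path[..i].to_vec()).or_insert_with(Vec::new);
    }
}
```

A node stored at an ancestor prefix before the longer path arrives now survives the marker loop instead of being replaced by an empty placeholder.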
Step 2.3 — Proper error propagation in healing background tasks (`crates/networking/p2p/sync/healing/mod.rs`, `crates/networking/p2p/sync/healing/state.rs`, `crates/networking/p2p/sync/healing/storage.rs`)

- Before: the `spawn_blocking` closures used `.expect()` for DB errors, which panics instead of returning a recoverable error, so a transient DB failure would crash the entire sync process. `join_all()` was used to drain remaining tasks at loop exit; if a task panicked (from `.expect()`), `join_all()` propagates the panic rather than returning a `Result`, with no way to handle the error gracefully. state.rs used `.expect("...")` while storage.rs used pattern matching.
- After: closures return `Result<(), SyncError>` with `?` instead of `.expect()`. Both `StoreError` and `TrieError` convert to `SyncError` via existing `From` impls. Added to `healing/mod.rs`: `wait_for_pending_task()` (waits for one in-flight task before spawning the next) and `drain_pending_tasks()` (drains all remaining tasks at loop exit); both propagate task errors and `JoinError` as `Result`. Also added an `if !to_write.is_empty()` guard to skip spawning no-op tasks when the buffer is empty (happens when `is_done`/`is_stale` triggers a flush with nothing to write).
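The drain pattern can be sketched with std threads instead of tokio (the real helper is async over a `JoinSet` and maps `JoinError`; this analogue maps thread panics, and `SyncError` is a simplified stand-in):

```rust
use std::thread::JoinHandle;

// Simplified stand-in for the real SyncError.
#[derive(Debug, PartialEq)]
enum SyncError {
    Db(String),
    TaskFailed,
}

// Drain every pending task, propagating the first error encountered.
// Remaining handles are dropped once an error is returned, mirroring how
// the async version leaves in-flight tasks to be cancelled on drop.
fn drain_pending_tasks(
    handles: Vec<JoinHandle<Result<(), SyncError>>>,
) -> Result<(), SyncError> {
    for handle in handles {
        // Outer `?` maps a panicked task; inner `?` propagates the task's own error.
        handle.join().map_err(|_| SyncError::TaskFailed)??;
    }
    Ok(())
}
```

Unlike `join_all()`, the caller gets a `Result` it can handle instead of a propagated panic.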
Phase 3: Lower-Impact Fixes

Step 3.1 — Use `available_parallelism` in `trie_from_sorted_accounts_wrap` (`crates/common/trie/trie_sorted.rs`)

- Before: `ThreadPool::new(12, s)` hardcoded 12 threads for the trie-building thread pool, regardless of the machine's actual CPU count.
- After: the pool is sized via `std::thread::available_parallelism().map(|n| n.get()).unwrap_or(8)`. On machines with fewer cores (e.g., 4-core CI runners), this avoids oversubscription; on machines with more cores (e.g., 32-core servers), it takes full advantage of available parallelism.
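A sketch of the sizing logic. The `.max(2)` floor reflects the review note above that a single-thread pool would stall waiting on write buffers; it is a suggested addition, not necessarily what the PR currently does:

```rust
// Size the trie-build pool from the machine instead of a hardcoded 12.
fn trie_pool_threads() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(8) // fallback when the parallelism can't be queried
        .max(2) // floor so one thread is never stuck waiting on write buffers
}
```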
Files Modified

| File | Changes |
| --- | --- |
| `crates/storage/trie.rs` | Fixed `put_batch` per-key allocation; added `put_batch_no_alloc` override; added timing |
| `crates/storage/store.rs` | Added `write_storage_trie_nodes_batch_sync`; extracted shared `write_storage_trie_nodes_inner` |
| `crates/common/trie/trie_sorted.rs` | Thread pool sized via `available_parallelism` |
| `crates/networking/p2p/sync/snap_sync.rs` | Single-pass `insert_accounts` collecting code hashes into the `HashSet` directly; added timing |
| `crates/networking/p2p/sync/healing/mod.rs` | Added `wait_for_pending_task` and `drain_pending_tasks` shared helpers |
| `crates/networking/p2p/sync/healing/state.rs` | `entry` API fix; `Result` return from closures; `drain_pending_tasks`; empty-batch guard; timing |
| `crates/networking/p2p/sync/healing/storage.rs` | `entry` API fix; sync DB write; `Result` return from closures; `drain_pending_tasks`; empty-batch guard; timing |

How to Test
Automated Tests (all passing)
- `cargo test -p ethrex-trie` — 55/55 pass
- `cargo test -p ethrex-p2p --features rocksdb` — 38/38 pass
- `cargo test -p ethrex-test snap_server` — 12/12 pass
- `cargo clippy -p ethrex-trie -p ethrex-storage -p ethrex-p2p --features rocksdb` — 0 warnings
To fully validate, the changes require a live snap sync run:
- Healing completes (`healing_done = true`) with no extra cycles
- `put_batch` → `put`: same `batch.put_cf` call, just without the `Vec` wrapper
- `put_batch_no_alloc` override: encode-and-write in one pass, single commit
- Code hashes held in a `HashSet`, flushed after the trie build
- `entry` API: differs from the previous `insert` behavior only when an entry already exists
- `drain_pending_tasks`: replaces `join_all()` with explicit error checking per task