perf(l1): disable WAL and improve concurrency in snap sync #6178
Conversation
Add a commit_no_wal() method to StorageWriteBatch trait with a default fallback to commit(). The RocksDB implementation uses WriteOptions with disable_wal(true) to skip the write-ahead log, avoiding double-write I/O for data that can be re-downloaded (snap sync). Wire the flag through BackendTrieDB (with_no_wal() builder) and add _no_wal variants to Store: open_direct_state_trie_no_wal, open_direct_storage_trie_no_wal, write_storage_trie_nodes_batch_no_wal, write_account_code_batch_no_wal, and write_batch_async_no_wal.
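A minimal sketch of the trait-default pattern described above. `StorageWriteBatch`, `commit_no_wal()`, and `commit()` are the names from the PR; the `InMemoryBatch` backend and the `String` error type are illustrative placeholders (the real RocksDB override would call `DB::write_opt` with `WriteOptions::disable_wal(true)` instead):

```rust
// Sketch of the commit_no_wal() trait default described above.
// InMemoryBatch and the String error type are illustrative placeholders.
trait StorageWriteBatch {
    fn commit(self) -> Result<(), String>;

    // Default fallback: backends with no WAL concept just commit normally,
    // so only the RocksDB implementation needs to override this.
    fn commit_no_wal(self) -> Result<(), String>
    where
        Self: Sized,
    {
        self.commit()
    }
}

struct InMemoryBatch {
    entries: Vec<(String, String)>,
}

impl StorageWriteBatch for InMemoryBatch {
    fn commit(self) -> Result<(), String> {
        // Pretend to persist; report how many entries were written.
        println!("committed {} entries", self.entries.len());
        Ok(())
    }
}

fn main() {
    let batch = InMemoryBatch {
        entries: vec![("key".into(), "value".into())],
    };
    // Resolves to the trait default, which delegates to commit().
    assert!(batch.commit_no_wal().is_ok());
}
```

The default-with-override shape is what lets non-RocksDB backends (and any callers that don't care about WAL) keep working unchanged while snap sync opts in per call site.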
Extract dispatch_state_healing_batches() that sends up to MAX_IN_FLIGHT_REQUESTS (77) concurrent trie node requests instead of one at a time. Replace try_recv() busy-polling with tokio::select! on channel recv + 1s timeout. Process all pending healed batches with while-let instead of if-let to drain the queue each iteration. Switch state trie DB writes to open_direct_state_trie_no_wal().
Replace all snap sync DB writes with their _no_wal variants: account trie insertion, storage trie insertion, storage trie node batch writes, bytecode batch writes, and storage root computation. In storage healing, replace try_recv() + yield_now() busy loop with tokio::select! on channel recv + 1s timeout to eliminate CPU waste.
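The PR's actual fix uses `tokio::select!` on an async channel; the same "block with a timeout instead of spinning" idea can be sketched dependency-free with `std::sync::mpsc::recv_timeout` (the channel, message type, and delay here are illustrative):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    // Worker standing in for a peer task that responds after a short delay.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let _ = tx.send(42);
    });

    // Instead of try_recv() + yield_now() in a hot loop (burning CPU), block
    // for up to 1s per iteration; the timeout branch is where periodic
    // bookkeeping (metrics, "no peers" logging) would live.
    loop {
        match rx.recv_timeout(Duration::from_secs(1)) {
            Ok(msg) => {
                println!("got response: {msg}");
                break;
            }
            // e.g. bump an empty/timeout counter here, then retry.
            Err(mpsc::RecvTimeoutError::Timeout) => continue,
            Err(mpsc::RecvTimeoutError::Disconnected) => break,
        }
    }
}
```

The payoff is identical in both versions: the waiting task consumes no CPU until a message arrives or the timeout fires, instead of spinning through `try_recv()` thousands of times per second.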
🤖 Kimi Code Review

Review Summary: This PR introduces WAL (Write-Ahead Log) bypassing for snap sync operations and refactors state healing to use concurrent batch dispatching. The changes are generally well-structured, but there are several issues to address:

Critical Issues
Security & Consensus Issues
Performance & Correctness Issues
Code Quality Issues
Suggested Fixes
```rust
// Replace the busy-wait loop with proper synchronization
while inflight_tasks > 0 || !nodes_to_heal.is_empty() {
    tokio::select! {
        Some((peer_id, response, batch)) = task_receiver.recv() => {
            // Queue the response for processing below
            nodes_to_heal.push((response, batch));
        }
        _ = tokio::time::sleep(Duration::from_secs(1)) => {
            // Timeout: fall through and drain whatever is already queued
        }
    }
    // Process all available batches
    while let Some((response, batch)) = nodes_to_heal.pop() {
        // Process batch
    }
}
```
```rust
// After snap sync completes
pub async fn verify_state_consistency(&self, expected_state_root: H256) -> Result<bool, StoreError> {
    // Verify the computed state root matches expected_state_root
    todo!()
}
```
```rust
const NO_PEERS_LOG_INTERVAL: u32 = 1000;
```

The WAL bypass changes appear safe for the snap sync use case, but should be clearly documented as only safe during initial sync.

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

Here are the review findings I'd flag.

Findings
If the no-WAL paths are strictly snap-sync only and you're comfortable with possible partial writes on crash, the rest looks reasonable; I didn't spot EVM/consensus-critical changes.
Automated review by OpenAI Codex · custom prompt
Greptile Overview

Greptile Summary: This PR significantly improves snap sync performance by disabling RocksDB's write-ahead log (WAL) during sync operations and introducing concurrent multi-peer healing requests.

Key Changes:
Confidence Score: 5/5
| Filename | Overview |
|---|---|
| crates/storage/backend/rocksdb.rs | Implemented commit_no_wal() to bypass write-ahead log using WriteOptions::disable_wal(true) for RocksDB |
| crates/storage/trie.rs | Added no_wal field to BackendTrieDB with with_no_wal() builder method, conditionally calls commit_no_wal() in put_batch() |
| crates/storage/store.rs | Added _no_wal variants for batch write operations and trie opening methods, routing through internal helpers with no_wal flag |
| crates/networking/p2p/sync/healing/state.rs | Replaced single-peer busy-polling with concurrent multi-peer dispatch (up to 77 requests) and tokio::select! with 1s timeout; switched to _no_wal DB writes |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Main as heal_state_trie
    participant Dispatch as dispatch_state_healing_batches
    participant Peers as PeerHandler
    participant Worker as tokio::spawn
    participant Channel as task_receiver
    Main->>Dispatch: dispatch up to 77 batches
    loop while inflight < MAX_IN_FLIGHT_REQUESTS
        Dispatch->>Peers: get_best_peer()
        Peers-->>Dispatch: (peer_id, connection)
        Dispatch->>Worker: spawn(request_state_trienodes)
        Worker->>Worker: fetch trie nodes from peer
        Worker->>Channel: send(peer_id, response, batch)
    end
    Main->>Main: tokio::select! on recv + 1s timeout
    Channel-->>Main: (peer_id, response, batch)
    Main->>Main: process response
    Main->>Main: while let Some() drain nodes_to_heal
    Main->>Main: write to DB with no-WAL
```
🤖 Claude Code Review

Code Review: PR #6178 - Disable WAL and Improve Concurrency in Snap Sync

Summary: This PR implements performance optimizations for snap sync by disabling RocksDB's write-ahead log (WAL) during sync operations and introducing concurrent multi-peer state healing. The changes achieve a 21% performance improvement (13:57 → 11:04) on Hoodi testnet.

Critical Issues

1. Unsafe Memory Management in RocksDB
Pull request overview
This PR optimizes snap sync throughput by allowing storage writes to bypass the RocksDB WAL during sync and by increasing concurrency in state healing (plus removing busy-polling loops).
Changes:
- Add a `commit_no_wal()` API to the storage write-batch abstraction and implement it for RocksDB (via `WriteOptions::disable_wal(true)`), wiring it through trie DB and store helpers.
- Switch snap sync write paths (tries, storage trie nodes, bytecodes) to use the new no-WAL variants.
- Improve healing performance by dispatching multiple concurrent state-healing requests and replacing `try_recv()`/busy loops with `tokio::select!` + timeouts.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/storage/api/mod.rs | Extends StorageWriteBatch with commit_no_wal() (defaulting to commit()). |
| crates/storage/backend/rocksdb.rs | Implements commit_no_wal() using RocksDB write_opt with WAL disabled. |
| crates/storage/trie.rs | Adds BackendTrieDB::with_no_wal() and routes trie batch commits through commit_no_wal() when enabled. |
| crates/storage/store.rs | Adds _no_wal variants for key snap-sync write paths and async batch writes; adds no-WAL trie open helpers. |
| crates/networking/p2p/sync/snap_sync.rs | Switches snap sync writes and trie opens to no-WAL variants. |
| crates/networking/p2p/sync/healing/state.rs | Dispatches multiple healing requests concurrently and replaces busy polling with tokio::select!. |
| crates/networking/p2p/sync/healing/storage.rs | Replaces try_recv() + yield_now() polling with tokio::select! on recv + timeout; uses no-WAL writes for healed batches. |
```rust
}
Err(TryRecvError::Disconnected) => {
    state.disconnected_count += 1;
_ = tokio::time::sleep(Duration::from_secs(1)) => {
```
state.empty_count is still reported via METRICS.healing_empty_try_recv, but after switching from try_recv() to tokio::select! it is never incremented anywhere, so this metric/log field will always stay at 0. Consider incrementing empty_count on the timeout branch (or removing the counter/metric if it’s no longer meaningful).
Suggested change:

```rust
_ = tokio::time::sleep(Duration::from_secs(1)) => {
    state.empty_count += 1;
```
```rust
let batch: Vec<RequestMetadata> =
    paths.drain(0..min(paths.len(), NODE_BATCH_SIZE)).collect();
```
paths.drain(0..min(paths.len(), NODE_BATCH_SIZE)) repeatedly drains from the front of a Vec, which is O(n) per drain due to element shifting. With multi-dispatch enabled this can become a noticeable CPU cost. Consider switching paths to a VecDeque (pop from front) or draining from the end (e.g., split_off) if ordering isn’t important.
Suggested change:

```rust
let batch_size = min(paths.len(), NODE_BATCH_SIZE);
let batch: Vec<RequestMetadata> = paths.split_off(paths.len() - batch_size);
```
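A quick dependency-free check of the two batching strategies contrasted above: `drain(0..n)` takes from the front and shifts every remaining element, while `split_off(len - n)` moves only the tail batch — at the cost of handing out the newest paths first, which is why the suggestion only applies if ordering doesn't matter. The `u32` element type and batch size are placeholders for the real `RequestMetadata` / `NODE_BATCH_SIZE`:

```rust
use std::cmp::min;

// Placeholder for the real NODE_BATCH_SIZE constant.
const NODE_BATCH_SIZE: usize = 3;

fn main() {
    // Front-drain: preserves order, but shifts the rest of the Vec each call.
    let mut front: Vec<u32> = (0..8).collect();
    let batch_size = min(front.len(), NODE_BATCH_SIZE);
    let front_batch: Vec<u32> = front.drain(0..batch_size).collect();
    assert_eq!(front_batch, vec![0, 1, 2]);
    assert_eq!(front.len(), 5);

    // Tail-split: O(batch) move, no shifting, but yields the newest items first.
    let mut back: Vec<u32> = (0..8).collect();
    let batch_size = min(back.len(), NODE_BATCH_SIZE);
    let back_batch: Vec<u32> = back.split_off(back.len() - batch_size);
    assert_eq!(back_batch, vec![5, 6, 7]);
    assert_eq!(back.len(), 5);

    println!("both strategies leave {} paths queued", front.len());
}
```

A `VecDeque` with `pop_front()` would keep FIFO ordering and still avoid the shifting, if batch order does matter.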
| )>, | ||
| logged_no_free_peers_count: &mut u32, | ||
| ) { | ||
| while (*inflight_tasks as u32) < MAX_IN_FLIGHT_REQUESTS && !paths.is_empty() { |
(*inflight_tasks as u32) < MAX_IN_FLIGHT_REQUESTS — the as u32 truncates silently if inflight_tasks ever exceeds u32::MAX. Can't happen in practice (max 77), but the cast is unnecessary since MAX_IN_FLIGHT_REQUESTS is a constant: either change inflight_tasks to u32 (it's only used as a counter here), or cast the constant up: *inflight_tasks < MAX_IN_FLIGHT_REQUESTS as u64.
```rust
*inflight_tasks += 1;
let peer_table = peers.peer_table.clone();
tokio::spawn(async move {
```
Fire-and-forget tokio::spawn — if this task panics (e.g., request_state_trienodes panics), inflight_tasks is never decremented and the outer tokio::select! will poll the channel forever (retrying every 1s on timeout). Pre-existing issue, but now amplified since up to 77 tasks can be in flight.
Consider collecting the JoinHandle (or using JoinSet) and checking for panics when draining responses, similar to how db_joinset is handled elsewhere in this file.
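The `JoinSet` suggestion boils down to: keep the handles so a panicked task is observed and accounted for, rather than silently leaking the `inflight_tasks` count. A dependency-free sketch of the same idea using `std::thread` (in the PR this would be `tokio::task::JoinSet` and `join_next()`; the worker closure and counter are illustrative):

```rust
use std::thread;

fn main() {
    let mut handles = Vec::new();

    for i in 0..3 {
        handles.push(thread::spawn(move || {
            if i == 1 {
                // Stands in for request_state_trienodes panicking in one worker.
                panic!("worker {i} failed");
            }
            i
        }));
    }

    let mut inflight = handles.len();
    for handle in handles {
        // join() returns Err if the task panicked, so the counter is always
        // decremented -- unlike a fire-and-forget spawn, where a panic would
        // leave `inflight` stuck above zero and the receive loop spinning.
        match handle.join() {
            Ok(result) => println!("task finished with {result}"),
            Err(_) => println!("task panicked; still accounted for"),
        }
        inflight -= 1;
    }
    assert_eq!(inflight, 0);
}
```

With `JoinSet`, the equivalent guarantee comes from `join_next()` returning a `Result` whose `Err` carries the panic, so the drain loop can decrement the in-flight count on every outcome.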
Motivation
Snap sync performance profiling on Hoodi showed that RocksDB write-ahead log (WAL) overhead and single-peer healing dispatch were limiting throughput. Nethermind's snap sync (which achieves ~20 min on mainnet) disables WAL during sync since all data can be re-downloaded on crash, and uses multi-peer concurrent requests for healing.
Description
Three changes, each in a separate commit:
1. Storage layer: `commit_no_wal()` (1848388)
   - Add `commit_no_wal()` to the `StorageWriteBatch` trait with a default fallback to `commit()`
   - RocksDB implementation uses `WriteOptions` with `disable_wal(true)` to skip the write-ahead log
   - Wire through `BackendTrieDB` via a `with_no_wal()` builder and add `_no_wal` variants to `Store`
2. Multi-peer state healing (7d18550)
   - Extract `dispatch_state_healing_batches()`, which sends up to `MAX_IN_FLIGHT_REQUESTS` (77) concurrent trie node requests instead of one at a time
   - Replace `try_recv()` busy-polling with `tokio::select!` on channel recv + 1s timeout
   - Drain the healed-batch queue each iteration with `while let` instead of `if let`
3. Wire no-WAL through snap sync + fix storage healing polling (5730189)
   - Replace snap sync DB writes with `_no_wal` variants (account trie, storage trie, bytecodes, storage roots)
   - Replace the `try_recv()` + `yield_now()` busy loop in storage healing with `tokio::select!` on recv + 1s timeout

Benchmark results (Hoodi, ethrex-mainnet-4: 12 cores, 62 GB RAM)
The biggest win is the bytecodes phase (62% faster) where WAL-disabled writes dramatically reduce I/O. Phases 1, 2, 4 are network-bound so their deltas are peer variability. Storage healing healed more accounts (19,333 vs 16,555) due to slightly different chain state, explaining the +10s.
How to Test
Run snap sync on Hoodi (or any testnet) and compare total sync time against main:
Look for `Sync cycle finished successfully time_elapsed_s=...` in the logs.