perf(l1): disable WAL and improve concurrency in snap sync#6178

Open
ilitteri wants to merge 3 commits into main from snap-sync-phase1-optimizations

Conversation

@ilitteri
Collaborator

Motivation

Snap sync performance profiling on Hoodi showed that RocksDB write-ahead log (WAL) overhead and single-peer healing dispatch were limiting throughput. Nethermind's snap sync (which achieves ~20 min on mainnet) disables WAL during sync since all data can be re-downloaded on crash, and uses multi-peer concurrent requests for healing.

Description

Three changes, each in a separate commit:

1. Storage layer: commit_no_wal() (1848388)

  • Add commit_no_wal() to StorageWriteBatch trait with default fallback to commit()
  • RocksDB implementation uses WriteOptions with disable_wal(true) to skip the write-ahead log
  • Wire through BackendTrieDB via with_no_wal() builder and add _no_wal variants to Store
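The trait-default pattern described above can be sketched in a few lines. This is an illustrative sketch, not the real `StorageWriteBatch` trait: the actual ethrex trait has more methods and its own error type, and the simplified signatures here are assumptions.

```rust
// Sketch of the commit_no_wal() trait-default pattern: backends that
// cannot skip the WAL simply inherit the safe fallback to commit().
// Signatures are simplified stand-ins for the real ethrex trait.
trait StorageWriteBatch {
    fn commit(&mut self) -> Result<(), String>;

    // Default implementation: no-WAL commit degrades to a normal commit
    // for backends (unlike RocksDB) that don't override it.
    fn commit_no_wal(&mut self) -> Result<(), String> {
        self.commit()
    }
}

struct InMemoryBatch; // stand-in for a non-RocksDB backend

impl StorageWriteBatch for InMemoryBatch {
    fn commit(&mut self) -> Result<(), String> {
        Ok(())
    }
}

fn main() {
    let mut batch = InMemoryBatch;
    // Falls back to commit() because InMemoryBatch doesn't override it.
    assert!(batch.commit_no_wal().is_ok());
    println!("no-wal commit fell back to commit()");
}
```

A RocksDB backend would override `commit_no_wal()` to pass `WriteOptions` with `disable_wal(true)` to the write, while every other backend keeps working unchanged.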

2. Multi-peer state healing (7d18550)

  • Extract dispatch_state_healing_batches() that sends up to MAX_IN_FLIGHT_REQUESTS (77) concurrent trie node requests instead of one at a time
  • Replace try_recv() busy-polling with tokio::select! on channel recv + 1s timeout
  • Process all pending healed batches with while let instead of if let to drain the queue each iteration
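The `if let` to `while let` change in the last bullet can be shown with a plain `Vec` standing in for the real healing queue (the batch type and queue structure here are illustrative assumptions):

```rust
// Draining every pending batch per loop iteration (`while let`)
// instead of at most one (`if let`). `nodes_to_heal` is a plain Vec
// of fake batches, not the real healing queue type.
fn drain_all(nodes_to_heal: &mut Vec<u32>) -> u32 {
    let mut heals_per_cycle = 0;
    // `while let` keeps popping until the queue is empty; the old
    // `if let` would have processed a single entry per outer loop pass.
    while let Some(_batch) = nodes_to_heal.pop() {
        heals_per_cycle += 1;
    }
    heals_per_cycle
}

fn main() {
    let mut queue = vec![1, 2, 3, 4];
    assert_eq!(drain_all(&mut queue), 4);
    assert!(queue.is_empty());
    println!("drained all pending batches in one cycle");
}
```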

3. Wire no-WAL through snap sync + fix storage healing polling (5730189)

  • Switch all snap sync DB writes to _no_wal variants (account trie, storage trie, bytecodes, storage roots)
  • Replace try_recv() + yield_now() busy loop in storage healing with tokio::select! on recv + 1s timeout
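The PR uses `tokio::select!` on the async channel; the same "block until a message or a timeout, instead of spinning" pattern can be sketched with the std-lib `recv_timeout`, which is what this hedged, std-only analogue shows (the channel payload type is a placeholder):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Std-lib analogue of replacing a try_recv() + yield_now() busy loop
// with tokio::select! on recv + a 1s timeout: the thread parks until
// a message arrives or the timeout fires, instead of spinning on an
// empty channel.
fn wait_for_batch(rx: &mpsc::Receiver<u32>) -> Option<u32> {
    loop {
        match rx.recv_timeout(Duration::from_secs(1)) {
            Ok(batch) => return Some(batch),
            // Timeout branch: in the PR this is where a liveness metric
            // (e.g. an empty_count) could be incremented.
            Err(mpsc::RecvTimeoutError::Timeout) => continue,
            Err(mpsc::RecvTimeoutError::Disconnected) => return None,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        tx.send(42).unwrap();
    });
    assert_eq!(wait_for_batch(&rx), Some(42));
    println!("received batch without busy-polling");
}
```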

Benchmark results (Hoodi, ethrex-mainnet-4: 12 cores, 62 GB RAM)

| Phase | main (baseline) | This PR | Delta |
|---|---|---|---|
| 1. Block Headers | 1:30 | 1:20 | -10s (network) |
| 2. Account Ranges | 0:20 | 0:20 | |
| 3. Account Insertion | 0:40 | 0:40 | |
| 4. Storage Ranges | 2:50 | 2:20 | -30s (network) |
| 5. Storage Insertion | 4:10 | 4:10 | |
| 6. State Healing | 0:10 | 0:10 | |
| 7. Storage Healing | 0:40 | 0:50 | +10s (more accounts) |
| 8. Bytecodes | 3:11 | 1:13 | -118s |
| Finalize | ~17s | ~2s | -15s |
| Total | 837s (13:57) | 664s (11:04) | -173s (21%) |

The biggest win is the bytecodes phase (62% faster), where WAL-disabled writes dramatically reduce I/O. Phases 1, 2, and 4 are network-bound, so their deltas reflect peer variability. Storage healing healed more accounts (19,333 vs 16,555) due to slightly different chain state, explaining the +10s.

How to Test

Run snap sync on Hoodi (or any testnet) and compare total sync time against main:

# Build and run
cargo build --release --bin ethrex
./target/release/ethrex --network hoodi --authrpc.jwtsecret <jwt> --p2p.lookup-interval 10

Look for `Sync cycle finished successfully time_elapsed_s=...` in the logs.

Add a commit_no_wal() method to StorageWriteBatch trait with a default
fallback to commit(). The RocksDB implementation uses WriteOptions with
disable_wal(true) to skip the write-ahead log, avoiding double-write
I/O for data that can be re-downloaded (snap sync).

Wire the flag through BackendTrieDB (with_no_wal() builder) and add
_no_wal variants to Store: open_direct_state_trie_no_wal,
open_direct_storage_trie_no_wal, write_storage_trie_nodes_batch_no_wal,
write_account_code_batch_no_wal, and write_batch_async_no_wal.

Extract dispatch_state_healing_batches() that sends up to
MAX_IN_FLIGHT_REQUESTS (77) concurrent trie node requests instead of
one at a time. Replace try_recv() busy-polling with tokio::select! on
channel recv + 1s timeout. Process all pending healed batches with
while-let instead of if-let to drain the queue each iteration.

Switch state trie DB writes to open_direct_state_trie_no_wal().
Replace all snap sync DB writes with their _no_wal variants:
account trie insertion, storage trie insertion, storage trie node
batch writes, bytecode batch writes, and storage root computation.

In storage healing, replace try_recv() + yield_now() busy loop with
tokio::select! on channel recv + 1s timeout to eliminate CPU waste.
Copilot AI review requested due to automatic review settings February 11, 2026 15:28
@ilitteri ilitteri requested a review from a team as a code owner February 11, 2026 15:28
@github-actions github-actions bot added L1 Ethereum client performance Block execution throughput and performance in general labels Feb 11, 2026
@github-actions

🤖 Kimi Code Review

Review Summary

This PR introduces WAL (Write-Ahead Log) bypassing for snap sync operations and refactors state healing to use concurrent batch dispatching. The changes are generally well-structured, but there are several issues to address:

Critical Issues

  1. Race Condition in State Healing (state.rs:142-194)

    • The while let Some((nodes, batch)) = nodes_to_heal.pop() loop can process batches before all responses are received, potentially causing out-of-order processing
    • This could lead to incorrect state trie construction if parent nodes are processed before their children
  2. Missing Error Handling (state.rs:285-360)

    • dispatch_state_healing_batches doesn't handle the case where request_state_trienodes fails to spawn
    • No validation that dispatched tasks actually complete successfully

Security & Consensus Issues

  1. State Consistency Risk (state.rs:194-221)
    • The WAL bypass (commit_no_wal) during snap sync could lead to state corruption if the process crashes mid-sync
    • Consider adding a checkpoint mechanism or verification step after sync completion

Performance & Correctness Issues

  1. Inefficient Batch Processing (state.rs:142-194)

    • The while let Some(...) loop processes one batch at a time instead of processing all available batches
    • This reduces the benefits of concurrent dispatching
  2. Resource Leak Potential (state.rs:360)

    • JoinSet in heal_state_batch is created but never properly cleaned up if errors occur

Code Quality Issues

  1. Type Mismatch (state.rs:112)

    • logged_no_free_peers_count changed from u64 to u32 but the decrement logic (lines 156-158, 324-326) uses hardcoded values that assume u32
  2. Redundant Code (state.rs:285-360)

    • dispatch_state_healing_batches duplicates logic from the original inline code
    • Consider extracting common patterns into shared utilities

Suggested Fixes

  1. Fix race condition:
// Replace the while loop with proper synchronization
while inflight_tasks > 0 || !nodes_to_heal.is_empty() {
    tokio::select! {
        Some((peer_id, response, batch)) = task_receiver.recv() => {
            // Process response and add to nodes_to_heal
            nodes_to_heal.push((nodes, batch));
        }
    }
    
    // Process all available batches
    while let Some((nodes, batch)) = nodes_to_heal.pop() {
        // Process batch
    }
}
  2. Add WAL verification:
// After snap sync completes
pub async fn verify_state_consistency(&self, expected_state_root: H256) -> Result<bool, StoreError> {
    // Verify state root matches expected
}
  3. Fix type consistency:
const NO_PEERS_LOG_INTERVAL: u32 = 1000;

The WAL bypass changes appear safe for snap sync use case, but should be clearly documented as only safe during initial sync.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

Lines of code report

Total lines added: 136
Total lines removed: 1
Total lines changed: 137

Detailed view
+------------------------------------------------------+-------+------+
| File                                                 | Lines | Diff |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/state.rs   | 413   | +26  |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/storage.rs | 614   | -1   |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/api/mod.rs                     | 40    | +3   |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/backend/rocksdb.rs             | 323   | +8   |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                       | 2514  | +86  |
+------------------------------------------------------+-------+------+
| ethrex/crates/storage/trie.rs                        | 161   | +13  |
+------------------------------------------------------+-------+------+

@github-actions

🤖 Codex Code Review

Here are the review findings I’d flag.

Findings

  • Durability risk if heal_state_trie can run outside snap sync. open_direct_state_trie_no_wal is now used for state healing writes, so a crash can drop recently written trie nodes with no WAL replay, and the in‑memory queue won’t be persisted. This is fine only if this code path is strictly snap‑sync (re-downloadable). Please confirm the invariant or guard it explicitly. crates/networking/p2p/sync/healing/state.rs:196-203, crates/storage/store.rs:2521-2529.
  • Potential code/metadata inconsistency on crash with WAL disabled. write_account_code_batch_no_wal writes ACCOUNT_CODES and ACCOUNT_CODE_METADATA in two separate batches with WAL disabled. A crash between them can leave metadata without code (or vice‑versa), which could surface as invalid code length or “missing code” during reads. If recovery relies on re-download, it’s probably OK, but consider a single batch across CFs or add validation/repair on startup. crates/storage/store.rs:1199-1223.
  • Metrics regression (may hide liveliness issues). heal_storage_trie no longer increments state.empty_count on empty receives; the select timeout just continues. If this metric is used for monitoring or backoff heuristics, it will now stay near zero. Not correctness, but worth verifying. crates/networking/p2p/sync/healing/storage.rs:257-270.

If the no‑WAL paths are strictly snap‑sync only and you’re comfortable with possible partial writes on crash, the rest looks reasonable and I didn’t spot EVM/consensus‑critical changes.

Natural next steps:

  1. Confirm whether healing paths ever run outside snap sync; if yes, add a guard or keep WAL.
  2. Decide whether code/metadata should be atomically written (single batch) or tolerate mismatches with a repair pass.

Automated review by OpenAI Codex · custom prompt

@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR significantly improves snap sync performance by disabling RocksDB's write-ahead log (WAL) during sync operations and introducing concurrent multi-peer healing requests.

Key Changes:

  • Storage layer: Added commit_no_wal() method to StorageWriteBatch trait with RocksDB implementation using WriteOptions::disable_wal(true). Default trait implementation falls back to commit() for non-RocksDB backends
  • Multi-peer concurrency: Refactored state healing from single-peer sequential requests to concurrent dispatch of up to 77 requests (MAX_IN_FLIGHT_REQUESTS) across multiple peers
  • Async improvements: Replaced busy-polling (try_recv() + yield_now()) with proper tokio::select! on channel recv + 1s timeout in both state and storage healing
  • Snap sync integration: All DB write paths during snap sync now use _no_wal variants (account trie, storage trie, bytecodes, storage roots)

Performance Impact:
Benchmark on Hoodi shows 21% total speedup (13:57 → 11:04), with bytecodes phase improving 62% (3:11 → 1:13) due to reduced I/O overhead from WAL bypass. Network-bound phases show peer variability.

Safety:
WAL bypass is safe during snap sync since all data can be re-downloaded on crash. The change is isolated to sync operations and doesn't affect normal block execution paths.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured with clear separation of concerns. The WAL bypass is correctly isolated to snap sync paths where data loss on crash is acceptable (re-downloadable). The multi-peer concurrency changes improve performance without introducing race conditions. All changes maintain backward compatibility through trait default methods and dedicated _no_wal variants. The approach follows Nethermind's proven pattern for fast snap sync.
  • No files require special attention

Important Files Changed

Filename Overview
crates/storage/backend/rocksdb.rs Implemented commit_no_wal() to bypass write-ahead log using WriteOptions::disable_wal(true) for RocksDB
crates/storage/trie.rs Added no_wal field to BackendTrieDB with with_no_wal() builder method, conditionally calls commit_no_wal() in put_batch()
crates/storage/store.rs Added _no_wal variants for batch write operations and trie opening methods, routing through internal helpers with no_wal flag
crates/networking/p2p/sync/healing/state.rs Replaced single-peer busy-polling with concurrent multi-peer dispatch (up to 77 requests) and tokio::select! with 1s timeout; switched to _no_wal DB writes

Sequence Diagram

sequenceDiagram
    participant Main as heal_state_trie
    participant Dispatch as dispatch_state_healing_batches
    participant Peers as PeerHandler
    participant Worker as tokio::spawn
    participant Channel as task_receiver

    Main->>Dispatch: dispatch up to 77 batches
    loop while inflight < MAX_IN_FLIGHT_REQUESTS
        Dispatch->>Peers: get_best_peer()
        Peers-->>Dispatch: (peer_id, connection)
        Dispatch->>Worker: spawn(request_state_trienodes)
        Worker->>Worker: fetch trie nodes from peer
        Worker->>Channel: send(peer_id, response, batch)
    end
    
    Main->>Main: tokio::select! on recv + 1s timeout
    Channel-->>Main: (peer_id, response, batch)
    Main->>Main: process response
    Main->>Main: while let Some() drain nodes_to_heal
    Main->>Main: write to DB with no-WAL

@github-actions

🤖 Claude Code Review


Code Review: PR #6178 - Disable WAL and Improve Concurrency in Snap Sync

Summary

This PR implements performance optimizations for snap sync by disabling RocksDB's write-ahead log (WAL) during sync operations and introducing concurrent multi-peer state healing. The changes achieve a 21% performance improvement (13:57 → 11:04) on Hoodi testnet.

Critical Issues

1. Unsafe Memory Management in RocksDBLocked ⚠️

File: crates/storage/backend/rocksdb.rs:394-403

The RocksDBLocked uses unsafe memory operations with Box::leak and Box::from_raw:

fn begin_locked(&self, table_name: &'static str) -> Result<Box<dyn StorageLockedView>, StoreError> {
    let db = Box::leak(Box::new(self.db.clone()));  // Line 254
    // ...
}

impl Drop for RocksDBLocked {
    fn drop(&mut self) {
        unsafe {
            drop(Box::from_raw(
                self.db as *const Arc<DBWithThreadMode<MultiThreaded>>
                    as *mut Arc<DBWithThreadMode<MultiThreaded>>,
            ));
        }
    }
}

Issue: This pattern is error-prone and relies on manual lifetime management. While this code exists before this PR, the increased usage during snap sync (with _no_wal variants) amplifies the risk.

Recommendation: Consider using Arc directly or ensuring the leak/drop symmetry is well-documented and tested.


2. Panic on Missing Parent in State Healing 🔴

File: crates/networking/p2p/sync/healing/state.rs:406-408

let mut healing_queue_entry = healing_queue.remove(parent_path).unwrap_or_else(|| {
    panic!("The parent should exist. Parent: {parent_path:?}, path: {path:?}")
});

Issue: The code panics if a parent node is missing from the healing queue. In a distributed P2P environment with potentially malicious or buggy peers, this could crash the sync process.

Recommendation: Convert to proper error handling:

let mut healing_queue_entry = healing_queue.remove(parent_path)
    .ok_or_else(|| SyncError::Custom(format!("Parent missing: {parent_path:?}, path: {path:?}")))?;

3. Channel Send Error Ignored ⚠️

File: crates/networking/p2p/sync/healing/state.rs:348-353

let _ = tx.send((peer_id, response, batch)).await.inspect_err(
    |err| debug!(error=?err, "Failed to send state trie nodes response"),
);

Issue: Channel send errors are logged but ignored. If the receiver is dropped or the channel is full, tasks silently fail without updating the inflight counter, leading to potential deadlocks.

Recommendation: Propagate errors or implement retry logic. Consider bounded channels with back-pressure.


Performance & Correctness

4. Missing Error Propagation in Storage Healing ⚠️

File: crates/networking/p2p/sync/healing/storage.rs:228-229

spawned_rt::tasks::block_on(store.write_storage_trie_nodes_batch_no_wal(encoded_to_write))
    .expect("db write failed");

Issue: Using expect() in async code will panic the entire task on DB errors during snap sync. Recoverable errors (disk full, temporary I/O issues) become fatal.

Recommendation: Propagate errors properly:

spawned_rt::tasks::block_on(store.write_storage_trie_nodes_batch_no_wal(encoded_to_write))?;

5. Concurrent Multi-Peer Dispatch - Good Design

File: crates/networking/p2p/sync/healing/state.rs:288-356

The new dispatch_state_healing_batches() function properly:

  • Limits concurrent requests to MAX_IN_FLIGHT_REQUESTS (77)
  • Uses tokio::select! instead of busy-polling
  • Handles peer unavailability gracefully
  • Uses exponential back-off for logging

Minor suggestion: Consider making MAX_IN_FLIGHT_REQUESTS configurable for different network conditions.


6. WAL Bypass Implementation - Correct

Files: crates/storage/api/mod.rs, crates/storage/backend/rocksdb.rs, crates/storage/trie.rs

The WAL bypass implementation is sound:

  • Default fallback to commit() ensures compatibility
  • RocksDB WriteOptions with disable_wal(true) is correct
  • Builder pattern (with_no_wal()) is clean and explicit
  • All snap sync write paths properly use _no_wal variants

Concern: Ensure crash recovery documentation clearly states that snap sync must restart if crashed during WAL-disabled writes.


Code Quality

7. Batch Processing Logic

File: crates/networking/p2p/sync/healing/state.rs:159-173

The change from if let to while let for draining nodes_to_heal is correct and improves throughput:

// Process all pending healed node batches
while let Some((nodes, batch)) = nodes_to_heal.pop() {
    heals_per_cycle += 1;
    // ...
}

This ensures all completed batches are processed each iteration instead of one at a time.


8. Type Safety Improvement

File: crates/networking/p2p/sync/healing/state.rs:112

let mut logged_no_free_peers_count: u32 = 0;  // Changed from implicit to explicit u32

Good change for clarity and overflow protection.


Security Considerations

9. Node Hash Validation

File: crates/networking/p2p/sync/healing/storage.rs:469-476

The code properly validates node hashes from peers:

if node.compute_hash().finalize() != node_request.hash {
    trace!("Node Hash failed");
    Err(RLPDecodeError::MalformedData)
} else {
    Ok(NodeResponse { node_request: node_request.clone(), node })
}

This prevents malicious peers from injecting invalid trie nodes.


10. Potential DoS via Empty Responses ⚠️

File: crates/networking/p2p/sync/healing/storage.rs:438-448

if nodes_size == 0 {
    *failed_downloads += 1;
    peer_handler.peer_table.record_failure(&request.peer_id).await?;
    download_queue.extend(request.requests);
    return Ok(None);
}

Issue: A malicious peer could repeatedly return empty responses, causing the same work to be re-queued indefinitely while only incrementing a failure counter.

Recommendation: Implement progressive penalties (e.g., temporary peer bans after N consecutive empty responses).
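The progressive-penalty idea could be sketched as a per-peer counter of consecutive empty responses that triggers a ban after a threshold. Everything below is a hypothetical illustration: the threshold, the `String` peer-id, and the `PeerPenalties` type are assumptions, not ethrex APIs.

```rust
use std::collections::HashMap;

// Hedged sketch of progressive peer penalties: ban a peer after N
// consecutive empty responses; any successful response resets the
// streak. Threshold and PeerId type are illustrative assumptions.
const MAX_CONSECUTIVE_EMPTY: u32 = 3;

#[derive(Default)]
struct PeerPenalties {
    consecutive_empty: HashMap<String, u32>,
}

impl PeerPenalties {
    /// Records an empty response; returns true once the peer should be banned.
    fn record_empty(&mut self, peer_id: &str) -> bool {
        let count = self
            .consecutive_empty
            .entry(peer_id.to_string())
            .or_insert(0);
        *count += 1;
        *count >= MAX_CONSECUTIVE_EMPTY
    }

    /// A successful response resets the consecutive-empty streak.
    fn record_success(&mut self, peer_id: &str) {
        self.consecutive_empty.remove(peer_id);
    }
}

fn main() {
    let mut penalties = PeerPenalties::default();
    assert!(!penalties.record_empty("peer-a"));
    assert!(!penalties.record_empty("peer-a"));
    penalties.record_success("peer-a"); // streak reset by a good response
    assert!(!penalties.record_empty("peer-a"));
    assert!(!penalties.record_empty("peer-a"));
    assert!(penalties.record_empty("peer-a")); // third consecutive -> ban
    println!("progressive penalty logic ok");
}
```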


Minor Issues

11. Metric Removed Without Replacement

File: crates/networking/p2p/sync/healing/state.rs:125-127 (removed)

The healing_empty_try_recv metric was removed when switching from try_recv() to tokio::select!. Consider adding a replacement metric for timeout events to monitor peer responsiveness.


12. Magic Number for Peer Logging

File: crates/networking/p2p/sync/healing/state.rs:329

*logged_no_free_peers_count = 1000;

This magic number (log once per ~1000 iterations) should be a named constant like NO_PEERS_LOG_THROTTLE.


Positive Observations

  1. Consistent naming: All _no_wal methods follow a clear naming convention
  2. Non-breaking: Default behavior preserved via trait default method
  3. Proper async boundaries: spawn_blocking used correctly for RocksDB operations
  4. Memory batching: The 100k node threshold before DB writes is reasonable
  5. Benchmark data: Comprehensive performance measurements provided in PR description

Recommendations

High Priority:

  1. Fix panic on missing parent (state.rs:406)
  2. Handle channel send errors properly (state.rs:348)
  3. Propagate DB write errors instead of panicking (storage.rs:228)

Medium Priority:
4. Add crash recovery documentation for WAL-disabled mode
5. Implement progressive peer penalties for empty responses
6. Consider making MAX_IN_FLIGHT_REQUESTS configurable

Low Priority:
7. Replace magic number 1000 with named constant
8. Add replacement metric for timeout events
9. Document unsafe memory usage in RocksDBLocked


Verdict

Conditional Approval - The performance improvements are significant and the core WAL bypass implementation is correct. However, the panic on missing parent and ignored channel errors need to be addressed before merging to production. The changes are well-structured and follow Rust best practices overall.


Automated review by Claude (Anthropic) · custom prompt

Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes snap sync throughput by allowing storage writes to bypass the RocksDB WAL during sync and by increasing concurrency in state healing (plus removing busy-polling loops).

Changes:

  • Add a commit_no_wal() API to the storage write-batch abstraction and implement it for RocksDB (via WriteOptions::disable_wal(true)), wiring it through trie DB and store helpers.
  • Switch snap sync write paths (tries, storage trie nodes, bytecodes) to use the new no-WAL variants.
  • Improve healing performance by dispatching multiple concurrent state-healing requests and replacing try_recv()/busy loops with tokio::select! + timeouts.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
crates/storage/api/mod.rs Extends StorageWriteBatch with commit_no_wal() (defaulting to commit()).
crates/storage/backend/rocksdb.rs Implements commit_no_wal() using RocksDB write_opt with WAL disabled.
crates/storage/trie.rs Adds BackendTrieDB::with_no_wal() and routes trie batch commits through commit_no_wal() when enabled.
crates/storage/store.rs Adds _no_wal variants for key snap-sync write paths and async batch writes; adds no-WAL trie open helpers.
crates/networking/p2p/sync/snap_sync.rs Switches snap sync writes and trie opens to no-WAL variants.
crates/networking/p2p/sync/healing/state.rs Dispatches multiple healing requests concurrently and replaces busy polling with tokio::select!.
crates/networking/p2p/sync/healing/storage.rs Replaces try_recv() + yield_now() polling with tokio::select! on recv + timeout; uses no-WAL writes for healed batches.


}
Err(TryRecvError::Disconnected) => {
state.disconnected_count += 1;
_ = tokio::time::sleep(Duration::from_secs(1)) => {

Copilot AI Feb 11, 2026


state.empty_count is still reported via METRICS.healing_empty_try_recv, but after switching from try_recv() to tokio::select! it is never incremented anywhere, so this metric/log field will always stay at 0. Consider incrementing empty_count on the timeout branch (or removing the counter/metric if it’s no longer meaningful).

Suggested change
_ = tokio::time::sleep(Duration::from_secs(1)) => {
_ = tokio::time::sleep(Duration::from_secs(1)) => {
state.empty_count += 1;

Contributor


Agreed

Comment on lines +304 to +305
let batch: Vec<RequestMetadata> =
paths.drain(0..min(paths.len(), NODE_BATCH_SIZE)).collect();

Copilot AI Feb 11, 2026


paths.drain(0..min(paths.len(), NODE_BATCH_SIZE)) repeatedly drains from the front of a Vec, which is O(n) per drain due to element shifting. With multi-dispatch enabled this can become a noticeable CPU cost. Consider switching paths to a VecDeque (pop from front) or draining from the end (e.g., split_off) if ordering isn’t important.

Suggested change
let batch: Vec<RequestMetadata> =
paths.drain(0..min(paths.len(), NODE_BATCH_SIZE)).collect();
let batch_size = min(paths.len(), NODE_BATCH_SIZE);
let batch: Vec<RequestMetadata> = paths.split_off(paths.len() - batch_size);
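The suggested change above can be demonstrated in isolation. This sketch uses `u32` elements and an arbitrary `NODE_BATCH_SIZE` in place of the real `RequestMetadata` type and constant; `split_off` returns the tail in O(batch) time without shifting the remaining elements, unlike `drain(0..k)` from the front.

```rust
use std::cmp::min;

// Taking a batch from the END of a Vec with split_off instead of
// draining from the front. Element type and NODE_BATCH_SIZE are
// stand-ins for the real RequestMetadata and constant.
const NODE_BATCH_SIZE: usize = 3;

fn take_batch_from_end(paths: &mut Vec<u32>) -> Vec<u32> {
    let batch_size = min(paths.len(), NODE_BATCH_SIZE);
    // split_off keeps the first `len - batch_size` elements in place
    // and returns the tail; no element shifting occurs.
    paths.split_off(paths.len() - batch_size)
}

fn main() {
    let mut paths = vec![1, 2, 3, 4, 5];
    let batch = take_batch_from_end(&mut paths);
    assert_eq!(batch, vec![3, 4, 5]);
    assert_eq!(paths, vec![1, 2]);
    println!("batch taken from end without shifting");
}
```

Note that this reverses the order in which paths are handed out; as the review says, it is only valid if request ordering doesn't matter, otherwise a `VecDeque` with `pop_front` is the alternative.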

)>,
logged_no_free_peers_count: &mut u32,
) {
while (*inflight_tasks as u32) < MAX_IN_FLIGHT_REQUESTS && !paths.is_empty() {
Contributor


(*inflight_tasks as u32) < MAX_IN_FLIGHT_REQUESTS — the as u32 truncates silently if inflight_tasks ever exceeds u32::MAX. Can't happen in practice (max 77), but the cast is unnecessary since MAX_IN_FLIGHT_REQUESTS is a constant: either change inflight_tasks to u32 (it's only used as a counter here), or cast the constant up: *inflight_tasks < MAX_IN_FLIGHT_REQUESTS as u64.

*inflight_tasks += 1;
let peer_table = peers.peer_table.clone();

tokio::spawn(async move {
Contributor


Fire-and-forget tokio::spawn — if this task panics (e.g., request_state_trienodes panics), inflight_tasks is never decremented and the outer tokio::select! will poll the channel forever (retrying every 1s on timeout). Pre-existing issue, but now amplified since up to 77 tasks can be in flight.

Consider collecting the JoinHandle (or using JoinSet) and checking for panics when draining responses, similar to how db_joinset is handled elsewhere in this file.

@github-project-automation github-project-automation bot moved this to In Progress in ethrex_l1 Feb 12, 2026

Labels

L1 Ethereum client performance Block execution throughput and performance in general

Projects

Status: In Progress
Status: Todo

Development

Successfully merging this pull request may close these issues.

4 participants