# perf(l1): adaptive request sizing, storage bisection, and parallel trie in snap sync #6181

ilitteri wants to merge 2 commits into `snap-sync-phase1-optimizations`
## Conversation
**Adaptive request sizing:** add per-peer `response_bytes` tuning based on response latency. Start at 128 KB, scale up 1.5x when latency < 2 s, scale down 1.5x when > 10 s, clamped to [50 KB, 2 MB]. Applied to all snap protocol request types (account ranges, storage ranges, trie nodes, bytecodes). Includes `RequestSizerMap` in `PeerHandler` with 7 unit tests.

**Storage range bisection:** when a big-account sub-range returns incomplete, bisect the remaining range into two sub-tasks dispatched to different peers, doubling download bandwidth for large contracts. Falls back to a single continuation for ranges smaller than 2.
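The scaling rule above can be sketched as follows. `PeerSizer` and `record_response` mirror names that appear elsewhere in this PR, but the body is an illustrative reconstruction from the stated constants (128 KB start, 1.5x steps, 2 s / 10 s thresholds, [50 KB, 2 MB] clamp), not the actual implementation:

```rust
use std::time::Duration;

const INITIAL_BYTES: u64 = 128 * 1024; // 128 KB starting budget
const MIN_BYTES: u64 = 50 * 1024; // 50 KB floor
const MAX_BYTES: u64 = 2 * 1024 * 1024; // 2 MB ceiling
const LATENCY_LOW: Duration = Duration::from_secs(2); // scale up below this
const LATENCY_HIGH: Duration = Duration::from_secs(10); // scale down above this

/// Per-peer adaptive request size (illustrative reconstruction).
pub struct PeerSizer {
    pub response_bytes: u64,
}

impl PeerSizer {
    pub fn new() -> Self {
        Self { response_bytes: INITIAL_BYTES }
    }

    /// Fast responses grow the budget 1.5x; slow ones shrink it 1.5x
    /// (i.e. multiply by 2/3 in integer math). Both directions clamp
    /// to [MIN_BYTES, MAX_BYTES]; mid-range latencies leave it as-is.
    pub fn record_response(&mut self, latency: Duration) {
        if latency < LATENCY_LOW {
            self.response_bytes = (self.response_bytes * 3 / 2).min(MAX_BYTES);
        } else if latency > LATENCY_HIGH {
            self.response_bytes = (self.response_bytes * 2 / 3).max(MIN_BYTES);
        }
    }
}
```

Note that one fast response followed by one slow response returns the budget exactly to its starting value (192 KB × 2/3 = 128 KB), so the controller is symmetric around the initial size.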
**Parallel trie construction:** partition the sorted account iterator into 16 buckets by the first nibble of the H256 key and build each subtrie in parallel using `std::thread::scope`. Merge the 16 subtrie root hashes into a single root `BranchNode`. Refactor `trie_from_sorted_accounts` into inner + finalize layers so the core loop can be reused for subtrie construction. Falls back to a sequential build when only 0-1 nibble buckets are non-empty. Includes 5 new tests for parallel construction. Call site in snap_sync.rs updated to use `trie_from_sorted_accounts_parallel`.
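The bucketing step can be illustrated with a minimal sketch, using `[u8; 32]` in place of `H256`; `partition_by_first_nibble` is a hypothetical helper, not the PR's actual code:

```rust
/// Partition sorted (key, value) pairs into 16 buckets by the key's first
/// nibble. Relative order within each bucket is preserved, so each bucket
/// is itself sorted and can feed the sequential subtrie builder directly.
pub fn partition_by_first_nibble<V>(
    sorted: Vec<([u8; 32], V)>,
) -> [Vec<([u8; 32], V)>; 16] {
    let mut buckets: [Vec<([u8; 32], V)>; 16] = std::array::from_fn(|_| Vec::new());
    for (key, value) in sorted {
        let nibble = (key[0] >> 4) as usize; // high nibble of the first byte
        buckets[nibble].push((key, value));
    }
    buckets
}
```

Each bucket can then be handed to its own scoped thread, and the 16 resulting subtrie roots merged into the final root `BranchNode`.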
### 🤖 Kimi Code Review

**Review Summary**

This PR introduces parallel trie building and adaptive request sizing for snap sync. Overall, the changes are well-structured and improve performance, but there are several issues to address:

**Critical Issues**

1. **Race Condition in Parallel Trie Building (trie_sorted.rs:320-330)**
   The parallel trie building uses a shared ThreadPool for DB writes. Fix: Add explicit synchronization or use thread-local DB instances.

2. **Incorrect Subtrie Root Handling (trie_sorted.rs:408-415)**
   When merging subtrie results, the code incorrectly copies the subtrie's Nth branch choice rather than its root hash:

   ```rust
   // Current (incorrect):
   root_choices[nibble] = branch.choices[nibble].clone();
   // Should be:
   root_choices[nibble] = branch.compute_hash(...); // or similar
   ```

3. **Potential Overflow in Storage Range Bisection (client.rs:700-705)**
   The bisection midpoint arithmetic can overflow. Fix: Add overflow checks or use saturating arithmetic.

**Security Issues**

4. **Unbounded Channel Growth (trie_sorted.rs:349-351)**
   The parallel trie builder creates bounded channels but doesn't handle the case where all buffers are exhausted, potentially causing deadlocks.

5. **Request Sizer DoS Vector (request_sizer.rs:60-65)**
   The adaptive request sizer can be manipulated by malicious peers sending artificially delayed responses to reduce request sizes for legitimate peers. Mitigation: Add peer reputation scoring or request size floors.

**Performance Issues**

6. **Inefficient Partitioning (trie_sorted.rs:320-330)**
   The current partitioning only uses the first nibble (4 bits), which may create very uneven distributions. Consider using more bits or dynamic partitioning.

7. **Excessive Thread Spawning (trie_sorted.rs:349-375)**
   Spawning 16 threads regardless of actual data distribution is wasteful. Only spawn threads for non-empty buckets.

**Code Quality Issues**

8. **Type Complexity (trie_sorted.rs:163)**

9. **Missing Error Handling (trie_sorted.rs:408)**
   Thread join errors are converted to a generic `TrieGenerationError::ThreadJoinError`.

10. **Inconsistent Naming (trie_sorted.rs:309 vs 349)**
    Function naming is inconsistent between the two locations.

**Minor Issues**

**Recommendations**

The parallel trie building is a significant performance improvement, but the synchronization issues must be addressed before merging.

*Automated review by Kimi (Moonshot AI) · custom prompt*
### 🤖 Claude Code Review

**Code Review: PR #6181 - Adaptive Request Sizing, Storage Bisection, and Parallel Trie Construction**

**Summary**

This PR implements three performance optimizations for snap sync: adaptive request sizing, storage range bisection, and parallel account trie construction. The implementation is well-structured with good test coverage. Below are specific findings organized by severity.

**Critical Issues**

1. **Incorrect Subtrie Root Merging Logic (trie_sorted.rs:420)**

   ```rust
   root_choices[nibble] = branch.choices[nibble].clone();
   ```

   Issue: This extracts the wrong choice from the subtrie. Each subtrie's root branch has only its own nibble's choice populated, so this reads the Nth choice of that branch instead of referencing the subtrie's own root.
   Impact: This will produce incorrect state root hashes, breaking consensus. The parallel implementation will not match the sequential one.
   Fix: The subtrie root should be computed and stored differently.

**High Priority Issues**

2. **Race Condition in Parallel Trie Construction (trie_sorted.rs:376)**

   ```rust
   let pool = Arc::new(ThreadPool::new(16, s));
   ```

   Issue: All 16 subtrie threads share a single ThreadPool with 16 workers for DB writes. This could cause contention when multiple subtries try to flush nodes simultaneously.
   Impact: Performance degradation, potential deadlock if all workers are occupied waiting for DB I/O.
   Suggestion: Consider per-subtrie ThreadPools or a larger shared pool (e.g., 32-64 workers) to avoid blocking.

3. **Missing Synchronization for Request Timing (snap/client.rs:468, 610, etc.)**

   ```rust
   request_sizer.record_response(&peer_id, request_start.elapsed());
   ```

   Issue: Latency is only recorded for successful responses; failures and timeouts never feed back into the sizer.
   Impact: Continued allocation of large requests to unresponsive peers, wasting bandwidth and time.
   Fix: Record failures/timeouts with a penalty latency.

4. **Storage Bisection Interval Tracking Bug (snap/client.rs:728-735)**

   ```rust
   for (old_start, end) in old_intervals.iter_mut() {
       if *end == hash_end {
           *old_start = hash_start;
           *end = midpoint;
           break;
       }
   }
   old_intervals.push((midpoint_next, hash_end));
   ```

   Issue: This modifies the first matching interval to end at `midpoint` and pushes a new interval for the upper half.
   Impact: Could lead to incorrect tracking of downloaded ranges, potential data corruption or incomplete downloads.
   Verify: Ensure the interval matched by `*end == hash_end` is the intended one.

5. **Panic in Error Path (snap/client.rs:780)**

   ```rust
   if acc_hash.is_zero() {
       panic!("Should have found the account hash");
   }
   ```

   Issue: Using `panic!` here aborts the sync task.
   Impact: Node crash during sync instead of graceful error handling.
   Fix: Return a proper error instead of panicking.

**Medium Priority Issues**

6. **Lock Poisoning Error Messages (snap/request_sizer.rs:75, 85)**

   ```rust
   .expect("RequestSizerMap lock poisoned")
   ```

   Issue: Lock poisoning from a panic in one thread will crash all other threads trying to access the map.
   Impact: Cascading failures during sync.
   Suggestion: Recover from a poisoned lock instead of panicking.

7. **Integer Overflow in Bisection (snap/client.rs:707-710)**

   ```rust
   let midpoint_u256 = start_u256 + range_size / 2;
   let midpoint = H256::from_uint(&midpoint_u256);
   let midpoint_next_u256 = midpoint_u256 + 1;
   ```

   Issue: If the midpoint lands at the top of the keyspace, `midpoint_u256 + 1` can overflow.
   Impact: Incorrect range calculation leading to duplicate or missing storage slots.
   Suggestion: Add an overflow check or use checked/saturating arithmetic.

8. **Hardcoded Thread Count (trie_sorted.rs:376)**

   ```rust
   let pool = Arc::new(ThreadPool::new(16, s));
   ```

   Issue: Uses 16 threads for the pool regardless of available CPU cores. On machines with <16 cores, this causes oversubscription; on machines with >16 cores, it underutilizes resources.
   Suggestion: Use `std::thread::available_parallelism()`.

9. **Missing Error Handling in Parallel Trie (trie_sorted.rs:408)**

   ```rust
   .expect("Failed to spawn subtrie thread");
   ```

   Issue: Thread spawn failures panic instead of returning an error. While rare, this could occur under resource exhaustion.
   Suggestion: Return an error variant instead.

**Low Priority / Style Issues**

10. **Unnecessary Clone (snap/client.rs:335)**

    ```rust
    buffer_sender.clone(),
    ```

    Minor: The original sender could likely be moved here rather than cloned.

11. **Test Coverage Gap**
    Missing test: No test for the case where parallel trie construction produces a different result than sequential (i.e., the bug in issue #1). The existing tests compare parallel vs sequential, which is good, but they would all fail together if both are wrong.
    Suggestion: Add a test with a known state root from a real Ethereum block to validate correctness absolutely, not just consistency between implementations.

12. **Magic Number Documentation (request_sizer.rs:18-21)**

    ```rust
    const LATENCY_LOW: Duration = Duration::from_secs(2);
    const LATENCY_HIGH: Duration = Duration::from_secs(10);
    ```

    Suggestion: Add comments explaining why these specific thresholds were chosen (e.g., based on Nethermind's implementation, empirical testing, or network characteristics).

**Positive Observations**

✅ Excellent test coverage for the new code (7 request sizer tests, 5 parallel trie tests).

**Performance Considerations**

13. **Memory Usage in Parallel Trie Construction**
    The parallel implementation loads all accounts into memory (line 359-361) before processing, which doubles memory usage during the account insertion phase. For chains with many accounts, this could be significant.
    Tradeoff: This is acceptable for the 33% speedup, but worth monitoring in production.

14. **Bisection Overhead**
    The bisection logic (lines 702-757) is complex and runs on the hot path. Ensure this doesn't add more overhead than it saves, especially for accounts with small storage.
    Current mitigation: The fallback to a single continuation for small ranges limits this.

**Security Considerations**

15. **No DoS Protection in Adaptive Sizing**
    A malicious peer could intentionally respond quickly to inflate its budget to 2 MB, then become slow, wasting client bandwidth.
    Mitigation: The scale-down mechanism (lines 49-52) will eventually throttle bad peers, but it may take several rounds. Consider a faster backoff or reputation system.
    Verdict: Acceptable for current threat model, but document for future improvement.

**Recommendations**

The optimization approach is sound and well-implemented overall, but the critical subtrie merging bug must be addressed before merge.

*Automated review by Claude (Anthropic) · custom prompt*
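On the integer-overflow concern flagged for the bisection midpoint (`midpoint_u256 + 1` at the top of the keyspace): computing the midpoint as `start + (end - start) / 2` and guarding the `+ 1` with `checked_add` sidesteps both hazards. A minimal sketch with `u64` standing in for `U256` (the helper name `bisect` is illustrative, not the PR's code):

```rust
/// Overflow-safe bisection of an inclusive [start, end] range.
/// Returns (midpoint, upper_start): the lower half is [start, midpoint],
/// the upper half is [upper_start, end]. None if the range cannot be
/// split further (midpoint sits at the end of the keyspace).
fn bisect(start: u64, end: u64) -> Option<(u64, u64)> {
    debug_assert!(start <= end);
    // start + (end - start) / 2 cannot overflow, unlike (start + end) / 2.
    let midpoint = start + (end - start) / 2;
    // checked_add guards the end-of-keyspace case where midpoint == u64::MAX.
    let upper_start = midpoint.checked_add(1)?;
    Some((midpoint, upper_start))
}
```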
### Greptile Overview

**Greptile Summary**

Implemented three performance optimizations for snap sync: adaptive request sizing (1.5x scaling based on peer latency), storage range bisection for parallel big-account downloads, and 16-way parallel account trie construction.

**Key Changes:**
**Critical Issue:**

**Testing:**

**Confidence Score: 3/5**
| Filename | Overview |
|---|---|
| crates/networking/p2p/snap/request_sizer.rs | New file implementing adaptive request sizing with per-peer latency tracking. Implementation is clean with comprehensive unit tests. |
| crates/common/trie/trie_sorted.rs | Added parallel 16-way trie construction with refactored core logic. Critical bug found in line 420 where wrong branch choice is selected for root merge. |
| crates/networking/p2p/snap/client.rs | Integrated adaptive request sizing and storage range bisection. Changes are well-structured with proper latency tracking. |
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Client as Snap Client
    participant Sizer as RequestSizerMap
    participant Peer1 as Fast Peer
    participant Peer2 as Slow Peer
    participant DB as Trie DB
    Note over Client,DB: Adaptive Request Sizing Flow
    Client->>Sizer: response_bytes_for_peer(peer1)
    Sizer-->>Client: 128KB (initial)
    Client->>Peer1: GetAccountRange(128KB)
    Note over Peer1: Response in 1s
    Peer1-->>Client: AccountRange data
    Client->>Sizer: record_response(peer1, 1s)
    Note over Sizer: Latency < 2s → scale up 1.5x
    Client->>Sizer: response_bytes_for_peer(peer1)
    Sizer-->>Client: 192KB (scaled up)
    Client->>Sizer: response_bytes_for_peer(peer2)
    Sizer-->>Client: 128KB (initial)
    Client->>Peer2: GetStorageRanges(128KB)
    Note over Peer2: Response in 12s
    Peer2-->>Client: StorageRanges data
    Client->>Sizer: record_response(peer2, 12s)
    Note over Sizer: Latency > 10s → scale down 1.5x
    Note over Client,DB: Storage Range Bisection
    Client->>Peer1: GetStorageRanges[hash_start..hash_end]
    Peer1-->>Client: Incomplete response (partial)
    Note over Client: Bisect range at midpoint
    Client->>Peer1: GetStorageRanges[hash_start..midpoint]
    Client->>Peer2: GetStorageRanges[midpoint+1..hash_end]
    Note over Client: Parallel download from 2 peers
    Note over Client,DB: Parallel Trie Construction
    par 16-way parallel build
        Client->>DB: Subtrie 0 (nibble 0x0)
        Client->>DB: Subtrie 1 (nibble 0x1)
        Client->>DB: Subtrie F (nibble 0xF)
    end
    Note over Client,DB: Merge 16 subtrie roots
    Client->>DB: Write root BranchNode
```
### 🤖 Codex Code Review

Here are the findings from the PR diff.

**Findings**

If you want, I can dig further into the trie build logic and verify the branch merge against the MPT invariants, or prototype a memory-safe parallelization approach. Potential next steps:

*Automated review by OpenAI Codex · custom prompt*
```rust
    .join()
    .map_err(|_| TrieGenerationError::ThreadJoinError())??;
if let Some(branch) = branch_opt {
    root_choices[nibble] = branch.choices[nibble].clone();
```
**incorrect branch choice selection**

Each subtrie builds nodes for keys starting with nibble N, so the root choice at index N should reference the subtrie's root. However, this line uses `branch.choices[nibble]`, which reads the Nth choice from the subtrie's root branch, not the subtrie's own root hash.

```suggestion
root_choices[nibble] = branch.compute_hash_no_alloc(&mut Vec::new()).into();
```

This will cause incorrect trie root hashes when accounts span multiple nibbles.
**ElFantasma** left a comment
4 comments, but mostly observations, take a look to be sure you are ok with those or you prefer some fix. I'm approving this in any case since it LGTM
```rust
    .join()
    .map_err(|_| TrieGenerationError::ThreadJoinError())??;
if let Some(branch) = branch_opt {
    root_choices[nibble] = branch.choices[nibble].clone();
```
This line relies on the invariant that `trie_from_sorted_accounts_inner`, when fed keys all sharing the same first nibble N, returns a `BranchNode` where only `choices[N]` is populated. That's true today because `_inner` processes full H256 paths (not stripped keys), but it's subtle and fragile — a future caller passing pre-stripped keys would silently produce a wrong trie.

Consider adding a debug assertion:

```rust
if let Some(branch) = branch_opt {
    debug_assert!(
        branch.choices.iter().enumerate()
            .filter(|(_, c)| c.is_valid())
            .all(|(i, _)| i == nibble),
        "subtrie for nibble {nibble} has unexpected populated choices"
    );
    root_choices[nibble] = branch.choices[nibble].clone();
}
```
```rust
// Build 16 subtries in parallel using scoped threads + shared ThreadPool for DB writes
scope(|s| {
    let pool = Arc::new(ThreadPool::new(16, s));
```
This creates a ThreadPool with 16 threads for DB writes, plus 16 scoped threads (one per nibble) are spawned at line 388. That's up to 32 threads competing for CPU. On the benchmark machine (12 cores) this is fine, but on smaller machines the over-subscription could hurt.

Consider sizing the pool based on available parallelism, e.g.:

```rust
let num_threads = std::thread::available_parallelism()
    .map(|n| n.get())
    .unwrap_or(4);
let pool = Arc::new(ThreadPool::new(num_threads, s));
```

```rust
/// Cheaply cloneable (Arc-backed). Pass clones to spawned tasks.
#[derive(Debug, Clone, Default)]
pub struct RequestSizerMap {
    inner: Arc<Mutex<HashMap<H256, PeerSizer>>>,
}
```
Entries for disconnected peers are never removed from this HashMap, so it grows monotonically over the lifetime of the sync. Each entry is small (~16 bytes), but during a long mainnet sync the node may see hundreds of unique peers.
Not critical, but worth noting — could add a simple cap (e.g., evict oldest entries when size exceeds 256) or tie cleanup to peer disconnection events.
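A minimal sketch of the suggested cap, assuming the simplest policy of evicting an arbitrary entry on insert; `insert_with_cap` is a hypothetical helper (a real version would evict by last use or hook peer-disconnect events):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Bound a per-peer map by evicting an arbitrary entry once `cap` is
/// reached. This only bounds memory; it does not pick a smart victim.
pub fn insert_with_cap<K: Clone + Eq + Hash, V>(
    map: &mut HashMap<K, V>,
    cap: usize,
    key: K,
    value: V,
) {
    if map.len() >= cap && !map.contains_key(&key) {
        // Evict some entry to make room (arbitrary victim).
        if let Some(victim) = map.keys().next().cloned() {
            map.remove(&victim);
        }
    }
    map.insert(key, value);
}
```

Even this crude policy keeps the map at the cap across an unbounded stream of peer IDs, which addresses the monotonic-growth concern.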
```rust
pub const MIN_RESPONSE_BYTES_ADAPTIVE: u64 = 50 * 1024;

/// Maximum response bytes for adaptive sizing (2 MB).
pub const MAX_RESPONSE_BYTES_ADAPTIVE: u64 = 2 * 1024 * 1024;
```
nit: `MAX_RESPONSE_BYTES_ADAPTIVE` (2 MB) exceeds the server-side cap `MAX_RESPONSE_BYTES` (512 KB). Against ethrex peers, requests above 512 KB will never return more data — the peer truncates at its own limit. This means the sizer will plateau at 512 KB effective throughput regardless of how high it scales.

Not a bug (peers just return less), but might be worth documenting the relationship, or capping at `MAX_RESPONSE_BYTES` unless explicitly targeting non-ethrex peers.
## Motivation
Follow-up to #6178 (Tier 1: WAL disable, multi-peer healing, async select). Profiling Tier 1 results on Hoodi showed remaining bottlenecks in fixed-size network requests, sequential big-account storage downloads, and single-threaded account trie construction. Nethermind uses adaptive request sizing, range bisection, and parallel trie BulkSet to address these.
## Description

Two commits implementing three Tier 2 optimizations:

1. **Adaptive request sizing** (5efae33)
   - `RequestSizerMap`: per-peer response_bytes tuning based on response latency
2. **Storage range bisection** (5efae33)
3. **Parallel 16-way account trie construction** (fb91a34)
   - `std::thread::scope` (16 scoped threads)
   - refactor `trie_from_sorted_accounts` into `_inner` + `finalize_root` layers for reuse

**Benchmark results** (Hoodi, ethrex-mainnet-4: 12 cores, 62 GB RAM)

Phases 1, 2 are network-bound (peer variability at startup). CPU/IO phases show consistent improvements:
## How to Test

Run snap sync on Hoodi and compare total sync time against main. Look for `Sync cycle finished successfully time_elapsed_s=...` in the logs. Run `cargo test -p ethrex-p2p -- request_sizer` and `cargo test -p ethrex-trie -- trie_sorted` for unit tests.