
perf(l1): block pipeline optimizations #6163

Draft
ilitteri wants to merge 6 commits into main from block-pipeline-optimizations

Conversation


@ilitteri ilitteri commented Feb 9, 2026

Motivation

Profiling ethrex at steady state on Hoodi (after a 75% total CPU reduction across 3 prior optimization rounds) shows the remaining bottlenecks: the TrieLayerCache deep clone in the RCU pattern (~875KB bloom filter copied per block), redundant trie node reads from RocksDB by both the warmer and merkleizer threads, and receipt encoding done synchronously during validation.

Description

Three optimizations targeting the block pipeline:

  • Arc-wrap bloom filter in TrieLayerCache — eliminates ~875KB AtomicBloomFilter deep copy that happens every block in apply_trie_updates. Since AtomicBloomFilter::insert(&self) is already atomic, sharing via Arc is safe. The layers HashMap clone remains but is cheap (values are Arc<TrieLayer>).
  • Shared trie node read cache — 16-shard LRU cache (TrieNodeCache) at the BackendTrieDB level, shared across warmer, merkleizer, and execution threads via Arc. When the warmer reads a trie node ahead of execution, the merkleizer gets a cache hit instead of another RocksDB read transaction. Shard count matches the merkleizer's 16 worker threads.
  • Pre-encode receipts during execution — receipt bloom+RLP encoding (encode_inner_with_bloom) is done once per tx during execution and passed through BlockExecutionResult. Validation uses the pre-encoded bytes directly via validate_receipts_root_from_encoded, avoiding redundant re-encoding.

EXPB Benchmark Results

fast scenario

| Metric         | main      | block-pipeline-optimizations | Δ                 |
|----------------|-----------|------------------------------|-------------------|
| mgas/s avg     | 660.9     | 656.5                        | -0.7% (noise)     |
| mgas/s max     | 1659      | 1742                         | +5.0%             |
| latency avg    | 47.68 ms  | 45.30 ms                     | -5.0%             |
| latency median | 27.94 ms  | 25.73 ms                     | -7.9%             |
| latency p95    | 154.70 ms | 147.27 ms                    | -4.8%             |
| latency p99    | 194.19 ms | 273.75 ms                    | +41.0% (outlier)  |

gigablocks scenario

| Metric     | main (ID 233) | block-pipeline-optimizations | Δ      |
|------------|---------------|------------------------------|--------|
| mgas/s avg | 1546.9        | 1255.4                       | -18.8% |

Note: The gigablocks main baseline (ID 233) was run on a different date (older main commit) and may not be directly comparable. The individual optimization branches are still running and will provide more granular data.

Test plan

  • All crate tests pass (ethrex-storage, ethrex-common, ethrex-blockchain, ethrex-vm)
  • No clippy warnings
  • EXPB fast benchmark: -5% latency avg, -7.9% latency median, -4.8% latency p95
  • Individual optimization branch benchmarks still running

@ilitteri ilitteri added the L1 Ethereum client label Feb 9, 2026
@github-actions github-actions bot added the performance Block execution throughput and performance in general label Feb 9, 2026
@ethrex-project-sync ethrex-project-sync bot moved this to In Progress in ethrex_l1 Feb 9, 2026

github-actions bot commented Feb 9, 2026

🤖 Kimi Code Review

Review Summary

This PR introduces several performance optimizations focused on reducing redundant computation and memory copies. The changes look solid overall, with good attention to performance bottlenecks. Here are my findings:

Issues Found

  1. Potential Cache Coherency Issue (Critical)

    • File: crates/storage/store.rs, lines 2797-2802
    • Issue: The trie node cache is cleared after applying trie updates, but there's a race condition window where other threads might read stale data. The cache invalidation should happen atomically with the layer update.
  2. Missing Cache Invalidation (Medium)

    • File: crates/storage/trie.rs, CachingBackendTrieDB and CachingBackendTrieDBLocked
    • Issue: When trie nodes are written via put_batch, the cache isn't invalidated. This could lead to stale reads if the same key is read immediately after write.
  3. Inefficient Cache Key Construction (Low)

    • File: crates/storage/trie.rs, line 248 in CachingBackendTrieDB::get
    • Issue: The make_key method is called twice - once for cache lookup and once for DB lookup. Store it in a local variable to avoid double computation.
  4. Missing Error Handling (Low)

    • File: crates/storage/trie.rs, lines 190-193 in TrieNodeCache::new
    • Issue: The expect message is misleading - the capacity can be 1, not "always > 0". Consider using unwrap_or_else with a proper error message.

Positive Aspects

  1. Excellent Performance Focus: The PR correctly identifies and addresses real bottlenecks:

    • ~875KB bloom filter copy elimination via Arc
    • Redundant bloom computation avoidance
    • Shared trie node cache for cross-thread efficiency
  2. Good Thread Safety: The Arc wrapping of the bloom filter maintains thread safety while eliminating expensive copies.

  3. Clean API Design: The new validate_receipts_root_from_encoded function provides a clean interface for using pre-encoded receipts.

  4. Proper Cache Sharding: The 16-shard LRU cache design minimizes lock contention.

Suggested Fixes

  1. Fix cache invalidation race:
// In store.rs, make cache invalidation atomic with layer update
let mut trie_cache_guard = trie_cache.lock().map_err(|_| StoreError::LockError)?;
trie_node_cache.clear();
*trie_cache_guard = Arc::new(trie_mut);
  2. Add cache invalidation on writes:
// In CachingBackendTrieDB::put_batch, add:
for (key, _) in &key_values {
    let prefixed_key = self.inner.make_key(*key);
    self.cache.insert(prefixed_key, Vec::new()); // or remove
}
  3. Optimize key construction:
let prefixed_key = self.inner.make_key(key);
if let Some(value) = self.cache.get(&prefixed_key) { ... }

Overall, this is a well-executed performance optimization PR with minor issues that should be addressed before merging.


Automated review by Kimi (Moonshot AI) · custom prompt


github-actions bot commented Feb 9, 2026

🤖 Claude Code Review


Code Review: PR #6163 - Block Pipeline Performance Optimizations

Overview

This PR targets three performance bottlenecks in the block processing pipeline through targeted optimizations. The approach is solid and well-documented.


Critical Issues

1. Trie Node Cache Invalidation Too Aggressive ⚠️

Location: crates/storage/store.rs:2797

trie_node_cache.clear();

Issue: The cache is cleared after every apply_trie_updates call when bottom layers are committed. This defeats the purpose of the cache since:

  • The cache will be cold for every subsequent block's warmer/merkleizer threads
  • Only the removed bottom layer nodes are stale, not the entire cache
  • Most trie nodes are shared across blocks (especially upper trie levels)

Recommendation: Instead of clearing the entire cache:

  1. Track which keys are affected by the commit operation
  2. Only invalidate those specific keys, or
  3. Accept the small risk of stale entries (they'll naturally age out via LRU) since trie nodes are content-addressed (hash → data), making stale entries unlikely to cause correctness issues

Severity: High - This significantly reduces the effectiveness of the optimization.


2. Cache Uses Write Lock for Reads

Location: crates/storage/trie.rs:215-222

pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
    let idx = Self::shard_index(key);
    self.shards[idx]
        .write()  // ← Should be read()
        .ok()?
        .get(key)
        .cloned()
}

Issue: TrieNodeCache::get() acquires a write lock when it should use a read lock. LRU cache's get() updates internal metadata (moves item to front), but most LRU implementations offer a peek() method for read-only access.

Impact: Unnecessary lock contention - multiple threads trying to read from the same shard will serialize unnecessarily.

Fix: Either:

  • Use self.shards[idx].read() and call peek() instead of get(), or
  • Accept the write lock but document why (LRU metadata updates)

Check if the lru crate provides peek() - if so, switch to read lock + peek for better concurrency.


Major Issues

3. Missing Cache Size Tuning Guidance

Location: crates/storage/store.rs:1468

trie_node_cache: Arc::new(crate::trie::TrieNodeCache::new(4096)),

Issue: The hardcoded 4096 capacity per shard (= 64K total entries with 16 shards) has no justification. For context:

  • ~875KB bloom filter was worth optimizing
  • Trie nodes vary in size but average ~100-500 bytes
  • 64K entries × 300 bytes ≈ 19MB, which seems reasonable but arbitrary

Recommendation:

  • Add a comment explaining the sizing rationale (profiling data, memory budget, etc.)
  • Consider making this configurable via Store::new() parameters
  • Document memory implications in PR description

4. Receipt Encoding Inconsistency Risk

Location: crates/common/types/block.rs:354

pub fn compute_receipts_root_from_encoded(encoded_receipts: &[Vec<u8>]) -> H256 {
    let iter = encoded_receipts
        .iter()
        .enumerate()
        .map(|(idx, encoded)| (idx.encode_to_vec(), encoded.clone()));
    Trie::compute_hash_from_unsorted_iter(iter)
}

Concern: The contract between Receipt::encode_inner_with_bloom() and compute_receipts_root_from_encoded() is implicit. If someone later changes the encoding format, this could break silently.

The encoding must match: The code in crates/vm/backends/levm/mod.rs:96 calls:

encoded_receipts.push(receipt.encode_inner_with_bloom());

And encode_inner_with_bloom() produces the correct format (verified in receipt.rs:49-64), but there's no type safety ensuring this.

Recommendation:

  • Add a doc comment on compute_receipts_root_from_encoded() specifying it expects output from encode_inner_with_bloom()
  • Consider adding a test that verifies compute_receipts_root(receipts) == compute_receipts_root_from_encoded(encoded) for the same receipts

Minor Issues

5. Cargo.toml Missing lru Dependency Declaration

The PR adds lru::LruCache usage but I don't see it in the diff. Verify that crates/storage/Cargo.toml has:

lru = "0.x.y"

This might already exist, but it's not visible in the PR diff.


6. L2 Paths Leave encoded_receipts Empty

Locations:

  • crates/l2/sequencer/block_producer.rs:227
  • crates/l2/sequencer/l1_committer.rs:824
encoded_receipts: Vec::new(),

Issue: L2 code paths don't populate encoded_receipts, passing empty vectors. While this works (L2 might not validate receipts the same way), it's inconsistent and could cause issues if someone adds receipt validation to L2 later.

Recommendation: Add a comment explaining why L2 doesn't need pre-encoded receipts, or better yet, remove the field from BlockExecutionResult for L2 code paths by making it Option<Vec<Vec<u8>>>.


7. AtomicBloomFilter Thread Safety Assumption

Location: crates/storage/layering.rs:34 (comment)

AtomicBloomFilter::insert(&self) is already atomic, so sharing via Arc is safe.

Verification Needed: Confirm that fastbloom::AtomicBloomFilter actually guarantees thread-safe concurrent inserts. The name suggests it, but the implementation should be verified. If it uses internal Mutex/RwLock, this is fine. If it uses lock-free atomics, verify it handles concurrent inserts correctly (not just concurrent insert+check).

Action: Add a // SAFETY: comment with a reference to the fastbloom docs/source confirming thread safety.


8. Potential Clone Overhead in Cache Miss

Location: crates/storage/trie.rs:277

if let Some(ref value) = result {
    self.cache.insert(prefixed_key, value.clone());
}

Issue: On cache miss, the value is cloned to insert into the cache. For large trie nodes, this adds overhead. Not critical since it only happens on misses, but worth noting.

Consideration: Could use Arc<[u8]> in the cache to share ownership, but this adds complexity. Current approach is acceptable given the alternative complexity.


Positive Observations ✅

  1. Arc-wrapping bloom filter: Clean solution to eliminate deep copies. The manual Clone impl is well-implemented.

  2. Sharded cache design: 16 shards matching merkleizer worker count is excellent - shows understanding of the threading model.

  3. Receipt pre-encoding: Smart optimization - doing the work once during execution when the CPU has time is better than repeating it during validation.

  4. Type safety preserved: New validate_receipts_root_from_encoded runs alongside the old path, making rollback easy if issues arise.

  5. Consistent error handling: Cache failures (lock poisoning) are handled gracefully with ok()? and silent failures.


Testing Recommendations

  1. Add unit test for TrieNodeCache:

    • Verify multi-threaded access
    • Verify LRU eviction works correctly
    • Verify clear() actually clears all shards
  2. Add integration test verifying:

    • compute_receipts_root(receipts) == compute_receipts_root_from_encoded(encode_receipts(receipts))
    • For both legacy and typed receipts
  3. Benchmark validation: Measure actual performance gains on testnet/mainnet blocks with the optimizations disabled vs enabled.


Security Considerations

No critical security issues identified, but note:

  1. Cache poisoning: If the trie cache contains stale data after commits, it could cause incorrect state reads. The current clear() approach is conservative (safe but slow). The recommended selective invalidation must be carefully implemented.

  2. Receipt root validation: The optimization maintains correctness since the encoding format is identical between the two paths. However, future changes to receipt encoding need to update both paths.


Summary

Overall Assessment: Good performance optimizations with sound reasoning, but the trie cache invalidation strategy undermines its effectiveness. The write-lock-for-reads issue also reduces the parallelism benefits.

Estimated Impact After Fixes: Should see measurable reduction in RocksDB reads and faster block processing on multi-core systems. The receipt pre-encoding is a clear win with no downside.


Automated review by Claude (Anthropic) · custom prompt


github-actions bot commented Feb 9, 2026

🤖 Codex Code Review

Findings

  • Critical – snapshot consistency violation in shared trie node cache: CachingBackendTrieDBLocked::get consults the shared cache before reading from the locked snapshot. If the cache was filled by a newer (non-locked) reader, the locked snapshot can return newer trie nodes, breaking snapshot isolation and risking consensus-critical state mismatches. Consider isolating cache per snapshot or disabling cache for locked reads, or keying cache entries by snapshot version/root. crates/storage/trie.rs:305-316
  • High – cache can serve stale nodes after writes: CachingBackendTrieDB::put_batch forwards writes to DB but does not update or invalidate cached entries. Any read-after-write in the same process can return stale data. Either update cache entries for written keys or clear/invalidate the cache on write paths. crates/storage/trie.rs:283-285
  • Medium – L2 paths build BlockExecutionResult with empty encoded_receipts: If any validation path calls validate_receipts_root_from_encoded, empty encoded_receipts will cause a mismatch even when receipts are correct. Either populate encoded_receipts from receipts here or add a fallback to compute from receipts when encoded_receipts is empty. crates/l2/sequencer/block_producer.rs:222-228, crates/l2/sequencer/l1_committer.rs:819-825

If you want, I can dig into the call paths to confirm whether the L2 store_block flow currently validates receipts roots, and propose a targeted fix.


Automated review by OpenAI Codex · custom prompt


github-actions bot commented Feb 9, 2026

Lines of code report

Total lines added: 189
Total lines removed: 0
Total lines changed: 189

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs         | 1904  | +9   |
+------------------------------------------------+-------+------+
| ethrex/crates/common/types/block.rs            | 954   | +7   |
+------------------------------------------------+-------+------+
| ethrex/crates/common/validation.rs             | 192   | +11  |
+------------------------------------------------+-------+------+
| ethrex/crates/l2/sequencer/block_producer.rs   | 321   | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/l2/sequencer/l1_committer.rs     | 1386  | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 954   | +3   |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/layering.rs              | 203   | +9   |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                 | 2525  | +39  |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/trie.rs                  | 277   | +102 |
+------------------------------------------------+-------+------+
| ethrex/crates/vm/backends/levm/mod.rs          | 798   | +6   |
+------------------------------------------------+-------+------+
| ethrex/crates/vm/backends/mod.rs               | 188   | +1   |
+------------------------------------------------+-------+------+

… per block

The RCU pattern in apply_trie_updates clones the entire TrieLayerCache on every
block. The AtomicBloomFilter (~875KB for 1M items at 2% FPR) was the dominant
cost of this clone.

Wrapping bloom in Arc makes the clone a pointer copy. This is safe because
AtomicBloomFilter::insert/contains take &self (atomic operations), and the
filter is only replaced (not mutated in place) during rebuild_bloom() which
creates a fresh Arc.

The layers HashMap clone remains but is cheap (FxHashMap of Arc<TrieLayer>
pointers + metadata only).
…ss threads

Both the warmer thread and 16 merkleizer shards traverse the same trie nodes
from RocksDB. The TrieLayerCache only caches committed trie updates (diff
layers), not trie node reads.

This adds a 16-shard LRU cache (4096 entries per shard) at the BackendTrieDB
level, shared via Arc across all trie instances. When the warmer reads a trie
node, it populates the cache; when the merkleizer reads the same node moments
later, it's a cache hit instead of a full RocksDB read transaction.

Changes:
- Add TrieNodeCache (sharded LRU) and CachingBackendTrieDB/CachingBackendTrieDBLocked
  wrappers in trie.rs
- Add trie_node_cache field to Store, shared across all trie constructors
- Wrap BackendTrieDB with CachingBackendTrieDB in open_state_trie,
  open_storage_trie, and their locked variants
- Clear cache in apply_trie_updates after disk commit to invalidate stale entries
- Make BackendTrieDB::make_key, table_for_key, db and
  BackendTrieDBLocked::tx_for_key pub(crate) for the wrapper
…validation

validate_receipts_root calls compute_receipts_root which re-encodes every
receipt (bloom computation + RLP) serially after all txs complete. This work
can be done incrementally during per-tx execution, overlapping with
merkleization.

Changes:
- Add encoded_receipts field to BlockExecutionResult
- Pre-encode each receipt via encode_inner_with_bloom() immediately after
  each tx in both execute_block and execute_block_pipeline
- Add compute_receipts_root_from_encoded and
  validate_receipts_root_from_encoded that skip re-encoding
- Update all 3 call sites in blockchain.rs to use the pre-encoded variant
- Update L2 BlockExecutionResult constructors with empty encoded_receipts
…reads

The shared TrieNodeCache caches trie node reads across all trie instances,
but during snap sync the trie is mutated directly (account insertion, state
healing) without invalidating the cache. This caused request_storage_ranges
to read stale trie nodes after state healing, failing with "Could not find
account that should have been downloaded or healed".

Add Store::clear_trie_node_cache() and call it after account insertion,
after storage insertion, and before request_storage_ranges (after state
healing).
…ming

make_key takes Nibbles by value, so key must be cloned before passing to
inner.get() on cache miss.
@ilitteri ilitteri force-pushed the block-pipeline-optimizations branch from cf73c95 to 1d6f901 Compare February 23, 2026 20:48
@github-actions

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s]       | Min [s] | Max [s] | Relative    |
|---------|----------------|---------|---------|-------------|
| base    | 64.587 ± 0.126 | 64.403  | 64.786  | 1.00        |
| head    | 65.199 ± 0.217 | 64.891  | 65.598  | 1.01 ± 0.00 |
