
perf(l1): block pipeline optimizations #6163

Draft
ilitteri wants to merge 6 commits into main from block-pipeline-optimizations

Conversation


@ilitteri ilitteri commented Feb 9, 2026

Motivation

Profiling ethrex at steady state on Hoodi (after a 75% total CPU reduction across 3 prior optimization rounds) shows the remaining bottlenecks: the TrieLayerCache deep clone in the RCU pattern (~875KB bloom filter copied per block), redundant trie node reads from RocksDB by both the warmer and merkleizer threads, and receipt encoding done synchronously during validation.

Description

Three optimizations targeting the block pipeline:

  • Arc-wrap bloom filter in TrieLayerCache — eliminates ~875KB AtomicBloomFilter deep copy that happens every block in apply_trie_updates. Since AtomicBloomFilter::insert(&self) is already atomic, sharing via Arc is safe. The layers HashMap clone remains but is cheap (values are Arc<TrieLayer>).
  • Shared trie node read cache — 16-shard LRU cache (TrieNodeCache) at the BackendTrieDB level, shared across warmer, merkleizer, and execution threads via Arc. When the warmer reads a trie node ahead of execution, the merkleizer gets a cache hit instead of another RocksDB read transaction. Shard count matches the merkleizer's 16 worker threads.
  • Pre-encode receipts during execution — receipt bloom+RLP encoding (encode_inner_with_bloom) is done once per tx during execution and passed through BlockExecutionResult. Validation uses the pre-encoded bytes directly via validate_receipts_root_from_encoded, avoiding redundant re-encoding.

EXPB Benchmark Results

fast scenario

| Metric         | main      | block-pipeline-optimizations | Δ                 |
|----------------|-----------|------------------------------|-------------------|
| mgas/s avg     | 660.9     | 656.5                        | -0.7% (noise)     |
| mgas/s max     | 1659      | 1742                         | +5.0%             |
| latency avg    | 47.68 ms  | 45.30 ms                     | -5.0%             |
| latency median | 27.94 ms  | 25.73 ms                     | -7.9%             |
| latency p95    | 154.70 ms | 147.27 ms                    | -4.8%             |
| latency p99    | 194.19 ms | 273.75 ms                    | +41.0% (outlier)  |

gigablocks scenario

| Metric     | main (ID 233) | block-pipeline-optimizations | Δ      |
|------------|---------------|------------------------------|--------|
| mgas/s avg | 1546.9        | 1255.4                       | -18.8% |

Note: The gigablocks main baseline (ID 233) was run on a different date (older main commit) and may not be directly comparable. The individual optimization branches are still running and will provide more granular data.

Test plan

  • All crate tests pass (ethrex-storage, ethrex-common, ethrex-blockchain, ethrex-vm)
  • No clippy warnings
  • EXPB fast benchmark: -5% latency avg, -7.9% latency median, -4.8% latency p95
  • Individual optimization branch benchmarks still running

@ilitteri ilitteri added the L1 Ethereum client label Feb 9, 2026
@github-actions github-actions bot added the performance Block execution throughput and performance in general label Feb 9, 2026
@ethrex-project-sync ethrex-project-sync bot moved this to In Progress in ethrex_l1 Feb 9, 2026

github-actions bot commented Feb 9, 2026

🤖 Kimi Code Review

Review Summary

This PR introduces several performance optimizations focused on reducing redundant computation and memory copies. The changes look solid overall, with good attention to performance bottlenecks. Here are my findings:

Issues Found

  1. Potential Cache Coherency Issue (Critical)

    • File: crates/storage/store.rs, lines 2797-2802
    • Issue: The trie node cache is cleared after applying trie updates, but there's a race condition window where other threads might read stale data. The cache invalidation should happen atomically with the layer update.
  2. Missing Cache Invalidation (Medium)

    • File: crates/storage/trie.rs, CachingBackendTrieDB and CachingBackendTrieDBLocked
    • Issue: When trie nodes are written via put_batch, the cache isn't invalidated. This could lead to stale reads if the same key is read immediately after write.
  3. Inefficient Cache Key Construction (Low)

    • File: crates/storage/trie.rs, line 248 in CachingBackendTrieDB::get
    • Issue: The make_key method is called twice - once for cache lookup and once for DB lookup. Store it in a local variable to avoid double computation.
  4. Missing Error Handling (Low)

    • File: crates/storage/trie.rs, lines 190-193 in TrieNodeCache::new
    • Issue: The expect message is misleading - the capacity can be 1, not "always > 0". Consider using unwrap_or_else with a proper error message.

Positive Aspects

  1. Excellent Performance Focus: The PR correctly identifies and addresses real bottlenecks:

    • ~875KB bloom filter copy elimination via Arc
    • Redundant bloom computation avoidance
    • Shared trie node cache for cross-thread efficiency
  2. Good Thread Safety: The Arc wrapping of the bloom filter maintains thread safety while eliminating expensive copies.

  3. Clean API Design: The new validate_receipts_root_from_encoded function provides a clean interface for using pre-encoded receipts.

  4. Proper Cache Sharding: The 16-shard LRU cache design minimizes lock contention.

Suggested Fixes

  1. Fix cache invalidation race:
// In store.rs, make cache invalidation atomic with layer update
let mut trie_cache_guard = trie_cache.lock().map_err(|_| StoreError::LockError)?;
trie_node_cache.clear();
*trie_cache_guard = Arc::new(trie_mut);
  2. Add cache invalidation on writes:
// In CachingBackendTrieDB::put_batch, add:
for (key, _) in &key_values {
    let prefixed_key = self.inner.make_key(*key);
    self.cache.insert(prefixed_key, Vec::new()); // or remove
}
  3. Optimize key construction:
let prefixed_key = self.inner.make_key(key);
if let Some(value) = self.cache.get(&prefixed_key) { ... }

Overall, this is a well-executed performance optimization PR with minor issues that should be addressed before merging.


Automated review by Kimi (Moonshot AI) · custom prompt


github-actions bot commented Feb 9, 2026

🤖 Claude Code Review


Code Review: PR #6163 - Block Pipeline Performance Optimizations

Overview

This PR targets three performance bottlenecks in the block processing pipeline through targeted optimizations. The approach is solid and well-documented.


Critical Issues

1. Trie Node Cache Invalidation Too Aggressive ⚠️

Location: crates/storage/store.rs:2797

trie_node_cache.clear();

Issue: The cache is cleared after every apply_trie_updates call when bottom layers are committed. This defeats the purpose of the cache since:

  • The cache will be cold for every subsequent block's warmer/merkleizer threads
  • Only the removed bottom layer nodes are stale, not the entire cache
  • Most trie nodes are shared across blocks (especially upper trie levels)

Recommendation: Instead of clearing the entire cache:

  1. Track which keys are affected by the commit operation
  2. Only invalidate those specific keys, or
  3. Accept the small risk of stale entries (they'll naturally age out via LRU) since trie nodes are content-addressed (hash → data), making stale entries unlikely to cause correctness issues

Severity: High - This significantly reduces the effectiveness of the optimization.


2. Cache Uses Write Lock for Reads

Location: crates/storage/trie.rs:215-222

pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
    let idx = Self::shard_index(key);
    self.shards[idx]
        .write()  // ← Should be read()
        .ok()?
        .get(key)
        .cloned()
}

Issue: TrieNodeCache::get() acquires a write lock when it should use a read lock. LRU cache's get() updates internal metadata (moves item to front), but most LRU implementations offer a peek() method for read-only access.

Impact: Unnecessary lock contention - multiple threads trying to read from the same shard will serialize unnecessarily.

Fix: Either:

  • Use self.shards[idx].read() and call peek() instead of get(), or
  • Accept the write lock but document why (LRU metadata updates)

Check if the lru crate provides peek() - if so, switch to read lock + peek for better concurrency.


Major Issues

3. Missing Cache Size Tuning Guidance

Location: crates/storage/store.rs:1468

trie_node_cache: Arc::new(crate::trie::TrieNodeCache::new(4096)),

Issue: The hardcoded 4096 capacity per shard (= 64K total entries with 16 shards) has no justification. For context:

  • ~875KB bloom filter was worth optimizing
  • Trie nodes vary in size but average ~100-500 bytes
  • 64K entries × 300 bytes ≈ 19MB, which seems reasonable but arbitrary

Recommendation:

  • Add a comment explaining the sizing rationale (profiling data, memory budget, etc.)
  • Consider making this configurable via Store::new() parameters
  • Document memory implications in PR description

4. Receipt Encoding Inconsistency Risk

Location: crates/common/types/block.rs:354

pub fn compute_receipts_root_from_encoded(encoded_receipts: &[Vec<u8>]) -> H256 {
    let iter = encoded_receipts
        .iter()
        .enumerate()
        .map(|(idx, encoded)| (idx.encode_to_vec(), encoded.clone()));
    Trie::compute_hash_from_unsorted_iter(iter)
}

Concern: The contract between Receipt::encode_inner_with_bloom() and compute_receipts_root_from_encoded() is implicit. If someone later changes the encoding format, this could break silently.

The encoding must match: The code in crates/vm/backends/levm/mod.rs:96 calls:

encoded_receipts.push(receipt.encode_inner_with_bloom());

And encode_inner_with_bloom() produces the correct format (verified in receipt.rs:49-64), but there's no type safety ensuring this.

Recommendation:

  • Add a doc comment on compute_receipts_root_from_encoded() specifying it expects output from encode_inner_with_bloom()
  • Consider adding a test that verifies compute_receipts_root(receipts) == compute_receipts_root_from_encoded(encoded) for the same receipts

Minor Issues

5. Cargo.toml Missing lru Dependency Declaration

The PR adds lru::LruCache usage but I don't see it in the diff. Verify that crates/storage/Cargo.toml has:

lru = "0.x.y"

This might already exist, but it's not visible in the PR diff.


6. L2 Paths Leave encoded_receipts Empty

Locations:

  • crates/l2/sequencer/block_producer.rs:227
  • crates/l2/sequencer/l1_committer.rs:824
encoded_receipts: Vec::new(),

Issue: L2 code paths don't populate encoded_receipts, passing empty vectors. While this works (L2 might not validate receipts the same way), it's inconsistent and could cause issues if someone adds receipt validation to L2 later.

Recommendation: Add a comment explaining why L2 doesn't need pre-encoded receipts, or better yet, remove the field from BlockExecutionResult for L2 code paths by making it Option<Vec<Vec<u8>>>.


7. AtomicBloomFilter Thread Safety Assumption

Location: crates/storage/layering.rs:34 (comment)

AtomicBloomFilter::insert(&self) is already atomic, so sharing via Arc is safe.

Verification Needed: Confirm that fastbloom::AtomicBloomFilter actually guarantees thread-safe concurrent inserts. The name suggests it, but the implementation should be verified. If it uses internal Mutex/RwLock, this is fine. If it uses lock-free atomics, verify it handles concurrent inserts correctly (not just concurrent insert+check).

Action: Add a // SAFETY: comment with a reference to the fastbloom docs/source confirming thread safety.


8. Potential Clone Overhead in Cache Miss

Location: crates/storage/trie.rs:277

if let Some(ref value) = result {
    self.cache.insert(prefixed_key, value.clone());
}

Issue: On cache miss, the value is cloned to insert into the cache. For large trie nodes, this adds overhead. Not critical since it only happens on misses, but worth noting.

Consideration: Could use Arc<[u8]> in the cache to share ownership, but this adds complexity. Current approach is acceptable given the alternative complexity.


Positive Observations ✅

  1. Arc-wrapping bloom filter: Clean solution to eliminate deep copies. The manual Clone impl is well-implemented.

  2. Sharded cache design: 16 shards matching merkleizer worker count is excellent - shows understanding of the threading model.

  3. Receipt pre-encoding: Smart optimization - doing the work once during execution when the CPU has time is better than repeating it during validation.

  4. Type safety preserved: New validate_receipts_root_from_encoded runs alongside the old path, making rollback easy if issues arise.

  5. Consistent error handling: Cache failures (lock poisoning) are handled gracefully with ok()? and silent failures.


Testing Recommendations

  1. Add unit test for TrieNodeCache:

    • Verify multi-threaded access
    • Verify LRU eviction works correctly
    • Verify clear() actually clears all shards
  2. Add integration test verifying:

    • compute_receipts_root(receipts) == compute_receipts_root_from_encoded(encode_receipts(receipts))
    • For both legacy and typed receipts
  3. Benchmark validation: Measure actual performance gains on testnet/mainnet blocks with the optimizations disabled vs enabled.


Security Considerations

No critical security issues identified, but note:

  1. Cache poisoning: If the trie cache contains stale data after commits, it could cause incorrect state reads. The current clear() approach is conservative (safe but slow). The recommended selective invalidation must be carefully implemented.

  2. Receipt root validation: The optimization maintains correctness since the encoding format is identical between the two paths. However, future changes to receipt encoding need to update both paths.


Summary

Overall Assessment: Good performance optimizations with sound reasoning, but the trie cache invalidation strategy undermines its effectiveness. The write-lock-for-reads issue also reduces the parallelism benefits.

Estimated Impact After Fixes: Should see measurable reduction in RocksDB reads and faster block processing on multi-core systems. The receipt pre-encoding is a clear win with no downside.


Automated review by Claude (Anthropic) · custom prompt


github-actions bot commented Feb 9, 2026

🤖 Codex Code Review

Findings

  • Critical – snapshot consistency violation in shared trie node cache: CachingBackendTrieDBLocked::get consults the shared cache before reading from the locked snapshot. If the cache was filled by a newer (non-locked) reader, the locked snapshot can return newer trie nodes, breaking snapshot isolation and risking consensus-critical state mismatches. Consider isolating cache per snapshot or disabling cache for locked reads, or keying cache entries by snapshot version/root. crates/storage/trie.rs:305-316
  • High – cache can serve stale nodes after writes: CachingBackendTrieDB::put_batch forwards writes to DB but does not update or invalidate cached entries. Any read-after-write in the same process can return stale data. Either update cache entries for written keys or clear/invalidate the cache on write paths. crates/storage/trie.rs:283-285
  • Medium – L2 paths build BlockExecutionResult with empty encoded_receipts: If any validation path calls validate_receipts_root_from_encoded, empty encoded_receipts will cause a mismatch even when receipts are correct. Either populate encoded_receipts from receipts here or add a fallback to compute from receipts when encoded_receipts is empty. crates/l2/sequencer/block_producer.rs:222-228, crates/l2/sequencer/l1_committer.rs:819-825

If you want, I can dig into the call paths to confirm whether the L2 store_block flow currently validates receipts roots, and propose a targeted fix.


Automated review by OpenAI Codex · custom prompt


github-actions bot commented Feb 9, 2026

Lines of code report

Total lines added: 189
Total lines removed: 0
Total lines changed: 189

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs         | 1904  | +9   |
+------------------------------------------------+-------+------+
| ethrex/crates/common/types/block.rs            | 954   | +7   |
+------------------------------------------------+-------+------+
| ethrex/crates/common/validation.rs             | 192   | +11  |
+------------------------------------------------+-------+------+
| ethrex/crates/l2/sequencer/block_producer.rs   | 321   | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/l2/sequencer/l1_committer.rs     | 1386  | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 954   | +3   |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/layering.rs              | 203   | +9   |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                 | 2525  | +39  |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/trie.rs                  | 277   | +102 |
+------------------------------------------------+-------+------+
| ethrex/crates/vm/backends/levm/mod.rs          | 798   | +6   |
+------------------------------------------------+-------+------+
| ethrex/crates/vm/backends/mod.rs               | 188   | +1   |
+------------------------------------------------+-------+------+

… per block

The RCU pattern in apply_trie_updates clones the entire TrieLayerCache on every
block. The AtomicBloomFilter (~875KB for 1M items at 2% FPR) was the dominant
cost of this clone.

Wrapping bloom in Arc makes the clone a pointer copy. This is safe because
AtomicBloomFilter::insert/contains take &self (atomic operations), and the
filter is only replaced (not mutated in place) during rebuild_bloom() which
creates a fresh Arc.

The layers HashMap clone remains but is cheap (FxHashMap of Arc<TrieLayer>
pointers + metadata only).
…ss threads

Both the warmer thread and 16 merkleizer shards traverse the same trie nodes
from RocksDB. The TrieLayerCache only caches committed trie updates (diff
layers), not trie node reads.

This adds a 16-shard LRU cache (4096 entries per shard) at the BackendTrieDB
level, shared via Arc across all trie instances. When the warmer reads a trie
node, it populates the cache; when the merkleizer reads the same node moments
later, it's a cache hit instead of a full RocksDB read transaction.

Changes:
- Add TrieNodeCache (sharded LRU) and CachingBackendTrieDB/CachingBackendTrieDBLocked
  wrappers in trie.rs
- Add trie_node_cache field to Store, shared across all trie constructors
- Wrap BackendTrieDB with CachingBackendTrieDB in open_state_trie,
  open_storage_trie, and their locked variants
- Clear cache in apply_trie_updates after disk commit to invalidate stale entries
- Make BackendTrieDB::make_key, table_for_key, db and
  BackendTrieDBLocked::tx_for_key pub(crate) for the wrapper
…validation

validate_receipts_root calls compute_receipts_root which re-encodes every
receipt (bloom computation + RLP) serially after all txs complete. This work
can be done incrementally during per-tx execution, overlapping with
merkleization.

Changes:
- Add encoded_receipts field to BlockExecutionResult
- Pre-encode each receipt via encode_inner_with_bloom() immediately after
  each tx in both execute_block and execute_block_pipeline
- Add compute_receipts_root_from_encoded and
  validate_receipts_root_from_encoded that skip re-encoding
- Update all 3 call sites in blockchain.rs to use the pre-encoded variant
- Update L2 BlockExecutionResult constructors with empty encoded_receipts
…reads

The shared TrieNodeCache caches trie node reads across all trie instances,
but during snap sync the trie is mutated directly (account insertion, state
healing) without invalidating the cache. This caused request_storage_ranges
to read stale trie nodes after state healing, failing with "Could not find
account that should have been downloaded or healed".

Add Store::clear_trie_node_cache() and call it after account insertion,
after storage insertion, and before request_storage_ranges (after state
healing).
…ming

make_key takes Nibbles by value, so key must be cloned before passing to
inner.get() on cache miss.
@ilitteri ilitteri force-pushed the block-pipeline-optimizations branch from cf73c95 to 1d6f901 Compare February 23, 2026 20:48
@github-actions

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s]       | Min [s] | Max [s] | Relative    |
|---------|----------------|---------|---------|-------------|
| base    | 64.587 ± 0.126 | 64.403  | 64.786  | 1.00        |
| head    | 65.199 ± 0.217 | 64.891  | 65.598  | 1.01 ± 0.00 |
