
perf(l1): deduplicate trie nodes across layers before writing to RocksDB#6179

Closed
ilitteri wants to merge 3 commits into main from perf/dedup-trie-commit

Conversation


@ilitteri ilitteri commented Feb 11, 2026

Motivation

CPU profiling on the slow EXPB scenario (5000 mainnet blocks, ~30 min) shows InlineSkipList::Insert at 3.90% flat and MemTable::KeyComparator at 1.17% flat — both driven by WriteBatch size in TrieLayerCache::commit(). When 128 layers are committed, flat_map collects ALL trie node entries from ALL layers into a Vec with no deduplication. A node modified in multiple layers (e.g. the root node modified in all 128 layers) produces 128 duplicate WriteBatch entries. Only the latest value matters.

Description

Change TrieLayerCache::commit() to collect nodes into an FxHashMap instead of a Vec, iterating oldest-to-newest so newer values overwrite older ones. This eliminates duplicate entries before the WriteBatch is constructed, reducing InlineSkipList::Insert and MemTable::KeyComparator overhead proportionally.

No breaking changes

The return type remains Vec<(Vec<u8>, Vec<u8>)> — the HashMap is converted back to Vec before returning. The only behavioral change is fewer entries in the output (duplicates removed).
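The change described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: the real implementation lives inside `TrieLayerCache::commit()` and uses `FxHashMap` from the `rustc-hash` crate, while this sketch uses `std::collections::HashMap` and a hypothetical free function so it stays self-contained.

```rust
use std::collections::HashMap;

// Sketch of the dedup step (hypothetical function; the PR does this
// inline in TrieLayerCache::commit()). Layers are collected
// newest-to-oldest, so we iterate in reverse (oldest-to-newest) and let
// later inserts overwrite earlier ones: the newest value wins.
fn dedup_layers(layers_newest_first: Vec<Vec<(Vec<u8>, Vec<u8>)>>) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut nodes: HashMap<Vec<u8>, Vec<u8>> = HashMap::new();
    for layer in layers_newest_first.into_iter().rev() {
        // extend() overwrites existing keys, keeping the newest value
        nodes.extend(layer);
    }
    // Convert back to Vec to preserve the existing return type
    nodes.into_iter().collect()
}
```

A node key present in all 128 layers thus contributes exactly one entry to the returned `Vec`, carrying the value from the newest layer.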

Benchmark Results (EXPB on ethrex-office-3)

Compared against fresh main baselines (#63-65) run back-to-back on the same machine.

Throughput

| Scenario | Main (Mgas/s) | Dedup (Mgas/s) | Change |
|---|---|---|---|
| fast (200 mainnet blocks) | 420.48 | 425.27 | +1.1% |
| gigablocks (100 synthetic max-gas blocks) | 909.63 | 1002.74 | +10.2% |
| slow (5000 mainnet blocks) | 811.04 | 825.18 | +1.7% |

Latency — Gigablocks

| Metric | Main | Dedup | Change |
|---|---|---|---|
| avg | 1410 ms | 1240 ms | -12.1% |
| median | 1200 ms | 1090 ms | -9.2% |
| p95 | 2760 ms | 2490 ms | -9.8% |
| p99 | 4340 ms | 2650 ms | -38.9% |

Latency — Slow (5000 mainnet blocks)

| Metric | Main | Dedup | Change |
|---|---|---|---|
| avg | 29.48 ms | 28.67 ms | -2.7% |

CPU Profile Comparison — Gigablocks

| Function | Main (#64) | Dedup (#68) | Change |
|---|---|---|---|
| Wall time | 230.28s | 207.70s | -9.8% |
| Total CPU samples | 49833ms (21.64%) | 48568ms (23.38%) | -2.5% abs |
| CompactionJob::Run | 18174ms (36.47%) | 18517ms (38.13%) | ~same |
| VM::execute | 12370ms (24.82%) | 11556ms (23.79%) | -6.6% abs |
| rayon::join (merkle) | 10573ms (21.22%) | 10037ms (20.67%) | -5.1% abs |
| PosixFile::Read (disk I/O) | 5854ms (11.75%) | 5753ms (11.85%) | ~same |

The gigablocks improvement is distributed across functions rather than concentrated in the write path, suggesting reduced overall pressure from smaller WriteBatches.

CPU Profile Comparison — Slow

| Function | Main (#65) | Dedup (#69) | Change |
|---|---|---|---|
| Wall time | 1826.40s | 1829.55s | ~same |
| Total CPU samples | 127137ms (6.96%) | 127836ms (6.99%) | ~same |
| CompactionJob::Run | 76185ms (59.92%) | 76462ms (59.81%) | ~same |
| WriteBatch::Iterate | 7893ms (6.21%) | 7728ms (6.05%) | -2.1% abs |
| InsertInto | 7790ms (6.13%) | 7637ms (5.97%) | -2.0% abs |
| InlineSkipList::Insert | 7026ms (5.53%) | 6941ms (5.43%) | -1.2% abs |
| VM::execute | 13244ms (10.42%) | 13088ms (10.24%) | -1.2% abs |

The slow profile shows modest reductions in the WriteBatch/SkipList path (~2%), indicating the actual duplicate count across 128 layers is lower than the worst case (root node x128). Most leaf nodes are only touched once across the commit window.

Note on gigablocks variance: The gigablocks baselines show natural variance between runs (old baseline #58: 942.98 vs fresh #64: 909.63 = 3.5% difference). The +10.2% improvement should be interpreted with this variance in mind; the slow scenario (+1.7%) is more reliable given its longer duration.

How to Test

  1. Run EXPB benchmarks: fast, gigablocks, and slow scenarios with CPU profiling
  2. Compare WriteBatch sizes in CPU profile (InlineSkipList::Insert and WriteBatchInternal::InsertInto should decrease)

TrieLayerCache::commit() collects nodes from ~128 layers via flat_map
into a Vec, producing duplicate entries when the same trie path is
modified in multiple layers. Only the latest value matters, but all
copies end up in the WriteBatch — the root node alone could be written
128 times.

Collect into FxHashMap instead, iterating oldest-to-newest so newer
values overwrite older ones. This eliminates duplicates before the
WriteBatch is constructed, reducing InlineSkipList::Insert and
MemTable::KeyComparator overhead proportionally.
Copilot AI review requested due to automatic review settings February 11, 2026 15:55
@ilitteri ilitteri requested a review from a team as a code owner February 11, 2026 15:55
@ilitteri ilitteri added performance Block execution throughput and performance in general L1 Ethereum client labels Feb 11, 2026
@ethrex-project-sync ethrex-project-sync bot moved this to In Review in ethrex_l1 Feb 11, 2026
@github-actions

🤖 Kimi Code Review

Review Summary

The change is correct and valuable - it fixes a performance issue where duplicate trie nodes across layers would create unnecessary RocksDB write operations.

Specific Feedback

File: crates/storage/layering.rs

  • Line 185: Good use of FxHashMap for deduplication, but consider pre-allocating capacity to avoid rehashing during growth. Since the approximate size is known from layers_to_commit, the map could be built with with_capacity_and_hasher (FxHashMap uses a custom hasher, so the plain HashMap::with_capacity constructor is not available on it):
let mut nodes_to_commit = FxHashMap::with_capacity_and_hasher(
    layers_to_commit.iter().map(|l| l.nodes.len()).sum(),
    Default::default(),
);
  • Line 189: The .into_iter().collect() at the end is correct but creates an intermediate Vec<(Vec<u8>, Vec<u8>)> which could be memory-intensive for large state. Consider if the caller can work with the HashMap directly.

Security/Consensus Impact

No security issues - This is purely a performance optimization that doesn't affect consensus or state correctness. The deduplication preserves the "newest wins" semantics which is correct for layered state.

Performance

Significant improvement - Eliminates O(n×m) duplicate writes where n=number of layers and m=duplicate nodes. This is especially impactful for the root node which gets modified in every layer.

The change is ready to merge.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Codex Code Review

Findings (ordered by severity)

  • Potential nondeterminism in commit batch ordering: FxHashMap iteration makes the returned vector order arbitrary, whereas previously the order was deterministic by layer traversal. If any downstream behavior or tests rely on stable ordering (e.g., for reproducible snapshots or debug logs), this could introduce flaky outcomes. Consider preserving order while deduping (e.g., track FxHashSet of seen keys while pushing into a Vec, or sort keys before returning). crates/storage/layering.rs:181-189
  • Peak memory overhead: building a full FxHashMap and then materializing a Vec duplicates storage for large commits. If commit batches are large, this could increase peak memory. Consider precomputing total entries for with_capacity, or dedupe while building the output Vec to avoid the extra full map. crates/storage/layering.rs:181-189

If ordering is not relied upon anywhere, the change looks good and is a solid performance win by avoiding duplicate writes.
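The order-preserving alternative suggested above could be sketched like this. This is a hypothetical illustration using std collections (the review suggests FxHashSet; the function name and layer representation are assumptions, not the PR's code). Iterating newest-to-oldest and skipping keys already seen keeps the newest value per key while producing a deterministic, traversal-based order:

```rust
use std::collections::HashSet;

// Order-preserving dedup: first occurrence (newest layer) wins, and
// output order follows the newest-to-oldest layer traversal.
fn dedup_preserving_order(
    layers_newest_first: Vec<Vec<(Vec<u8>, Vec<u8>)>>,
) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut out = Vec::new();
    for layer in layers_newest_first {
        for (key, value) in layer {
            // insert() returns true only the first time a key is seen
            if seen.insert(key.clone()) {
                out.push((key, value));
            }
        }
    }
    out
}
```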


Automated review by OpenAI Codex · custom prompt

@github-actions

🤖 Claude Code Review


Code Review: PR #6179 - Deduplicate trie nodes across layers before writing to RocksDB

Summary

This PR optimizes the TrieLayerCache::commit() method by deduplicating trie nodes across layers before writing to RocksDB. The change replaces a flat_map collection with a HashMap-based approach to eliminate duplicate entries.


✅ Positive Aspects

  1. Clear performance motivation: Well-documented problem with profiling data showing InlineSkipList::Insert at 3.90% and targeting hot paths (root nodes modified in all 128 layers)

  2. Simple, focused change: The optimization is minimal and surgical - only affects the deduplication logic

  3. Benchmark results provided: Shows meaningful improvements (10.2% throughput increase in gigablocks scenario)

  4. Excellent code comments: The inline comment clearly explains the deduplication strategy


🔍 Correctness Analysis

✅ Correctness: VERIFIED

The change is semantically correct:

  1. Iteration order: layers_to_commit.into_iter().rev() iterates oldest→newest (since layers_to_commit was built newest→oldest in lines 172-176). This ensures newer values overwrite older ones via HashMap::extend().

  2. HashMap extend behavior: FxHashMap::extend() correctly overwrites existing keys with new values, which is the desired behavior for keeping the latest version of each node.

  3. Return type preserved: The collect() converts back to Vec<(Vec<u8>, Vec<u8>)>, maintaining API compatibility.


🔒 Security & Consensus Implications

✅ No security concerns identified

  • The deduplication is purely a performance optimization
  • The final state is identical: only the latest value for each trie node key is preserved
  • No changes to cryptographic operations, state root calculations, or consensus logic
  • The trie node keys and values remain unchanged

⚡ Performance Considerations

✅ Performance improvement validated

  1. WriteBatch size reduction: 20-50% reduction is realistic for hot paths modified across many layers
  2. HashMap overhead: The temporary HashMap allocation is negligible compared to the savings from reduced RocksDB operations
  3. FxHashMap choice: Appropriate for non-cryptographic hashing of Vec keys

Minor consideration: The collect() at the end creates a Vec with arbitrary ordering. If deterministic ordering matters for debugging or reproducibility, consider using BTreeMap or sorting. However, since RocksDB WriteBatch order doesn't affect final state, this is acceptable.


🦀 Rust Best Practices

✅ Idiomatic Rust

  1. Good use of FxHashMap::default(): Appropriate for non-crypto hashing
  2. Correct ownership: into_iter().rev() properly consumes the Vec
  3. Proper use of extend(): Leverages the standard library's optimized bulk insertion

Minor style note: Consider adding a capacity hint for the HashMap:

let mut nodes_to_commit = FxHashMap::with_capacity_and_hasher(
    layers_to_commit.iter().map(|l| l.nodes.len()).sum(),
    Default::default()
);

This would avoid rehashing during growth, though the impact may be minimal since extend() already reserves capacity.


🧪 Testing

⚠️ Observation: No unit tests exist for the layering.rs module (searched but found none).

Recommendation: While the benchmarks validate the performance improvement, consider adding a unit test specifically for the deduplication behavior:

  • Test that a key modified in multiple layers only appears once in the output
  • Verify that the newest value is preserved
  • Test edge cases (single layer, all layers modify same key, no overlapping keys)

This would prevent regression if the code is refactored.
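The suggested tests could be sketched as below. The `dedup` helper is a hypothetical stand-in for the logic inside `TrieLayerCache::commit()` (the real test would exercise the actual method in layering.rs), and std's HashMap replaces FxHashMap so the sketch is self-contained:

```rust
use std::collections::HashMap;

// Stand-in for the dedup step in TrieLayerCache::commit()
// (hypothetical free function extracted here for testability).
fn dedup(layers_newest_first: Vec<Vec<(Vec<u8>, Vec<u8>)>>) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut nodes = HashMap::new();
    for layer in layers_newest_first.into_iter().rev() {
        nodes.extend(layer); // newer layers overwrite older entries
    }
    nodes.into_iter().collect()
}

// In the real suite these would be #[test] functions in layering.rs.
fn key_modified_in_multiple_layers_appears_once() {
    let layers = vec![
        vec![(vec![1], b"newest".to_vec())], // newest layer
        vec![(vec![1], b"oldest".to_vec())], // oldest layer
    ];
    // Single output entry, carrying the newest value
    assert_eq!(dedup(layers), vec![(vec![1], b"newest".to_vec())]);
}

fn non_overlapping_keys_are_all_kept() {
    let layers = vec![vec![(vec![1], vec![10])], vec![(vec![2], vec![20])]];
    assert_eq!(dedup(layers).len(), 2);
}
```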


📝 Minor Suggestions

  1. crates/storage/layering.rs:185-188 - Consider pre-allocating the HashMap capacity:

    let total_capacity: usize = layers_to_commit.iter().map(|l| l.nodes.len()).sum();
    let mut nodes_to_commit = FxHashMap::with_capacity_and_hasher(total_capacity, Default::default());
  2. Documentation: The inline comment is excellent, but consider adding a doc comment to the commit method itself explaining the deduplication behavior.


🎯 Verdict

APPROVED

This is a well-crafted performance optimization with:

  • Clear motivation backed by profiling data
  • Correct implementation that preserves semantics
  • Meaningful benchmark results showing 10%+ throughput improvement
  • No security or correctness concerns
  • Good code quality and documentation

The change is low-risk and high-value. The minor suggestions about capacity pre-allocation and testing are nice-to-haves but not blockers.


Automated review by Claude (Anthropic) · custom prompt

@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

Optimizes TrieLayerCache::commit() by deduplicating trie nodes across layers before writing to RocksDB. Previously, nodes modified in multiple layers (e.g., root node modified in all 128 layers) produced duplicate WriteBatch entries. The change collects nodes into an FxHashMap instead of a Vec, iterating oldest-to-newest so newer values overwrite older ones, then converts back to Vec for the return value. Benchmark results show impressive gains: +10.2% throughput on gigablocks and -12.1% latency. The implementation correctly reverses layers_to_commit (which is built newest-to-oldest via parent chain traversal) to ensure proper overwrite semantics.

Confidence Score: 5/5

Important Files Changed

| Filename | Overview |
|---|---|
| crates/storage/layering.rs | Optimizes commit() to deduplicate trie nodes across layers using HashMap before returning, reducing WriteBatch size by 20-50% |
| CHANGELOG.md | Added changelog entry for trie node deduplication optimization, but the PR number needs correction (#6177 → #6179) |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant TrieLayerCache
    participant HashMap as FxHashMap
    participant RocksDB
    
    Caller->>TrieLayerCache: commit(state_root)
    TrieLayerCache->>TrieLayerCache: Remove layers following parent chain
    Note over TrieLayerCache: layers_to_commit built newest→oldest
    TrieLayerCache->>HashMap: Create empty FxHashMap
    loop For each layer (reversed: oldest→newest)
        TrieLayerCache->>HashMap: extend(layer.nodes)
        Note over HashMap: Newer values overwrite older duplicates
    end
    TrieLayerCache->>TrieLayerCache: Convert HashMap to Vec
    TrieLayerCache->>Caller: Return Vec<(key, value)>
    Caller->>RocksDB: WriteBatch with deduplicated nodes
    Note over RocksDB: 20-50% fewer entries


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 1 comment


@github-actions
Copy link

Lines of code report

Total lines added: 0
Total lines removed: 1
Total lines changed: 1

Detailed view
+-----------------------------------+-------+------+
| File                              | Lines | Diff |
+-----------------------------------+-------+------+
| ethrex/crates/storage/layering.rs | 171   | -1   |
+-----------------------------------+-------+------+


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions

github-actions bot commented Feb 11, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 63.564 ± 0.357 | 62.755 | 64.022 | 1.01 ± 0.01 |
| head | 63.015 ± 0.207 | 62.660 | 63.289 | 1.00 |

@ilitteri ilitteri closed this Feb 23, 2026
@github-project-automation github-project-automation bot moved this from In Review to Done in ethrex_l1 Feb 23, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in ethrex_performance Feb 23, 2026

Labels

L1 Ethereum client performance Block execution throughput and performance in general

Projects

Status: Done


5 participants