
perf(l1): reduce trie/FKV memtable size and raise L0 compaction trigger#6183

Open
ilitteri wants to merge 2 commits into main from perf/rocksdb-memtable-compaction-tuning

Conversation


@ilitteri ilitteri commented Feb 11, 2026

Motivation

Two complementary RocksDB bottlenecks identified from CPU profiles:

  1. Deep SkipList trees (Rank 5): Trie and FKV column families have 512MB write buffers x 6 = 3GB per CF. Deep SkipList trees make InlineSkipList::Insert (4.02% flat in slow) and MemTable::KeyComparator (1.20% flat) expensive. This grows superlinearly with memtable depth.

  2. Aggressive L0 compaction (Rank 6): Global L0 compaction trigger is 2 (extremely aggressive), causing many small compaction events. Compaction dominates CPU: 93% fast, 37% gigablocks, 60% slow.

Description

Rank 5 — Reduce memtable size:

  • ACCOUNT_TRIE_NODES / STORAGE_TRIE_NODES: write_buffer_size 512MB → 128MB, max_write_buffer_number 6 → 4
  • ACCOUNT_FLATKEYVALUE / STORAGE_FLATKEYVALUE: same reduction (identical write pattern)
  • Shallower SkipList at any given time → cheaper inserts and key comparisons

Rank 6 — Raise L0 compaction trigger:

  • Global: compaction trigger 2 → 8, slowdown 10 → 20, stop 16 → 36
  • Per-CF: compaction trigger 4 → 8 (slowdown/stop already at 20/36)
  • Batches more L0 files per compaction event, reducing total compaction work
  • Higher triggers accommodate the more frequent flushes from smaller memtables
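
In the rust-rocksdb API used by the backend, the combined per-CF tuning amounts to roughly the following (a sketch of the intent; the exact placement and structure in crates/storage/backend/rocksdb.rs may differ):

```rust
use rocksdb::Options;

const MB: usize = 1024 * 1024;

// Sketch: tuned options for the trie/FKV column families.
fn tuned_trie_cf_options() -> Options {
    let mut cf_opts = Options::default();
    // Rank 5: shallower memtables (512MB -> 128MB, 6 -> 4 buffers)
    cf_opts.set_write_buffer_size(128 * MB);
    cf_opts.set_max_write_buffer_number(4);
    // Rank 6: batch more L0 files per compaction (trigger 2/4 -> 8)
    cf_opts.set_level_zero_file_num_compaction_trigger(8);
    cf_opts.set_level_zero_slowdown_writes_trigger(20);
    cf_opts.set_level_zero_stop_writes_trigger(36);
    cf_opts
}
```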

Benchmark Results

All benchmarks run on ethrex-office-3 with CPU profiling enabled. Baseline: main at commit 1a47fb18.

Throughput (Mgas/s)

| Scenario | Baseline (main) | Tuned | Change |
|---|---|---|---|
| Fast (200 blocks) | 420.48 | 431.43 | +2.6% |
| Gigablocks (100 synthetic) | 909.63 | 1033.63 | +13.6% |
| Slow (5000 blocks) | 811.04 | 843.31 | +4.0% |

Latency

| Scenario | Baseline avg / p95 | Tuned avg / p95 |
|---|---|---|
| Fast | 67.81ms / 148.88ms | 71.65ms / 173.43ms |
| Gigablocks | 1410ms / 2760ms | 1180ms / 2100ms |
| Slow | 29.48ms / 76.03ms | 27.53ms / 73.65ms |

CPU Profile Comparison — Fast (runs #63 vs #76)

| Function | Baseline | Tuned | Change |
|---|---|---|---|
| Total samples | 38,678ms | 37,917ms | -2.0% |
| CompactionJob::Run | 35,870ms cum | 34,983ms cum | -2.5% |
| VM::execute | 691ms cum | 715ms cum | +3.5% |
| rayon::join | 599ms cum | 590ms cum | -1.5% |
| PosixFile::Read | 4,321ms flat | 4,377ms flat | +1.3% |
| RocksDBReadTx::get | 477ms cum | 501ms cum | +5.0% |
| BlockBasedTable::Get | 397ms cum | 415ms cum | +4.5% |

CPU Profile Comparison — Gigablocks (runs #64 vs #77)

| Function | Baseline | Tuned | Change |
|---|---|---|---|
| Total samples | 49,833ms | 48,119ms | -3.4% |
| CompactionJob::Run | 18,174ms cum | 18,461ms cum | +1.6% |
| VM::execute | 12,370ms cum | 11,090ms cum | -10.3% |
| rayon::join | 10,573ms cum | 9,569ms cum | -9.5% |
| PosixFile::Read | 5,854ms flat | 5,763ms flat | -1.6% |
| RocksDBReadTx::get | 6,297ms cum | 5,769ms cum | -8.4% |
| BlockBasedTable::Get | 5,327ms cum | 4,940ms cum | -7.3% |
| apply_trie_updates | 318ms cum | 299ms cum | -6.0% |

CPU Profile Comparison — Slow (runs #65 vs #78)

| Function | Baseline | Tuned | Change |
|---|---|---|---|
| Total samples | 127,137ms | 125,587ms | -1.2% |
| CompactionJob::Run | 76,185ms cum | 75,729ms cum | -0.6% |
| VM::execute | 13,244ms cum | 12,846ms cum | -3.0% |
| rayon::join | 15,560ms cum | 15,383ms cum | -1.1% |
| PosixFile::Read | 11,178ms flat | 11,454ms flat | +2.5% |
| InlineSkipList::Insert | 5,105ms flat | 3,936ms flat | -22.9% |
| MemTable::KeyComparator | 1,525ms flat | 1,436ms flat | -5.8% |
| RecomputeSpliceLevels | 1,755ms cum | 1,607ms cum | -8.4% |
| apply_trie_updates | 10,368ms cum | 9,089ms cum | -12.3% |
| MemTableInserter::PutCFImpl | 7,570ms cum | 6,216ms cum | -17.9% |
| WriteBatchInternal::Put | 634ms flat | 683ms flat | +7.7% |
| BlockBasedTable::Get | 5,215ms cum | 5,275ms cum | +1.2% |

Analysis

The optimization targets are clearly confirmed in the slow profile:

  • InlineSkipList::Insert dropped 22.9% (5.1s → 3.9s) — shallower memtables mean fewer skiplist levels to traverse during insertion
  • MemTableInserter::PutCFImpl dropped 17.9% (7.6s → 6.2s) — the entire memtable insertion path benefits
  • apply_trie_updates dropped 12.3% (10.4s → 9.1s) — cumulative savings from cheaper insertions

The gigablocks scenario shows +13.6% throughput with broad improvements across EVM (-10%), rayon (-9.5%), and read path (-8.4%). This suggests the higher L0 trigger reduces contention between compaction and foreground work.

Fast shows modest +2.6% as expected (short runs don't build deep enough memtables for Rank 5 to matter much).
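
The before/after memtable budget and the new trigger ordering can be sanity-checked with a tiny standalone program (plain std Rust; all numbers are taken from the description above):

```rust
fn main() {
    const MB: u64 = 1024 * 1024;
    // Before: 512MB write buffers x 6 memtables per trie/FKV column family
    let before = 512 * MB * 6;
    // After: 128MB write buffers x 4 memtables per column family
    let after = 128 * MB * 4;
    assert_eq!(before, 3 * 1024 * MB); // 3GB per CF
    assert_eq!(after, 512 * MB); // 512MB per CF
    // L0 triggers must stay ordered: compaction < slowdown < stop
    let (trigger, slowdown, stop) = (8, 20, 36);
    assert!(trigger < slowdown && slowdown < stop);
    println!("memtable budget per CF: {}MB -> {}MB", before / MB, after / MB);
}
```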

How to Test

Run the EXPB benchmark suite:

```shell
# Fast scenario (200 mainnet blocks)
# Gigablocks scenario (100 synthetic max-gas blocks)
# Slow scenario (5000 mainnet blocks)
```

Two complementary RocksDB tuning changes:

1. Reduce write_buffer_size for trie and FKV column families from 512MB to
   128MB, and max_write_buffer_number from 6 to 4. This keeps SkipList trees
   shallower, reducing InlineSkipList::Insert and MemTable::KeyComparator
   overhead that grows superlinearly with memtable depth.

2. Raise L0 compaction trigger from 2/4 to 8 (global and per-CF), slowdown
   from 10 to 20, stop from 16 to 36. This batches more L0 files per
   compaction event, reducing total compaction work. Higher triggers are
   needed to accommodate the more frequent flushes from smaller memtables.
Copilot AI review requested due to automatic review settings February 11, 2026 18:25
@ilitteri ilitteri requested a review from a team as a code owner February 11, 2026 18:25
@ilitteri ilitteri added performance Block execution throughput and performance in general L1 Ethereum client labels Feb 11, 2026
@ethrex-project-sync ethrex-project-sync bot moved this to In Review in ethrex_l1 Feb 11, 2026
@github-actions

🤖 Kimi Code Review

Review Summary

This PR adjusts RocksDB configuration parameters to optimize memory usage and reduce compaction overhead for trie and flat key-value storage. The changes appear reasonable for performance tuning, but I have a few observations:

Issues Found

  1. Potential Memory Pressure Risk (lines 121-122, 134-135):

    • Reducing write_buffer_size from 512MB to 128MB for trie/flatkeyvalue CFs while lowering max_write_buffer_number to 4 means total memtable memory per CF can still reach 512MB (128MB × 4)
    • However, the reduction in max_write_buffer_number from 6 to 4 actually reduces maximum memtable memory from 3GB to 512MB, which is a significant reduction
    • Concern: This aggressive memory reduction could potentially impact write performance under heavy load
  2. Inconsistent Compaction Trigger Values (lines 41, 91):

    • Global/default compaction trigger is set to 8 (line 41)
    • But specific column families (line 91) also set it to 8
    • This creates redundancy - the global setting already covers all CFs unless overridden
  3. Missing Validation for Extreme Values:

    • The new level_zero_stop_writes_trigger of 36 (lines 42, 93) seems high compared to the slowdown trigger of 20
    • No validation that these ratios make sense (typically stop should be ~2x slowdown, which 36/20=1.8 satisfies)

Recommendations

  1. Add configuration validation to ensure the ratios between L0 triggers are reasonable:

     ```rust
     // Consider adding validation like:
     assert!(stop_writes_trigger > slowdown_writes_trigger);
     assert!(slowdown_writes_trigger > compaction_trigger);
     ```
  2. Document the rationale for these specific values in comments, especially the 4x reduction in write buffer size

  3. Consider making these tunable via configuration file rather than hardcoded, as optimal values may vary by deployment
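
A minimal std-only sketch of the third suggestion, using hypothetical environment variables to override the hardcoded defaults (the variable names are illustrative only; ethrex does not define them):

```rust
use std::env;

// Hypothetical override helper: parse an env var, fall back to the tuned default.
fn env_or_default(var: &str, default: i32) -> i32 {
    env::var(var)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // Illustrative names, not part of ethrex configuration.
    let write_buffer_mb = env_or_default("ETHREX_ROCKSDB_WRITE_BUFFER_MB", 128);
    let l0_trigger = env_or_default("ETHREX_ROCKSDB_L0_COMPACTION_TRIGGER", 8);
    println!("write_buffer={}MB l0_trigger={}", write_buffer_mb, l0_trigger);
}
```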

The changes appear safe from a consensus perspective since they only affect performance characteristics, not data correctness. The memory reduction is substantial and should help with resource usage, though it may need monitoring in production.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Codex Code Review

Review

  • Potential performance regression risk: raising L0 compaction trigger while shrinking memtables can increase L0 file count/compaction lag and read amplification under heavy write load. Consider validating with workload traces or adding a brief rationale in code/comments, especially for trie/flat KV CFs. crates/storage/backend/rocksdb.rs:41-52, crates/storage/backend/rocksdb.rs:91-94, crates/storage/backend/rocksdb.rs:121-137

No correctness, security, or Ethereum‑specific logic issues spotted in this diff; it’s configuration-only.

If you want, I can suggest a small comment or add a lightweight benchmark note.

Next steps (optional):

  1. Run a targeted RocksDB perf regression benchmark for trie-heavy workloads.

Automated review by OpenAI Codex · custom prompt

@github-actions

🤖 Claude Code Review



PR Review: Reduce trie/FKV memtable size and raise L0 compaction trigger

Summary

This PR implements two complementary RocksDB performance optimizations based on profiling data: reducing memtable size to decrease SkipList depth, and raising L0 compaction triggers to reduce compaction overhead. The changes are well-motivated and the implementation is clean.

✅ Strengths

  1. Evidence-based optimization: Changes are driven by actual CPU profiling data (Rank 5 and Rank 6 bottlenecks)
  2. Complementary fixes: The smaller memtable size (causing more frequent flushes) is balanced by higher L0 compaction triggers
  3. Focused scope: Only modifies performance-critical column families (trie and FKV) rather than global changes
  4. No consensus impact: Performance tuning doesn't affect correctness, state transitions, or EIP compliance

🔍 Detailed Review

crates/storage/backend/rocksdb.rs

Lines 41-43: Global L0 compaction triggers (2→8, 10→20, 16→36)

Status: ✅ Good

  • The 4x increase in compaction trigger is well-justified by the profiling data
  • Ratios are reasonable: compaction(8) → slowdown(20=2.5x) → stop(36=1.8x)
  • Aligns with RocksDB best practices for write-heavy workloads

Lines 91-93: Per-CF L0 compaction triggers

Status: ⚠️ Minor redundancy

  • Line 91 sets compaction_trigger=8 for all CFs, matching the global setting
  • This creates redundancy but no functional issue
  • Suggestion: Consider removing line 91 since it duplicates the global setting (line 41), unless you plan to differentiate CFs in future tuning
```rust
// Line 91 could be removed if all CFs should use global defaults
// cf_opts.set_level_zero_file_num_compaction_trigger(8); // Already set globally
```

Lines 122-123 & 134-135: Trie/FKV memtable reduction (512MB→128MB, 6→4 buffers)

Status: ✅ Good, with monitoring needed

Positive aspects:

  • Reduces total memtable memory from 3GB to 512MB per CF (83% reduction!)
  • Shallower SkipList trees will reduce InlineSkipList::Insert and MemTable::KeyComparator overhead
  • The min_write_buffer_number_to_merge=2 setting is unchanged and appropriate

Considerations:

  • More frequent flushes (every 128MB instead of 512MB) → more L0 files created
  • This is exactly why L0 compaction trigger was raised (good design!)
  • Under extreme write pressure, the 4 buffers × 128MB = 512MB limit could cause slowdowns
  • Recommendation: Monitor write stall metrics in production benchmarks

Math check:

```
Before: 512MB × 6 buffers = 3GB max memtable memory
After:  128MB × 4 buffers = 512MB max memtable memory
Reduction: 83% less memory usage ✓
```

CHANGELOG.md:5-6

Status: ✅ Good

  • Properly documented in the performance section
  • Includes PR reference for traceability

🔒 Security & Correctness

  • No security impact: RocksDB tuning parameters don't affect cryptographic operations, state validation, or consensus rules
  • No data correctness risk: These parameters only affect performance characteristics, not data integrity
  • Crash recovery: WAL settings unchanged, recovery behavior unaffected
  • Memory safety: Rust's type system + RocksDB bindings prevent buffer overflows

⚡ Performance Considerations

Expected improvements:

  1. ✅ Reduced CPU time in SkipList operations (shallower trees)
  2. ✅ Fewer compaction events (batched L0→L1 merges)
  3. ✅ Lower memory footprint (83% reduction in memtable memory)

Potential risks:

  1. ⚠️ More frequent flushes could increase disk I/O (small writes)
  2. ⚠️ Higher L0 file counts before compaction might slow reads temporarily
  3. ⚠️ Write stalls possible if 4×128MB fills faster than flush rate

Mitigation: These are balanced by the complementary changes (higher L0 triggers accommodate more flushes)

📊 Testing Recommendations

Since benchmarks are still running (per PR description), monitor:

  1. Write throughput: Ensure smaller memtables don't bottleneck under sustained writes
  2. Read latency: Check if higher L0 file counts affect lookup performance
  3. Compaction stats: Verify that compaction CPU usage actually decreases
  4. Memory usage: Confirm the expected 83% memtable reduction
  5. Write stall frequency: Watch rocksdb.level0.slowdown and rocksdb.level0.num_files

🎯 Recommendations

High Priority

None - the changes are sound as-is.

Low Priority (Nice to have)

  1. Remove redundant setting at line 91 (duplicates global setting at line 41)

  2. Add inline comments explaining the rationale:

     ```rust
     // Reduced from 512MB to limit SkipList depth and insertion overhead
     cf_opts.set_write_buffer_size(128 * 1024 * 1024); // 128MB
     // Reduced from 6 to balance memory usage with flush rate
     cf_opts.set_max_write_buffer_number(4);
     ```
  3. Consider making these tunable via config file for easier A/B testing in production

✅ Final Verdict

LGTM with minor suggestions. The changes are:

  • Well-motivated by profiling data
  • Correctly implemented
  • Appropriately scoped to affected column families
  • Free from security/correctness concerns
  • Worth merging pending benchmark validation

The complementary nature of the two optimizations (smaller memtables + higher L0 triggers) shows good systems thinking. The 83% memory reduction is substantial and should help with resource-constrained deployments.

Approval status: ✅ Approved (pending successful benchmark results)


Automated review by Claude (Anthropic) · custom prompt

@ilitteri ilitteri changed the title perf(l1): Reduce trie/FKV memtable size and raise L0 compaction trigger perf(l1): reduce trie/FKV memtable size and raise L0 compaction trigger Feb 11, 2026

Copilot AI left a comment


Pull request overview

Updates RocksDB tuning parameters to reduce write-path CPU overhead in the trie/flat-kv column families and reduce compaction churn by batching more L0 files per compaction.

Changes:

  • Increase L0 compaction triggers (trigger/slowdown/stop) to reduce frequency of small L0 compactions.
  • Reduce memtable (write_buffer_size) and max_write_buffer_number for trie + flat-kv CFs to keep SkipList depth smaller.
  • Add a Perf changelog entry for the tuning change.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| crates/storage/backend/rocksdb.rs | Adjusts RocksDB L0 compaction thresholds and trie/FKV memtable sizing parameters. |
| CHANGELOG.md | Documents the performance tuning change. |


Comment on lines +41 to +43
```rust
opts.set_level_zero_file_num_compaction_trigger(8);
opts.set_level_zero_slowdown_writes_trigger(20);
opts.set_level_zero_stop_writes_trigger(36);
```

Copilot AI Feb 11, 2026


The L0 compaction triggers are set both on opts and again on every cf_opts in the CF descriptor loop. Since cf_opts already applies these values for all column families, this duplication can drift over time and makes it harder to tell which settings are intended to be authoritative. Consider defining shared constants and applying them only once (preferably on cf_opts, or remove the opts.set_level_zero_* calls if they’re not needed).

Suggested change
```rust
opts.set_level_zero_file_num_compaction_trigger(8);
opts.set_level_zero_slowdown_writes_trigger(20);
opts.set_level_zero_stop_writes_trigger(36);
```

@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR tunes RocksDB configuration for the L1 storage backend by (1) reducing write buffer sizes / max memtables for trie and flat key-value column families to reduce memtable skiplist depth and insert overhead, and (2) increasing L0 compaction triggers (global and per-CF) to batch more L0 files per compaction and reduce compaction churn.

Changes are localized to the RocksDB backend’s open() path where global Options and per-column-family Options are constructed before opening the DB, plus a changelog entry documenting the perf motivation.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • Changes are limited to RocksDB tuning knobs (memtable sizes and compaction triggers) and a changelog entry; no API/logic changes were introduced, and the new values are consistent across global and per-CF options.
  • crates/storage/backend/rocksdb.rs (monitor for performance/regression under heavy write workloads)

Important Files Changed

| Filename | Overview |
|---|---|
| CHANGELOG.md | Adds a perf changelog entry for 2026-02-11 referencing #6183; no functional impact. |
| crates/storage/backend/rocksdb.rs | Adjusts RocksDB Options/CF options: reduces write buffers for trie/FKV and raises L0 compaction triggers. No correctness issues found, but the behavior change raises write stall thresholds and increases flush frequency. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Ethrex
    participant Backend as RocksDBBackend::open
    participant Rocks as RocksDB

    App->>Backend: open(path)
    Backend->>Backend: Build global Options
    Backend->>Rocks: list_cf(opts, path)
    Rocks-->>Backend: existing CF names
    Backend->>Backend: Union(existing CFs, TABLES)
    loop For each CF
        Backend->>Backend: Build CF Options
        Backend->>Backend: Apply per-CF memtable + L0 trigger tuning
    end
    Backend->>Rocks: open_cf_descriptors(opts, cf_descriptors)
    Rocks-->>Backend: DB handle
    Backend->>Rocks: drop_cf(obsolete CFs)
    Backend-->>App: RocksDBBackend
```

@github-actions

github-actions bot commented Feb 11, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 63.699 ± 0.269 | 63.264 | 64.182 | 1.01 ± 0.01 |
| head | 63.136 ± 0.175 | 62.829 | 63.355 | 1.00 |


Labels

L1 Ethereum client performance Block execution throughput and performance in general

Projects

Status: In Review
Status: Todo

Development

Successfully merging this pull request may close these issues.

5 participants