
Switch RocksDB block cache from LRU to HyperClockCache #4473

Open
AhmedSoliman wants to merge 10 commits into main from pr4473

Conversation


github-actions bot commented Mar 9, 2026

Test Results

 5 files  ±0   5 suites  ±0   1m 6s ⏱️ -11s
34 tests ±0  34 ✅ ±0  0 💤 ±0  0 ❌ ±0 
52 runs  ±0  52 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit c6556c1. ± Comparison against base commit b07a875.


github-actions bot commented Mar 9, 2026

Test Results

  7 files  ± 0    7 suites  ±0   4m 54s ⏱️ + 2m 22s
 49 tests + 2   49 ✅ + 2  0 💤 ±0  0 ❌ ±0 
210 runs  +10  210 ✅ +10  0 💤 ±0  0 ❌ ±0 

Results for commit 39c89b4. ± Comparison against base commit efdb162.

♻️ This comment has been updated with latest results.

Extract two generic, reusable utilities into restate-futures-util:

**monotonic_token**: A lightweight mechanism for a producer to signal
completion of a prefix of sequentially issued work items. Provides
Token<T>, TokenOwner<T>, Tokens<T>, and TokenListener<T> types with
a phantom type parameter to prevent mixing tokens from different domains.
Uses atomics (Relaxed/Release/Acquire) for lock-free operation — no
RwLock or watch overhead.
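The atomic-prefix idea can be sketched in a few lines of plain Rust. This is an illustrative sketch only, not the actual `restate-futures-util` API: the names `WriteDomain`, `issue`, `release_up_to`, and `is_released` are assumptions, and the real crate also provides `Tokens<T>` and `TokenListener<T>` for async waiting.

```rust
use std::marker::PhantomData;
use std::sync::atomic::{AtomicU64, Ordering};

// Phantom domain tag: a Token<WriteDomain> cannot be passed where a token
// from another domain is expected.
struct WriteDomain;

struct Token<D> {
    seq: u64,
    _domain: PhantomData<D>,
}

struct TokenOwner<D> {
    next: AtomicU64,      // next sequence to hand out (Relaxed is enough)
    released: AtomicU64,  // highest fully-completed prefix
    _domain: PhantomData<D>,
}

impl<D> TokenOwner<D> {
    fn new() -> Self {
        Self {
            next: AtomicU64::new(1),
            released: AtomicU64::new(0),
            _domain: PhantomData,
        }
    }

    /// Hand out the next monotonically increasing token.
    fn issue(&self) -> Token<D> {
        Token { seq: self.next.fetch_add(1, Ordering::Relaxed), _domain: PhantomData }
    }

    /// Mark every token up to and including `token` as complete.
    /// Release ordering publishes the producer's writes to observers.
    fn release_up_to(&self, token: &Token<D>) {
        self.released.fetch_max(token.seq, Ordering::Release);
    }

    /// Acquire ordering pairs with the Release store above.
    fn is_released(&self, token: &Token<D>) -> bool {
        self.released.load(Ordering::Acquire) >= token.seq
    }
}
```

Because completion is a single monotonically increasing counter, checking a token is one atomic load with no lock acquisition on either side.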

**waiter_queue**: A priority-drainable queue (WaiterQueue<K, V>) designed
for the common case where entries arrive in key-order. Uses an adaptive
strategy: push_back for in-order inserts (O(1)), binary-search insert for
out-of-order (rare). Drain is always a simple front-pop. Includes a
Criterion benchmark comparing four strategies (naive, compact,
adaptive, sorted-insert).
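The adaptive strategy can be sketched with a `VecDeque` and `partition_point`. Again a hedged sketch: the method names (`insert`, `drain_up_to`) and drain semantics here are illustrative stand-ins for the actual `WaiterQueue<K, V>` API.

```rust
use std::collections::VecDeque;

// Sketch of the adaptive insert strategy: O(1) push_back for in-order
// keys, binary-search insert for the rare out-of-order arrival.
struct WaiterQueue<K: Ord, V> {
    entries: VecDeque<(K, V)>,
}

impl<K: Ord, V> WaiterQueue<K, V> {
    fn new() -> Self {
        Self { entries: VecDeque::new() }
    }

    fn insert(&mut self, key: K, value: V) {
        match self.entries.back() {
            // Common case: keys arrive in order -> O(1).
            Some((last, _)) if *last <= key => self.entries.push_back((key, value)),
            None => self.entries.push_back((key, value)),
            // Rare out-of-order arrival -> O(log n) search + shift.
            Some(_) => {
                let idx = self.entries.partition_point(|(k, _)| *k <= key);
                self.entries.insert(idx, (key, value));
            }
        }
    }

    /// Drain is always a simple front-pop: everything with key <= bound.
    fn drain_up_to(&mut self, bound: &K) -> Vec<(K, V)> {
        let mut out = Vec::new();
        while matches!(self.entries.front(), Some((k, _)) if k <= bound) {
            out.push(self.entries.pop_front().unwrap());
        }
        out
    }
}
```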

Both modules include comprehensive documentation and tests. Neither
references any specific use-case — they are general-purpose building
blocks.

This makes turning off loglet workers cleaner (next PR).
- Priority-queue based writer allowing seal messages to jump the queue
- Deduplication of seal messages and store messages
- Improved metrics for the write path (counting bytes, stores, and store status)
- Loglet workers shut down when quiescent and release their resources
- Writer task caps each batch based on the memtable size, which serves as reasonable guidance and removes the need for the `write-batch-commit-count` config.
- Removed the returned WriteBatch in the error case since write errors are terminal. This reduces the size of the returned Result.
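The "seal messages jump the queue" behavior in the list above amounts to a two-priority queue that stays FIFO within each priority. A minimal sketch, with hypothetical message names (the actual loglet message types differ):

```rust
use std::cmp::Ordering as CmpOrd;
use std::collections::BinaryHeap;

// Hypothetical message type: Seal must overtake queued Store messages.
#[derive(Debug)]
enum Msg {
    Store(u64),
    Seal(u64),
}

struct Queued {
    priority: u8, // higher pops first: Seal = 1, Store = 0
    seq: u64,     // issue order, for FIFO within a priority
    msg: Msg,
}

impl Ord for Queued {
    fn cmp(&self, other: &Self) -> CmpOrd {
        // Higher priority first; within a priority, lower seq (FIFO) first.
        // BinaryHeap is a max-heap, so reverse the seq comparison.
        self.priority.cmp(&other.priority).then(other.seq.cmp(&self.seq))
    }
}
impl PartialOrd for Queued {
    fn partial_cmp(&self, other: &Self) -> Option<CmpOrd> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Queued {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == CmpOrd::Equal
    }
}
impl Eq for Queued {}
```

A `BinaryHeap<Queued>` then pops any pending seal before all stores, while stores among themselves keep their original order.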
Replace raw usize/NonZeroUsize types with type-safe NonZeroByteCount
for all RocksDB memory budget configurations across the codebase.

Key changes:
- CommonOptions: make rocksdb_total_memory_size private behind a getter
  that enforces a 256 MiB minimum; rename rocksdb_actual_total_memtables_size
  to rocksdb_total_memtables_size with a 32 MiB floor; remove the 5% safety
  margin (rocksdb_safe_total_memtables_size); clamp memtables ratio to
  [0.1, 1.0] instead of [0.0, 1.0]
- LogServerOptions: remove data_service_memory_limit config (memory pool
  capacity is now derived from rocksdb_data_memtables_budget); fix metadata
  memtables budget to a constant 8 MiB instead of a ratio; enforce 40 MiB
  (32 MiB data + 8 MiB metadata) minimum for log-server memory budget
- MetadataServerOptions/StorageOptions: change rocksdb_memory_budget return
  types from usize to NonZeroByteCount with per-component minimums
- ByteCount: add arithmetic ops (Add, Mul, saturating_add/mul), Default,
  and TryFrom<u64> for NonZeroByteCount
- Remove unnecessary runtime assertions that were checking for non-zero on
  already non-zero types
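The type-safety pattern described above can be sketched with a thin wrapper over `NonZeroUsize`. This is an illustrative reduction, not the codebase's actual `NonZeroByteCount` (which carries more impls such as `Add`, `Mul`, and serde support); the `total_memory_size` getter below just demonstrates the floor-enforcing-getter idea with the 256 MiB minimum mentioned above.

```rust
use std::num::NonZeroUsize;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct NonZeroByteCount(NonZeroUsize);

impl NonZeroByteCount {
    const MIB: usize = 1024 * 1024;

    fn get(self) -> usize {
        self.0.get()
    }

    // Saturating arithmetic on two non-zero inputs can never yield zero,
    // so the unwrap below is safe.
    fn saturating_add(self, other: Self) -> Self {
        Self(NonZeroUsize::new(self.0.get().saturating_add(other.0.get())).unwrap())
    }
}

impl TryFrom<u64> for NonZeroByteCount {
    type Error = &'static str;
    fn try_from(v: u64) -> Result<Self, Self::Error> {
        // Sketch assumes a 64-bit target (u64 -> usize).
        NonZeroUsize::new(v as usize)
            .map(Self)
            .ok_or("byte count must be non-zero")
    }
}

/// Getter-style clamp: enforce a floor regardless of the configured value,
/// e.g. the 256 MiB minimum on rocksdb_total_memory_size.
fn total_memory_size(configured: NonZeroByteCount) -> NonZeroByteCount {
    let floor = NonZeroByteCount(NonZeroUsize::new(256 * NonZeroByteCount::MIB).unwrap());
    configured.max(floor)
}
```

With zero ruled out at the type level, the runtime non-zero assertions the last bullet removes become unnecessary by construction.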
Move db-level properties (is-write-stopped, background-errors,
num-running-compactions, actual-delayed-write-rate) from the per-CF set
to the per-DB set since they are database-wide. Also fix the unit of
actual-delayed-write-rate to Bytes and add blob-db metrics
(live-blob-file-size, live-blob-file-garbage-size) and
obsolete-sst-files-size for log-server observability.

Remove the shared rocksdb_max_background_jobs config (which gave every database
CPU_COUNT background jobs) and replace it with role-aware per-database budgets
for flushes and compactions. Flushes (latency-critical) are split equally across
databases while compactions (throughput-heavy) are weighted ~65% toward the
partition-store. Metadata-server and local-loglet get a fixed budget of 1+1.

Also adds worker.snapshots.export-concurrency-limit (default 4) to replace the
snapshot export concurrency that was previously derived from max_background_jobs.
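The ~65% compaction weighting can be illustrated with a small helper. This is a hedged sketch: the rounding, the at-least-one-thread guarantee, and the two-database split (partition-store vs. log-server) are assumptions for illustration, not the PR's actual budgeting code.

```rust
/// Split a compaction-thread pool between the partition-store (~65%,
/// throughput-heavy) and the log-server, giving each at least one thread.
fn compaction_budgets(total_compactions: usize) -> (usize, usize) {
    let partition_store = ((total_compactions as f64) * 0.65).round() as usize;
    // Keep at least one thread on each side of the split.
    let partition_store =
        partition_store.clamp(1, total_compactions.saturating_sub(1).max(1));
    let log_server = (total_compactions - partition_store).max(1);
    (partition_store, log_server)
}
```

On a 10-thread compaction pool this split yields 7 threads for the partition-store and 3 for the log-server; the metadata-server and local-loglet sit outside the pool with their fixed 1+1 budgets.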
Add two new options to RocksDbOptions:
- rocksdb-disable-wal-compression: disables Zstd WAL compression (default: false)
- rocksdb-disable-l0-l1-compression: disables Zstd L0/L1 SST compression (default: false)

Both options cascade from common rocksdb config and default to compression
enabled, preserving existing behavior. A new build_compression_per_level()
helper in the rocksdb crate constructs per-level compression arrays.
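The shape of a per-level compression array builder can be sketched as follows. To keep the example self-contained it uses a stand-in `Compression` enum rather than `rocksdb::DBCompressionType`, and the signature is an assumption, not the actual `build_compression_per_level()` helper.

```rust
// Stand-in for rocksdb::DBCompressionType so the sketch runs without
// the rocksdb crate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Compression {
    None,
    Zstd,
}

/// Build a compression choice per LSM level: optionally leave L0/L1
/// uncompressed (they are rewritten frequently), Zstd everywhere else.
fn build_compression_per_level(num_levels: usize, disable_l0_l1: bool) -> Vec<Compression> {
    (0..num_levels)
        .map(|level| {
            if disable_l0_l1 && level <= 1 {
                Compression::None
            } else {
                Compression::Zstd
            }
        })
        .collect()
}
```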
HyperClockCache (HCC) is the recommended default block cache for RocksDB, offering better scalability under concurrent access than the LRU cache.
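For reference, wiring an HCC block cache looks roughly like the fragment below. This is a hedged sketch assuming the rust-rocksdb bindings; exact constructor signatures vary by crate version.

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

// 1 GiB HyperClock block cache; an estimated entry charge of 0 lets
// RocksDB size entries dynamically.
let cache = Cache::new_hyper_clock_cache(1 << 30, 0);

let mut block_opts = BlockBasedOptions::default();
block_opts.set_block_cache(&cache);

let mut opts = Options::default();
opts.set_block_based_table_factory(&block_opts);
```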