
Make RocksDB WAL and L0/L1 compression configurable#4472

Open
AhmedSoliman wants to merge 9 commits into main from pr4472

Conversation

AhmedSoliman (Contributor) commented Mar 9, 2026

github-actions bot commented Mar 9, 2026

Test Results

  7 files  +2     7 suites  +2    4m 49s ⏱️ +3m 58s
 49 tests  +31    49 ✅ +31    0 💤 ±0    0 ❌ ±0
210 runs   +174  210 ✅ +174   0 💤 ±0    0 ❌ ±0

Results for commit fcc0c66. ± Comparison against base commit efdb162.

This pull request removes 18 and adds 49 tests. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.AwakeableIngressEndpointTest ‑ completeWithFailure(Client)
dev.restate.sdktesting.tests.AwakeableIngressEndpointTest ‑ completeWithSuccess(Client)
dev.restate.sdktesting.tests.IngressTest ‑ idempotentInvokeSend(Client)
dev.restate.sdktesting.tests.IngressTest ‑ idempotentInvokeService(Client)
dev.restate.sdktesting.tests.IngressTest ‑ idempotentInvokeVirtualObject(Client)
dev.restate.sdktesting.tests.IngressTest ‑ idempotentSendThenAttachWIthIdempotencyKey(Client)
dev.restate.sdktesting.tests.IngressTest ‑ privateService(URI, Client)
dev.restate.sdktesting.tests.JournalRetentionTest ‑ journalShouldBeRetained(Client, URI)
dev.restate.sdktesting.tests.KafkaAndWorkflowAPITest ‑ callSharedWorkflowHandler(URI, int, Client)
dev.restate.sdktesting.tests.KafkaAndWorkflowAPITest ‑ callWorkflowHandler(URI, int, Client)
…
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[1]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[2]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[3]
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwaitAny(Client)
…

♻️ This comment has been updated with latest results.

Extract two generic, reusable utilities into restate-futures-util:

**monotonic_token**: A lightweight mechanism for a producer to signal
completion of a prefix of sequentially issued work items. Provides
Token<T>, TokenOwner<T>, Tokens<T>, and TokenListener<T> types with
a phantom type parameter to prevent mixing tokens from different domains.
Uses atomics (Relaxed/Release/Acquire) for lock-free operation — no
RwLock or watch overhead.
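The token mechanism might look roughly like the following minimal sketch. Only the type names above come from the PR; everything else here is a hypothetical reconstruction (`Tokens<T>` is omitted, and a single-owner/single-listener shape is assumed):

```rust
use std::marker::PhantomData;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical sketch; the real types live in restate-futures-util.
struct Token<T> {
    seq: u64,
    _domain: PhantomData<T>,
}

struct TokenOwner<T> {
    next: u64,
    completed: Arc<AtomicU64>,
    _domain: PhantomData<T>,
}

struct TokenListener<T> {
    completed: Arc<AtomicU64>,
    _domain: PhantomData<T>,
}

impl<T> TokenOwner<T> {
    fn new() -> (Self, TokenListener<T>) {
        let completed = Arc::new(AtomicU64::new(0));
        (
            TokenOwner { next: 1, completed: completed.clone(), _domain: PhantomData },
            TokenListener { completed, _domain: PhantomData },
        )
    }

    /// Issue the next sequential token (only the producer mutates `next`).
    fn issue(&mut self) -> Token<T> {
        let seq = self.next;
        self.next += 1;
        Token { seq, _domain: PhantomData }
    }

    /// Mark the prefix up to `token` complete (Release publishes prior writes).
    fn complete_up_to(&self, token: &Token<T>) {
        self.completed.fetch_max(token.seq, Ordering::Release);
    }
}

impl<T> TokenListener<T> {
    /// Check whether the prefix up to `token` is complete (Acquire pairs with Release).
    fn is_complete(&self, token: &Token<T>) -> bool {
        self.completed.load(Ordering::Acquire) >= token.seq
    }
}
```

The phantom type parameter makes, say, `Token<WalDomain>` and `Token<SealDomain>` distinct types, so the compiler rejects mixing tokens from different domains.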

**waiter_queue**: A priority-drainable queue (WaiterQueue<K, V>) designed
for the common case where entries arrive in key-order. Uses an adaptive
strategy: push_back for in-order inserts (O(1)), binary-search insert for
out-of-order (rare). Drain is always a simple front-pop. Includes a
Criterion benchmark comparing four strategies (naive, compact,
adaptive, sorted-insert).

Both modules include comprehensive documentation and tests. Neither
references any specific use-case — they are general-purpose building
blocks.
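The adaptive strategy can be sketched with a `VecDeque`-backed queue; this is a minimal stand-in, not the real `WaiterQueue<K, V>` from restate-futures-util:

```rust
use std::collections::VecDeque;

// Hypothetical sketch of the adaptive insert strategy described above.
struct WaiterQueue<K: Ord, V> {
    entries: VecDeque<(K, V)>,
}

impl<K: Ord, V> WaiterQueue<K, V> {
    fn new() -> Self {
        Self { entries: VecDeque::new() }
    }

    /// In-order inserts (the common case) are O(1); out-of-order inserts
    /// (rare) fall back to a binary search for the insertion point.
    fn insert(&mut self, key: K, value: V) {
        match self.entries.back() {
            Some((last, _)) if *last > key => {
                // Rare path: keep the queue sorted via binary search.
                let idx = self.entries.partition_point(|(k, _)| *k <= key);
                self.entries.insert(idx, (key, value));
            }
            _ => self.entries.push_back((key, value)), // common O(1) path
        }
    }

    /// Drain every entry with key <= bound; always a simple front-pop.
    fn drain_up_to(&mut self, bound: &K) -> Vec<(K, V)> {
        let mut drained = Vec::new();
        while matches!(self.entries.front(), Some((k, _)) if k <= bound) {
            drained.push(self.entries.pop_front().unwrap());
        }
        drained
    }
}
```

Because the queue is kept sorted on insert, the drain path never has to scan or sort, which is what makes the front-pop drain cheap regardless of how entries arrived.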
…closed

This makes turning off loglet workers cleaner (next PR).
- Priority-queue based writer allowing seal messages to jump the queue
- Deduplication of seal messages and store messages
- Improved metrics for the write path (counting bytes, stores, and store status)
- Loglet workers will shut down when quiescent and release resources
- Writer task limits the batch based on the memtable size as reasonable guidance, removing the need for the `write-batch-commit-count` config.
- Removed the returned WriteBatch in the error case since write errors are terminal. This reduces the size of the returned Result.
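The seal-jumps-the-queue behavior could be sketched with a `BinaryHeap` keyed on (kind, sequence number). All names here are hypothetical and deduplication is omitted:

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

// Hypothetical sketch of a priority writer where seal messages
// jump ahead of store messages.
#[derive(Debug, PartialEq, Eq, Clone)]
enum Message {
    Seal,
    Store(Vec<u8>),
}

struct Queued {
    is_seal: bool,
    seq: u64,
    message: Message,
}

impl PartialEq for Queued {
    fn eq(&self, other: &Self) -> bool {
        self.is_seal == other.is_seal && self.seq == other.seq
    }
}
impl Eq for Queued {}
impl PartialOrd for Queued {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Queued {
    fn cmp(&self, other: &Self) -> Ordering {
        // Seals sort above stores; within a kind, lower seq drains first (FIFO).
        (self.is_seal, Reverse(self.seq)).cmp(&(other.is_seal, Reverse(other.seq)))
    }
}

struct PriorityWriter {
    heap: BinaryHeap<Queued>,
    next_seq: u64,
}

impl PriorityWriter {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    fn push(&mut self, message: Message) {
        let is_seal = matches!(message, Message::Seal);
        self.heap.push(Queued { is_seal, seq: self.next_seq, message });
        self.next_seq += 1;
    }

    fn pop(&mut self) -> Option<Message> {
        self.heap.pop().map(|q| q.message)
    }
}
```

A seal pushed after a backlog of stores is popped first, while stores among themselves keep arrival order via the sequence number.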
Replace raw usize/NonZeroUsize types with type-safe NonZeroByteCount
for all RocksDB memory budget configurations across the codebase.

Key changes:
- CommonOptions: make rocksdb_total_memory_size private behind a getter
  that enforces a 256 MiB minimum; rename rocksdb_actual_total_memtables_size
  to rocksdb_total_memtables_size with a 32 MiB floor; remove the 5% safety
  margin (rocksdb_safe_total_memtables_size); clamp memtables ratio to
  [0.1, 1.0] instead of [0.0, 1.0]
- LogServerOptions: remove data_service_memory_limit config (memory pool
  capacity is now derived from rocksdb_data_memtables_budget); fix metadata
  memtables budget to a constant 8 MiB instead of a ratio; enforce 40 MiB
  (32 MiB data + 8 MiB metadata) minimum for log-server memory budget
- MetadataServerOptions/StorageOptions: change rocksdb_memory_budget return
  types from usize to NonZeroByteCount with per-component minimums
- ByteCount: add arithmetic ops (Add, Mul, saturating_add/mul), Default,
  and TryFrom<u64> for NonZeroByteCount
- Remove unnecessary runtime assertions that were checking for non-zero on
  already non-zero types
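The pattern behind these changes might look like the following sketch of a `NonZeroByteCount` newtype with a floor-enforcing getter; the exact API in the Restate codebase may differ:

```rust
use std::num::NonZeroUsize;

const MIB: usize = 1024 * 1024;

// Hypothetical sketch of the NonZeroByteCount pattern described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct NonZeroByteCount(NonZeroUsize);

impl NonZeroByteCount {
    fn new(bytes: usize) -> Option<Self> {
        NonZeroUsize::new(bytes).map(Self)
    }

    fn get(self) -> usize {
        self.0.get()
    }

    /// Non-zero times a clamped-to-non-zero factor stays non-zero,
    /// so the unwrap below cannot fail.
    fn saturating_mul(self, rhs: usize) -> Self {
        Self(NonZeroUsize::new(self.get().saturating_mul(rhs.max(1))).unwrap())
    }

    /// Enforce a minimum, mirroring e.g. the 256 MiB floor on the
    /// total-memory getter described above.
    fn with_min(self, floor: usize) -> Self {
        Self::new(self.get().max(floor)).unwrap()
    }
}

impl TryFrom<u64> for NonZeroByteCount {
    type Error = &'static str;

    fn try_from(v: u64) -> Result<Self, Self::Error> {
        usize::try_from(v)
            .ok()
            .and_then(NonZeroUsize::new)
            .map(Self)
            .ok_or("byte count must be non-zero and fit in usize")
    }
}
```

Because the type can never hold zero, the runtime non-zero assertions mentioned above become unnecessary by construction.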
Move db-level properties (is-write-stopped, background-errors,
num-running-compactions, actual-delayed-write-rate) from the per-CF set
to the per-DB set since they are database-wide. Also fix the unit of
actual-delayed-write-rate to Bytes and add blob-db metrics
(live-blob-file-size, live-blob-file-garbage-size) and
obsolete-sst-files-size for log-server observability.
…ion budgets

Remove the shared rocksdb_max_background_jobs config (which gave every database
CPU_COUNT background jobs) and replace it with role-aware per-database budgets
for flushes and compactions. Flushes (latency-critical) are split equally across
databases while compactions (throughput-heavy) are weighted ~65% toward the
partition-store. Metadata-server and local-loglet get a fixed budget of 1+1.

Also adds worker.snapshots.export-concurrency-limit (default 4) to replace the
snapshot export concurrency that was previously derived from max_background_jobs.
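The partition-store/log-server split could look roughly like this; the exact weights, rounding, and floors in the Restate codebase may differ, so treat this as an illustration of the arithmetic only:

```rust
// Hypothetical sketch of role-aware background-job budgeting.
// Metadata-server and local-loglet are assumed to get a fixed
// 1 flush + 1 compaction each and are handled separately.
#[derive(Debug, PartialEq)]
struct JobBudget {
    flushes: usize,
    compactions: usize,
}

/// Split a node-wide background-job budget between the partition-store
/// and the log-server databases.
fn split_budget(total_jobs: usize, partition_store_weight: f64) -> (JobBudget, JobBudget) {
    // Reserve roughly half the budget for flushes, half for compactions.
    let flushes = (total_jobs / 2).max(2);
    let compactions = (total_jobs - total_jobs / 2).max(2);

    // Flushes are latency-critical: split equally.
    let ps_flushes = (flushes / 2).max(1);
    let ls_flushes = (flushes - ps_flushes).max(1);

    // Compactions are throughput-heavy: weight toward the partition-store (~65%).
    let ps_compactions =
        ((compactions as f64 * partition_store_weight).round() as usize).max(1);
    let ls_compactions = compactions.saturating_sub(ps_compactions).max(1);

    (
        JobBudget { flushes: ps_flushes, compactions: ps_compactions },
        JobBudget { flushes: ls_flushes, compactions: ls_compactions },
    )
}
```

For example, a budget of 8 jobs with a 0.65 weight yields 2 flushes + 3 compactions for the partition-store and 2 flushes + 1 compaction for the log-server.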
Add two new options to RocksDbOptions:
- rocksdb-disable-wal-compression: disables Zstd WAL compression (default: false)
- rocksdb-disable-l0-l1-compression: disables Zstd L0/L1 SST compression (default: false)

Both options cascade from common rocksdb config and default to compression
enabled, preserving existing behavior. A new build_compression_per_level()
helper in the rocksdb crate constructs per-level compression arrays.
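A per-level compression array in the spirit of `build_compression_per_level()` might look like the sketch below. The real helper presumably works with `rocksdb::DBCompressionType` and feeds `Options::set_compression_per_level`; the stand-in enum here keeps the example self-contained, and the two-level cutoff is an assumption:

```rust
// Hypothetical stand-in for rocksdb::DBCompressionType.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    None,
    Zstd,
}

/// Build one compression setting per LSM level. When L0/L1 compression is
/// disabled, the two top levels stay uncompressed (cheap to write,
/// short-lived); deeper levels remain Zstd.
fn build_compression_per_level(
    num_levels: usize,
    disable_l0_l1_compression: bool,
) -> Vec<Compression> {
    (0..num_levels)
        .map(|level| {
            if level < 2 && disable_l0_l1_compression {
                Compression::None
            } else {
                Compression::Zstd
            }
        })
        .collect()
}
```

With the default (compression enabled), the array is all-Zstd, which preserves the pre-PR behavior described above.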