Remote Compaction (Experimental)
NOTE: In 9.1.0, we renamed the APIs from Start()/StartV2() to Schedule() and from WaitForComplete()/WaitForCompleteV2() to Wait(), and added support for a generic unique identifier. Please refer to this PR for details.
The Remote Compaction feature lets the user run compactions remotely, either in a different process or on a remote host. It separates background compaction from the primary host, which has performance benefits and improves flexibility. In particular, when compactions are offloaded to a remote host, no background compaction job competes with read/write requests on the primary. A remote host dedicated to compaction can be tuned solely for that workload and can run compactions from different DBs. Currently, the remote host must have access to the DB to run the compaction.
Here is an overview of the Remote Compaction feature:
First, the primary DB triggers a compaction. Instead of running the compaction locally, it sends the compaction information to a callback in CompactionService. The user needs to implement CompactionService::Schedule(), which sends the compaction information to a remote process to schedule the compaction.
On the remote Compaction Worker side, the worker needs to run DB::OpenAndCompact() with the compaction information sent from the primary. Based on that information, the worker opens the DB in read-only mode and runs the compaction. The compaction worker cannot change the LSM tree; it writes the compaction result to a temporary location that the user needs to set.
Once the compaction is done, the compaction result, which includes metadata about the compacted SSTs and some internal information, needs to be sent back to the primary. As with scheduling, the user needs to implement the communication between the primary and the compaction workers.
The primary waits for the result in the callback CompactionService::Wait(). The user's implementation should pass the result to that API and then return from the call. After that, the primary installs the result by renaming the result SST files in the temporary workplace to LSM files. The compaction input files are then purged.
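The schedule/wait handshake described above can be modeled with a small in-process sketch. Everything here is a hypothetical stand-in written for illustration: `ToyCompactionService`, `ScheduleResponse`, `JobStatus`, and `RunOnWorker()` are not RocksDB types or functions, and the real callbacks have different signatures. The sketch also assumes the compaction input and result travel as opaque strings, and its `Wait()` returns a failure status instead of blocking when the result is not ready.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration only; NOT RocksDB types.
enum class JobStatus { kSuccess, kFailure };

struct ScheduleResponse {
  std::string scheduled_job_id;  // unique identifier chosen by the service
  JobStatus status;
};

// Toy service that mimics the Schedule()/Wait() handshake in one process.
class ToyCompactionService {
 public:
  // Primary side: hand the serialized compaction input to the "worker"
  // and return an identifier that a later Wait() call can refer to.
  ScheduleResponse Schedule(const std::string& compaction_input) {
    std::string id = "job-" + std::to_string(next_id_++);
    pending_[id] = compaction_input;
    return {id, JobStatus::kSuccess};
  }

  // Worker side: stand-in for running the compaction remotely and
  // producing a serialized result. Returns false for an unknown job.
  bool RunOnWorker(const std::string& id) {
    auto it = pending_.find(id);
    if (it == pending_.end()) return false;
    results_[id] = "compacted(" + it->second + ")";
    pending_.erase(it);
    return true;
  }

  // Primary side: fetch the serialized result for a scheduled job.
  // A real implementation would block until the worker finishes; this
  // toy version reports failure when no result is available yet.
  JobStatus Wait(const std::string& id, std::string* result) {
    auto it = results_.find(id);
    if (it == results_.end()) return JobStatus::kFailure;
    *result = std::move(it->second);
    results_.erase(it);
    return JobStatus::kSuccess;
  }

 private:
  std::uint64_t next_id_ = 1;
  std::map<std::string, std::string> pending_;  // scheduled, not yet run
  std::map<std::string, std::string> results_;  // finished, not yet fetched
};
```

In a real deployment the two maps would be replaced by whatever transport the user chooses (RPC, a queue, shared storage), since the communication between the primary and the workers is left to the user.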
Because RocksDB renames the result SST files, make sure the temporary workplace and the DB are on the same file system. If they are not, the user needs to copy the files to the DB file system before returning from the Wait() call.
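One possible way to handle the cross-file-system case is to stage each output file in a directory on the DB's file system before returning from the wait callback. The following is a minimal sketch using only the C++17 standard library; `StageOutputFile` and the directory layout are hypothetical and not part of RocksDB. It tries a rename first and falls back to copy-and-delete when the rename fails, which is the typical outcome when source and destination are on different file systems.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not a RocksDB API: moves one compaction output file
// into `dest_dir`, which is assumed to live on the DB's file system.
// Returns the new path of the file.
fs::path StageOutputFile(const fs::path& src, const fs::path& dest_dir) {
  fs::create_directories(dest_dir);
  fs::path dest = dest_dir / src.filename();

  std::error_code ec;
  fs::rename(src, dest, ec);  // cheap when both paths share a file system
  if (ec) {
    // Rename failed (for example, across file systems): copy, then delete
    // the source. copy_file throws if the copy itself cannot be done.
    fs::copy_file(src, dest, fs::copy_options::overwrite_existing);
    fs::remove(src);
  }
  return dest;
}
```

After staging, the temporary location reported back to the primary would need to point at `dest_dir` so that the final rename into the LSM tree stays within one file system.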
Here is the overview of the API between Primary and Compaction Worker. The Compaction Service part needs to be implemented by the user and set by Options.CompactionService.