RocksDB FAQ

Andrew Kryczka edited this page Apr 23, 2024 · 58 revisions

Building RocksDB

Q: What is the absolute minimum version of gcc that we need to build RocksDB?

A: 4.8.

Q: What is RocksDB's latest stable release?

A: All the releases in https://github.com/facebook/rocksdb/releases are stable. For RocksJava, stable releases are available in https://oss.sonatype.org/#nexus-search;quick~rocksdb.

Basic Read/Write

Q: Are basic operations Put(), Write(), Get() and NewIterator() thread safe?

A: Yes.

Q: Can I write to RocksDB using multiple processes?

A: No. Only one process can write to the database. However, other processes can open it as a secondary instance. If nothing writes to the database, it can be opened in read-only mode from multiple processes.

Q: Does RocksDB support multi-process read access?

A: Yes. You can read it through a secondary instance opened with DB::OpenAsSecondary(). RocksDB also supports read-only access from multiple processes when nothing writes to the database. To do this, open the database with DB::OpenForReadOnly().

Q: Is it safe to close RocksDB while another thread is issuing read, write or manual compaction requests?

A: No. The users of RocksDB need to make sure all functions have finished before they close RocksDB. You can speed up the waiting by calling DisableManualCompaction().

Q: What's the maximum key and value sizes supported?

A: In general, RocksDB is not designed for large keys. The maximum recommended sizes for key and value are 8MB and 3GB respectively.

Q: What's the fastest way to load data into RocksDB?

A: A fast way to insert data directly into the DB:

  1. use a single writer thread and insert in sorted order
  2. batch hundreds of keys into one write batch
  3. use the vector memtable
  4. make sure options.max_background_flushes is at least 4
  5. before inserting the data, disable automatic compaction and set options.level0_file_num_compaction_trigger, options.level0_slowdown_writes_trigger and options.level0_stop_writes_trigger to very large values. After inserting all the data, issue a manual compaction.

Steps 3-5 are done automatically if you call Options::PrepareForBulkLoad() on your options.

If you can pre-process the data offline before inserting, there is a faster way: sort the data, generate SST files with non-overlapping ranges in parallel, and bulk load the SST files. See https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files
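
The offline approach depends on cutting sorted data into chunks whose key ranges do not overlap, so that each chunk can be written to its own SST file independently. The following is a minimal sketch of that partitioning step in plain C++. SplitIntoRanges and chunk_size are hypothetical names used only for illustration, keys are assumed to be unique and ordered bytewise, and the step of writing each chunk to an SST file is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, std::string>;

// Hypothetical helper (not part of RocksDB): sorts the input by key and cuts
// it into chunks of at most `chunk_size` entries. Because the input is sorted
// first and keys are assumed unique, every key in chunk i is smaller than
// every key in chunk i+1, so the chunks cover non-overlapping key ranges.
std::vector<std::vector<KeyValue>> SplitIntoRanges(std::vector<KeyValue> data,
                                                   std::size_t chunk_size) {
  std::vector<std::vector<KeyValue>> chunks;
  if (chunk_size == 0) return chunks;  // nothing sensible to do
  std::sort(data.begin(), data.end(),
            [](const KeyValue& a, const KeyValue& b) {
              return a.first < b.first;
            });
  for (std::size_t begin = 0; begin < data.size(); begin += chunk_size) {
    const std::size_t end = std::min(begin + chunk_size, data.size());
    chunks.emplace_back(data.begin() + static_cast<std::ptrdiff_t>(begin),
                        data.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return chunks;
}
```

Each returned chunk could then be handed to a separate worker that writes one SST file, since no two chunks share a key range.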

Q: What is the correct way to delete the DB? Can I simply call DestroyDB() on a live DB?

A: The correct way is to close the DB and then destroy it. Calling DestroyDB() on a live DB is undefined behavior.

Q: What is the difference between DestroyDB() and directly deleting the DB directory manually?

A: The major difference is that DestroyDB() will take care of the case where the RocksDB database is stored in multiple directories. For instance, a single DB can be configured to store its data in multiple directories by specifying different paths to DBOptions::db_paths, DBOptions::db_log_dir, and DBOptions::wal_dir.

Q: Is there a better way to dump key-value pairs generated by a map-reduce job into RocksDB?

A: A better way is to use SstFileWriter, which allows you to directly create RocksDB SST files and add them to a RocksDB database. However, if you're adding SST files to an existing RocksDB database, their key ranges must not overlap with the database. See https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files

Q: Is it safe to read from or write to RocksDB inside a compaction filter callback?

A: It is safe to read, but not always safe to write, inside a compaction filter callback, because a write might cause a deadlock when a write-stop condition is triggered.

Q: Does RocksDB hold SST files and memtables for a snapshot?

A: No. See https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#gets-iterators-and-snapshots for how snapshots work.

Q: With DBWithTTL, is there a time bound for the expired keys to be removed?

A: DBWithTTL itself does not provide an upper time bound. Expired keys are removed when they are part of any compaction. However, there is no guarantee of when such a compaction will start. For instance, if you have a certain key range that is never updated, then compaction is less likely to apply to that key range. For leveled compaction, you can enforce some limit by using the periodic compaction feature. The feature currently has a limitation: if the write rate is so slow that a memtable flush is never triggered, the periodic compaction won't be triggered either.

Q: If I delete a column family, and I didn't yet delete the column family handle, can I still use it to access the data?

A: Yes. DropColumnFamily() only marks the specified column family as dropped. Its data is not actually removed until the column family has been marked as dropped and its reference count goes to zero.

Q: Why does RocksDB issue reads from the disk when I only make write requests?

A: Such reads come from compactions. A RocksDB compaction reads from one or more SST files, performs a merge-sort-like operation, generates new SST files, and deletes the old SST files it took as input.
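
To make the merge-sort-like step concrete, here is a toy model in plain C++. It is not RocksDB's implementation: Entry, Run and CompactRuns are hypothetical names, each Run stands in for one sorted SST file, and a std::map replaces the streaming k-way merge for brevity. Dropping deletion markers is only valid under the assumption that the output holds the oldest data for its key range.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One entry of a sorted run. `is_delete` marks a tombstone, in which case
// `value` is ignored.
struct Entry {
  std::string key;
  std::string value;
  bool is_delete;
};

// A sorted run stands in for the contents of one SST file.
using Run = std::vector<Entry>;

// Toy compaction. `runs_newest_first` must list newer runs before older ones.
// For each key the newest entry wins. If the winning entry is a tombstone,
// the key is omitted from the output.
Run CompactRuns(const std::vector<Run>& runs_newest_first) {
  std::map<std::string, Entry> newest;
  for (const Run& run : runs_newest_first) {
    for (const Entry& entry : run) {
      // insert() does nothing if the key is already present, so the entry
      // from the newer run is kept.
      newest.insert(std::make_pair(entry.key, entry));
    }
  }
  Run output;  // std::map iterates in key order, so the output stays sorted
  for (const auto& item : newest) {
    if (!item.second.is_delete) output.push_back(item.second);
  }
  return output;
}
```

Every input run has to be read in full to produce the output, which is why a write-only workload still generates disk reads.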

Q: Is block_size before compression, or after?

A: block_size refers to the size before compression.

Q: After using options.prefix_extractor, I sometimes see wrong results. What's wrong?

A: There are limitations in options.prefix_extractor. If prefix iterating is used, the iterator doesn't support Prev() or SeekToLast(), and in many cases SeekToFirst() is not supported either. A common mistake is to try to find the last key of a prefix by calling Seek() followed by Prev(). This is not supported. Currently there is no way to find the last key of a prefix with prefix iterating. Also, you can't continue iterating over keys after finishing the prefix you seek to. Where those operations are needed, you can set ReadOptions.total_order_seek = true to disable prefix iterating.
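
The access pattern that stays within these limits is to seek to the prefix and then move forward only, stopping at the first key outside the prefix. The sketch below models that pattern on a std::map rather than on a RocksDB iterator: lower_bound() plays the role of Seek() and ++it plays the role of Next(). SortedKeys and KeysWithPrefix are hypothetical names, and bytewise key ordering is assumed.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for a sorted key space. std::map orders std::string keys bytewise.
using SortedKeys = std::map<std::string, std::string>;

// Collects every key that starts with `prefix` using forward movement only:
// position at the first key >= prefix, then step forward until a key no
// longer carries the prefix.
std::vector<std::string> KeysWithPrefix(const SortedKeys& db,
                                        const std::string& prefix) {
  std::vector<std::string> result;
  for (auto it = db.lower_bound(prefix); it != db.end(); ++it) {
    if (it->first.compare(0, prefix.size(), prefix) != 0) {
      break;  // the iterator has left the prefix, so stop here
    }
    result.push_back(it->first);
  }
  return result;
}
```

In this model the last key of a prefix is only known after scanning forward through the whole prefix (it is the last element of the returned vector), which mirrors the restriction on Seek() followed by Prev().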

Q: If Put() or Write() is called with WriteOptions.sync=true, does it mean all previous writes are persistent too?

A: Yes, but only for all previous writes with WriteOptions.disableWAL=false.

Q: I disabled the write-ahead log and rely on DB::Flush() to persist the data. It works well for a single column family. Can I do the same if I have multiple column families?

A: Yes. Set options.atomic_flush=true to enable atomic flush across multiple column families.

Q: What's the best way to delete a range of keys?

A: See https://github.com/facebook/rocksdb/wiki/DeleteRange .

Q: What are column families used for?

A: The most common reasons for using column families are:

  1. Use different compaction settings, comparators, compression types, merge operators, or compaction filters for different parts of the data.
  2. Drop a column family to delete its data.
  3. Use one column family to store metadata and another one to store the data.

Q: What's the difference between storing data in multiple column families and in multiple RocksDB databases?

A: The main differences are in backup, atomic writes and write performance. The advantage of using multiple databases is that a database is the unit of backup or checkpoint, so it's easier to copy a database to another host than a column family. The advantages of using multiple column families are:

  1. Write batches are atomic across multiple column families in one database. You can't achieve this using multiple RocksDB databases.
  2. If you issue sync writes to the WAL, too many databases may hurt performance.

Q: Is RocksDB really “lockless” in reads?

A: Reads might acquire a mutex in the following situations:

  1. access the sharded block cache
  2. access table cache if options.max_open_files != -1
  3. if a read happens just after flush or compaction finishes, it may briefly hold the global mutex to fetch the latest metadata of the LSM tree.
  4. the memory allocators RocksDB relies on (e.g. jemalloc), may sometimes hold locks. These locks are only held rarely, or in fine granularity.

Q: If I update multiple keys, should I issue multiple Put(), or put them in one write batch and issue Write()?

A: Using a WriteBatch to batch multiple keys usually performs better than issuing a separate Put() for each key.

Q: What's the best practice to iterate all the keys?

A: If it's a small or read-only database, just create an iterator and iterate over all the keys. Otherwise, consider recreating the iterator once in a while, because an iterator prevents the resources it holds from being released. If you need to read from a consistent view, create a snapshot and iterate using it.

Q: I have different key spaces. Should I separate them using prefixes, or use different column families?

A: If each key space is reasonably large, it's a good idea to put them in different column families. If they can be small, consider packing multiple key spaces into one column family, to avoid the trouble of maintaining too many column families.

Q: Is the performance of iterator Next() the same as Prev()?

A: The performance of reversed iteration is usually much worse than forward iteration. There are various reasons for that:

  1. delta encoding in data blocks is more friendly to Next()
  2. the skip list used in the memtable is single-direction, so Prev() is another binary search
  3. the internal key order is optimized for Next().

Q: If I want to retrieve 10 keys from RocksDB, is it better to batch them and use MultiGet() versus issuing 10 individual Get() calls?

A: There are potential performance benefits in using MultiGet(). See https://github.com/facebook/rocksdb/wiki/MultiGet-Performance .

Q: If I have multiple column families and call the DB functions without a column family handle, what will the result be?

A: The call will operate only on the default column family.

Q: Can I reuse ReadOptions, WriteOptions, etc, across multiple threads?

A: As long as they are const, you are free to reuse them.

Feature Support

Q: Can I cancel a specific compaction?

A: No, you can't cancel one specific compaction.

Q: Can I close the DB when a manual compaction is in progress?

A: No, it's not safe to do that. However, you can call CancelAllBackgroundWork(db, true) in another thread to abort the running compactions, so that you can close the DB sooner. Since 6.5, you can also speed it up using DB::DisableManualCompaction().

Q: Is it safe to directly copy an open RocksDB instance?

A: No, unless the RocksDB instance is opened in read-only mode.

Q: Does RocksDB support replication?

A: No, RocksDB does not directly support replication. However, it offers some APIs that can be used as building blocks to support replication. For instance, GetUpdatesSince() allows developers to iterate through all updates since a specific point in time. See https://github.com/facebook/rocksdb/wiki/Replication-Helpers

Q: Does RocksDB support group commit?

A: Yes. Multiple write requests issued by multiple threads may be grouped together. One of the threads writes the WAL entries for all of those requests in a single write and, if configured, calls fsync once.

Q: Is it possible to scan/iterate over keys only? If so, is that more efficient than loading keys and values?

A: No, it is usually not more efficient. RocksDB's values are normally stored inline with keys. When a user iterates over the keys, the values are already loaded in memory, so skipping the value won't save much. In BlobDB, keys and large values are stored separately, so it may be beneficial to iterate over keys only, but this is not supported yet. We may add support in the future.

Q: Is the transaction object thread-safe?

A: No, it's not. You can't issue multiple operations to the same transaction concurrently. (Of course, you can execute multiple transactions in parallel, which is the point of the feature.)

Q: After an iterator moves away from a key/value, is the memory pointed to by that key/value still kept?

A: No, they can be freed, unless you set ReadOptions.pin_data = true and your setting supports this feature.

Q: Can I programmatically read data from an SST file?

A: Since version 6.5, you can do it using SstFileReader. In earlier versions it is not supported, but you can dump the data using sst_dump.

Q: RocksDB repair: when can I use it? Best-practices?

A: Check https://github.com/facebook/rocksdb/wiki/RocksDB-Repairer

Configuration and Tuning

Q: What's the default value of the block cache?

A: 8MB. That's too low for most use cases, so it's likely that you need to set your own value.

Q: Are bloom filter blocks of SST files always loaded to memory, or can they be loaded from disk?

A: The behavior is configurable. When BlockBasedTableOptions::cache_index_and_filter_blocks is set to true, bloom filters and index blocks are loaded into an LRU cache only when related Get() requests are issued. When cache_index_and_filter_blocks is set to false, RocksDB tries to keep the index blocks and bloom filters in memory for up to DBOptions::max_open_files SST files.

Q: Is it safe to configure a different prefix extractor for each column family?

A: Yes.

Q: Can I change the prefix extractor?

A: No. Once you've specified a prefix extractor, you cannot change it. However, you can disable it by specifying a null value.

Q: How do I configure RocksDB to use multiple disks?

A: You can create a single filesystem (ext3, xfs, etc.) on multiple disks. Then, you can run RocksDB on that single file system. Some tips when using disks:

  • If RAID is used, use a larger RAID stripe size (64KB is too small, 1MB would be excellent).
  • Consider enabling compaction read-ahead by setting ColumnFamilyOptions::compaction_readahead_size to at least 2MB.
  • If the workload is write-heavy, have enough compaction threads to keep the disks busy.
  • Consider enabling async write-behind for compaction.

Q: Can I open RocksDB with a different compression type and still read old data?

A: Yes. Since RocksDB stores the compression information in each SST file and performs decompression accordingly, you can change the compression and the DB will still be able to read existing files. In addition, you can specify a different compression for the last level by setting ColumnFamilyOptions::bottommost_compression.

Q: Can I put log files and SST files in different directories? How about information logs?

A: Yes. WAL files can be placed in a separate directory by specifying DBOptions::wal_dir. Information logs can also be written to a separate directory by using DBOptions::db_log_dir.

Q: If I use non-default comparators or merge operators, can I still use the ldb tool?

A: You cannot use the regular ldb tool in this case. However, you can build a custom ldb tool by passing your own options to rocksdb::LDBTool::Run(argc, argv, options) and compiling it.

Q: What will happen if I open RocksDB with a different compaction style?

A: When opening a RocksDB database with a different compaction style or compaction settings, one of the following scenarios will happen:

  1. The database will refuse to open if the new configuration is incompatible with the current LSM layout.
  2. If the new configuration is compatible with the current LSM layout, then RocksDB will continue and open the database. However, in order to make the new options take full effect, it might require a full compaction.

Consider using the migration helper function OptionChangeMigration(), which will compact the files to satisfy the new compaction style if needed.

Q: Does RocksDB have columns? If it doesn't have columns, why are there column families?

A: No, RocksDB doesn't have columns. See https://github.com/facebook/rocksdb/wiki/Column-Families for what a column family is.

Q: How can I estimate the space that can be reclaimed if I issue a full manual compaction?

A: There is no easy way to predict it accurately, especially when there is a compaction filter. If the database size is steady, the DB property rocksdb.estimate-live-data-size is the best estimate.

Q: What's the difference between a snapshot, a checkpoint and a backup?

A: A snapshot is a logical concept. Users can query data through the programming interface, but underlying compactions still rewrite existing files.

A checkpoint will create a physical mirror of all the database files using the same Env. This operation is very cheap if the file system hard-link can be used to create mirrored files.

A backup can move the physical database files to another Env (like HDFS). The backup engine also supports incremental copy between different backups.

Q: Which compression type should I use?

A: Start with LZ4 (or Snappy, if LZ4 is not available) for all levels for good performance. If you want to further reduce data size, try to use ZStandard (or Zlib, if ZStandard is not available) in the bottommost level. See https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#compression

Q: Is compaction needed if no key is deleted or overwritten?

A: Even if there is no need to clear out-of-date data, compaction is needed to ensure read performance.

Q: After a write with options.disableWAL=true, I write another record with options.sync=true. Will it persist the previous write too?

A: No. After the program crashes, writes with option.disableWAL=true will be lost, if they are not flushed to SST files.
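
The interaction between sync and disableWAL can be summarized as a small model. The sketch below is a toy in plain C++, not RocksDB code: WriteFlags and SurvivesCrash are hypothetical names, and it assumes that the machine crashes right after the last write, that no memtable flush and no DB::SyncWAL() call happened, and that a write combining disable_wal with sync does not sync anything.

```cpp
#include <cstddef>
#include <vector>

struct WriteFlags {
  bool disable_wal;  // models WriteOptions.disableWAL
  bool sync;         // models WriteOptions.sync
};

// Toy durability model under the assumptions listed above. A write survives
// only if it went to the WAL and a synced WAL write was issued at or after
// it. Writes that skipped the WAL never survive in this model.
std::vector<bool> SurvivesCrash(const std::vector<WriteFlags>& writes) {
  std::vector<bool> survives(writes.size(), false);
  bool synced_at_or_after = false;
  // Walk backwards so that each write knows whether a later sync covers it.
  for (std::size_t i = writes.size(); i-- > 0;) {
    if (!writes[i].disable_wal && writes[i].sync) synced_at_or_after = true;
    survives[i] = !writes[i].disable_wal && synced_at_or_after;
  }
  return survives;
}
```

For the sequence WAL write, non-WAL write, synced WAL write, WAL write, the model reports that only the first and third writes survive.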

Q: What is options.target_file_size_multiplier useful for?

A: It's a rarely used feature. For example, you can use it to reduce the number of the SST files.

Q: I observed bursts of write I/O. How can I eliminate them?

A: Try using the rate limiter. See https://github.com/facebook/rocksdb/wiki/Rate-Limiter

Q: Can I change the compaction filter without reopening the DB?

A: It's not supported. However, you can achieve the same effect by implementing your own CompactionFilterFactory, which can return different compaction filters.

Q: How many column families can a single db support?

A: Users should be able to run at least thousands of column families without seeing any error. However, too many column families don't usually perform well. We don't recommend using more than a few hundred column families.

Q: Can I reuse DBOptions or ColumnFamilyOptions to open multiple DBs or column families?

A: Yes. Internally, RocksDB always makes a copy of those options, so you can freely change them and reuse these objects.

Portability

Q: Can I run RocksDB and store the data on HDFS?

A: Yes, by using the Env returned by NewHdfsEnv(), RocksDB will store data on HDFS. However, the file lock is currently not supported in HDFS Env.

Q: Does RocksJava support all the features?

A: We are working toward making RocksJava feature-compatible. However, you're more than welcome to submit a pull request if you find something missing.

Backup

Q: Can I preserve a “snapshot” of RocksDB and later roll back the DB state to it?

A: Yes, via the BackupEngine or Checkpoints.

Q: Does BackupableDB create a point-in-time snapshot of the database?

A: Yes, when BackupOptions::backup_log_files = true, or when flush_before_backup = true is passed to CreateNewBackup().

Q: Does the backup process affect accesses to the database in the meantime?

A: No, you can keep reading and writing to the database at the same time.

Q: How can I configure RocksDB to backup to HDFS?

A: Use BackupableDB and set backup_env to the return value of NewHdfsEnv().

Failure Handling

Q: Does RocksDB throw exceptions?

A: No, RocksDB returns rocksdb::Status to indicate any error. However, RocksDB does not catch exceptions thrown by the STL or other dependencies. For instance, you may see std::bad_alloc when memory allocation fails, or similar exceptions in other situations.

Q: How does RocksDB handle read or write I/O errors?

A: If the I/O errors happen in foreground operations such as Get() and Write(), RocksDB returns a rocksdb::IOError status. If the error happens in background threads and options.paranoid_checks=true, RocksDB switches to read-only mode. All writes are then rejected with the status code representing the background error.

Q: How to distinguish type of exceptions thrown by RocksJava?

A: RocksJava throws RocksDBException for every RocksDB-related exception.

Failure Recovery

Q: If my process crashes, can it corrupt the database?

A: No, but data in the un-flushed memtables might be lost if Write Ahead Log (WAL) is disabled.

Q: If my machine crashes and reboots, will RocksDB preserve the data?

A: Data is synced when you issue a sync write (write with WriteOptions.sync=true), call DB::SyncWAL(), or when memtables are flushed.

Q: How can I find out the number of keys stored in a RocksDB database?

A: Use GetIntProperty(cf_handle, "rocksdb.estimate-num-keys", &num_keys) to obtain an estimated number of keys stored in a column family, or use GetAggregatedIntProperty("rocksdb.estimate-num-keys", &num_keys) to obtain an estimated number of keys stored in the whole RocksDB database.

Q: Why can GetIntProperty only return an estimated number of keys in a RocksDB database?

A: Obtaining an accurate number of keys in any LSM database like RocksDB is a challenging problem, because such databases contain duplicate keys and deletion entries (i.e., tombstones), and a full compaction would be required to get an accurate count. In addition, if the RocksDB database uses merge operators, the estimated number of keys becomes even less accurate.
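
A toy model shows why duplicate keys and tombstones distort a count based on entries. The code below is plain C++ and does not reproduce how RocksDB computes its property: Op, ExactLiveKeys and EstimatedKeys are hypothetical names, and the estimate is assumed, as a simplification, to be the number of put entries minus the number of delete entries.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

struct Op {
  std::string key;
  bool is_delete;  // true for a deletion entry (tombstone), false for a put
};

// Exact number of live keys, obtained by replaying every operation in order.
// This is the expensive answer that corresponds to a full compaction.
std::uint64_t ExactLiveKeys(const std::vector<Op>& ops) {
  std::set<std::string> live;
  for (const Op& op : ops) {
    if (op.is_delete) {
      live.erase(op.key);
    } else {
      live.insert(op.key);
    }
  }
  return live.size();
}

// Assumed counter-based estimate: puts minus deletes, clamped at zero. It
// looks only at entry counts and never at which keys the entries refer to.
std::uint64_t EstimatedKeys(const std::vector<Op>& ops) {
  std::uint64_t puts = 0;
  std::uint64_t deletes = 0;
  for (const Op& op : ops) {
    if (op.is_delete) {
      ++deletes;
    } else {
      ++puts;
    }
  }
  return puts > deletes ? puts - deletes : 0;
}
```

Under this model, overwriting an existing key pushes the estimate above the exact count, and deleting a key that was never written pushes it below.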

Resource Management

Q: What resources does an iterator hold, and when will these resources be released?

A: Iterators hold both data blocks and memtables in memory. The resources each iterator holds are:

  1. The data blocks that the iterator is currently pointing to. See https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#blocks-pinned-by-iterators
  2. The memtables that existed when the iterator was created, even after the memtables have been flushed.
  3. All the SST files on disk that existed when the iterator was created, even if they are compacted.

These resources will be released when the iterator is deleted.

Q: How can I estimate the total size of index and filter blocks in a DB?

A: For an offline DB, "sst_dump --show_properties --command=none" will show you the index and filter size for a specific SST file. You can sum them up for the whole DB. For a running DB, you can fetch the DB property kAggregatedTableProperties, or call DB::GetPropertiesOfAllTables() and sum up the index and filter block sizes of the individual files.

Q: Can RocksDB tell us the total number of keys in the database? Or the total number of keys within a range?

A: RocksDB can estimate the number of keys through the DB property “rocksdb.estimate-num-keys”. Note that this estimate can be far off when there are merge operators, when existing keys are overwritten, or when non-existing keys are deleted.

The best way to estimate the total number of keys within a range is to first estimate the size of the range by calling DB::GetApproximateSizes(), and then estimate the number of keys from that.
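
The second step is simple arithmetic: divide the approximate size of the range by an average entry size. A minimal sketch follows. EstimateKeysInRange is a hypothetical helper, the range size is assumed to come from DB::GetApproximateSizes(), and the average on-disk entry size is an assumption the caller has to supply, for example from a sample of the data.

```cpp
#include <cstdint>

// Hypothetical helper (not part of RocksDB): converts an approximate range
// size in bytes into an approximate key count. Both inputs are estimates, so
// the result is only a rough figure. With compression enabled, the on-disk
// entry size is smaller than the raw key plus value size.
std::uint64_t EstimateKeysInRange(std::uint64_t approximate_range_bytes,
                                  std::uint64_t average_entry_bytes) {
  if (average_entry_bytes == 0) return 0;  // avoid division by zero
  return approximate_range_bytes / average_entry_bytes;
}
```

For example, a range reported as 64 MiB with an assumed average entry size of 128 bytes gives 67108864 / 128 = 524288 keys.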

Others

Q: Who is using RocksDB?

A: https://github.com/facebook/rocksdb/blob/main/USERS.md

Q: How should I implement multiple data shards/partitions?

A: You can use one RocksDB database per shard/partition. Multiple RocksDB instances can be run as separate processes or within a single process. When multiple instances of RocksDB are used within a single process, some resources (such as the thread pool, block cache, rate limiter, etc.) can be shared between those RocksDB instances. See https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#support-for-multiple-embedded-databases-in-the-same-process

Q: DB operations fail because of out-of-space. How can I unblock myself?

A: First clear up some free space. The DB will automatically start accepting operations once enough free space is available. The only exception is if 2PC is enabled and the WAL sync fails (in this case, the DB needs to be reopened). See Background Error Handling for more details.
