⚠️ This library is in active development: there is currently no release schedule!
This package is developed by Well-Typed LLP on behalf of Input Output Global, Inc. (IOG) and INTERSECT. The main contributors are Duncan Coutts, Joris Dral, Matthias Heinzel, Wolfgang Jeltsch, Wen Kokke, and Alex Washburn.
This package contains an efficient implementation of on-disk key–value
storage, implemented as a log-structured merge-tree or LSM-tree. An
LSM-tree is a data structure for key–value mappings, similar to
Data.Map
, but optimized for large tables with a high insertion volume.
It has support for:
- Basic key–value operations, such as lookup, insert, and delete.
- Range lookups, which efficiently retrieve the values for all keys in a given range.
- Monoidal upserts, which combine the stored and new values.
- BLOB storage, which associates a large auxiliary BLOB with a key.
- Durable on-disk persistence and rollback via named snapshots.
- Cheap table duplication, where all duplicates can be independently accessed and modified.
- High-performance lookups on SSDs using I/O batching and parallelism.
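As a brief illustration of the basic key–value operations listed above, here is a hedged sketch of what using the simple API could look like. The function names and signatures (withOpenSession, withTable, insert, delete, lookup) are assumptions chosen for illustration, not a verbatim copy of the library's API; see Database.LSMTree.Simple for the real interface and its serialisation requirements.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hedged sketch only: the LSM.* helpers used here are assumed names,
-- not necessarily the functions exported by Database.LSMTree.Simple.
import           Data.ByteString (ByteString)
import qualified Database.LSMTree.Simple as LSM

example :: IO (Maybe ByteString)
example =
  LSM.withOpenSession "/tmp/lsm-example" $ \session ->  -- session rooted at a directory (assumed helper)
    LSM.withTable session $ \table -> do                -- fresh table in the session (assumed helper)
      LSM.insert table ("user:42" :: ByteString) ("Alice" :: ByteString)
      LSM.delete table ("user:17" :: ByteString)
      LSM.lookup table ("user:42" :: ByteString)        -- expected result: Just "Alice"
```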
This package exports two modules:
- Database.LSMTree.Simple

  This module exports a simplified API which picks sensible defaults for a number of configuration parameters. It does not support upserts or BLOBs, due to their unintuitive interaction; see Upsert and BLOB. If you are looking at this package for the first time, it is strongly recommended that you start by reading this module.

- Database.LSMTree

  This module exports the full API.
The interaction between upserts and BLOBs is unintuitive. An upsert updates the value associated with the key by combining the new and old values with a user-specified function. However, any BLOB associated with the key is simply deleted.
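To make this interaction concrete, the following toy model (not the library's API; the types and the argument order of the combining function are assumptions for illustration) shows the logical effect of a single upsert on an existing entry:

```haskell
-- Toy model of a logical table entry: a value plus an optional BLOB.
-- This is not the library's representation; it only illustrates the semantics.
data Entry v blob = Entry { entryValue :: v, entryBlob :: Maybe blob }
  deriving Show

-- Applying an upsert combines the new value with the stored one (if any)
-- using the user-specified function, and discards any stored BLOB.
applyUpsert :: (v -> v -> v) -> v -> Maybe (Entry v blob) -> Entry v blob
applyUpsert _       new Nothing              = Entry new Nothing
applyUpsert combine new (Just (Entry old _)) = Entry (combine new old) Nothing
```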
- This package only supports 64-bit, little-endian systems.
- On Windows, the package has only been tested with NTFS filesystems.
- On Linux, executables using this package, including test and benchmark suites, must be compiled with the -threaded RTS option enabled.
LSM-trees can be used concurrently, but with a few restrictions:
- Each session locks its session directory. This means that a database cannot be accessed from different processes at the same time.
- Tables can be used concurrently, and concurrent use of read operations such as lookups is deterministic. However, concurrent use of write operations such as insert or delete with any other operation results in a race condition.
The worst-case behaviour of the library is described using big-O notation. The documentation provides two measures of complexity:
- The time complexity of operations is described in terms of the number of disk I/O operations and is referred to as the disk I/O complexity. In practice, the running time of operations on LSM-trees is dominated by the number of disk I/O actions.
- The space complexity of tables is described in terms of the in-memory size of an LSM-tree table. Both the in-memory and on-disk size of an LSM-tree table scale linearly with the number of physical entries. However, the in-memory size of an LSM-tree table is smaller than its on-disk size by a significant constant. This is discussed in detail below, under In-memory size of tables.
The complexities are described in terms of the following variables and constants:
- The variable $n$ refers to the number of physical table entries. A physical table entry is any key–operation pair, e.g., Insert k v or Delete k, whereas a logical table entry is determined by all physical entries with the same key. If the variable $n$ is used to describe the complexity of an operation that involves multiple tables, it refers to the sum of all table entries.
- The variable $o$ refers to the number of open tables and cursors in the session.
- The variable $s$ refers to the number of snapshots in the session.
- The variable $b$ usually refers to the size of a batch of inputs/outputs. Its precise meaning is explained for each occurrence.
- The constant $B$ refers to the size of the write buffer, which is determined by the TableConfig parameter confWriteBufferAlloc.
- The constant $T$ refers to the size ratio of the table, which is determined by the TableConfig parameter confSizeRatio.
- The constant $P$ refers to the average number of key–value pairs that fit in a page of memory.
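As an informal aside on how these quantities combine: if each on-disk level $i$ holds up to roughly $B \cdot T^{i}$ entries (consistent with the sum $\sum_{i=1}^{l}BT^{i}$ used in the description of DiskCacheLevelOneTo below), then a table with $n$ physical entries needs on the order of $\lceil \log_T (n / B) \rceil$ levels on disk. For example, with $n = 10^{8}$, $B = 20{,}000$, and an assumed size ratio $T = 4$, that is $\lceil \log_4 5000 \rceil = 7$ levels.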
The following table summarises the worst-case cost of the operations on
LSM-trees measured in the number of disk I/O operations. If the cost
depends on the merge policy or merge schedule, then the table contains
one entry for each relevant combination. Otherwise, the merge policy
and/or merge schedule is listed as N/A. The merge policy and merge
schedule are determined by the TableConfig
parameters
confMergePolicy
and confMergeSchedule
.
Resource | Operation | Merge policy | Merge schedule | Worst-case disk I/O complexity |
---|---|---|---|---|
Session | Open | N/A | N/A | O(1) |
Session | Close | LazyLevelling | N/A | |
Table | New | N/A | N/A | O(1) |
Table | Close | LazyLevelling | N/A | |
Table | Lookup | LazyLevelling | N/A | |
Table | Range Lookup | N/A | N/A | |
Table | Insert/Delete/Update | LazyLevelling | Incremental | |
Table | Insert/Delete/Update | LazyLevelling | OneShot | |
Table | Duplicate | N/A | N/A | O(0) |
Table | Union | N/A | N/A | |
Snapshot | Save | LazyLevelling | N/A | |
Snapshot | Open | N/A | N/A | |
Snapshot | Delete | LazyLevelling | N/A | |
Snapshot | List | N/A | N/A | O(s) |
Cursor | New | LazyLevelling | N/A | |
Cursor | Close | LazyLevelling | N/A | |
Cursor | Next | N/A | N/A | |

(*The variable $b$ refers to the size of a batch of inputs/outputs; see the list of variables above.)
The in-memory and the on-disk size of an LSM-tree scale linearly with the number of physical entries. However, the in-memory size is smaller by a significant factor. Let us look at a table that uses the default configuration and has 100 million entries with 34 byte keys and 60 byte values. The total size of 100 million key–value pairs is approximately 8.75GiB. Hence, the on-disk size would be at least 8.75GiB, not counting the overhead for metadata.
The in-memory size would be approximately 265.39MiB:
- The write buffer would store at most 20,000 entries, which is approximately 2.86MiB.
- The fence-pointer indexes would store approximately 2.29 million keys, which is approximately 9.30MiB.
- The Bloom filters would use 15.78 bits per entry, which is approximately 188.11MiB.
For a discussion of how the sizes of these components are determined by the table configuration, see Fine-tuning Table Configuration.
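The headline figures in the example above can be reproduced with plain unit conversion. The 100 million entries, 34-byte keys, 60-byte values, and 15.78 bits per entry are the numbers from the example; nothing else is assumed:

```haskell
-- Rough reproduction of two of the figures in the example above.
entries :: Double
entries = 100e6                                -- 100 million physical entries

onDiskGiB :: Double
onDiskGiB = entries * (34 + 60) / 1024 ** 3
-- 9.4e9 bytes, approximately 8.75 GiB: the on-disk estimate.

bloomFiltersMiB :: Double
bloomFiltersMiB = entries * 15.78 / 8 / 1024 ** 2
-- 1.578e9 bits, approximately 188.11 MiB: the Bloom filter estimate.
```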
The total size of an LSM-tree must not exceed a fixed maximum; exceeding this limit results in a TableTooLargeError.
confMergePolicy
The merge policy balances the performance of lookups against the
performance of updates. Levelling favours lookups. Tiering favours
updates. Lazy levelling strikes a middle ground between levelling and
tiering, and moderately favours updates. This parameter is explicitly
referenced in the documentation of those operations it affects.
confSizeRatio
The size ratio pushes the effects of the merge policy to the extreme.
If the size ratio is higher, levelling favours lookups more, and tiering
and lazy levelling favour updates more. This parameter is referred to as $T$ in the description of the worst-case disk I/O complexity.
confWriteBufferAlloc
The write buffer capacity balances the performance of lookups and
updates against the in-memory size of the table. If the write buffer is
larger, it takes up more memory, but lookups and updates are more
efficient. This parameter is referred to as $B$ in the description of the worst-case disk I/O complexity.
confMergeSchedule
The merge schedule balances the performance of lookups and updates
against the smooth performance of updates. The merge schedule does not
affect the performance of table unions. With the one-shot merge
schedule, lookups and updates are more efficient overall, but some
updates may take much longer than others. With the incremental merge
schedule, lookups and updates are less efficient overall, but each
update does a similar amount of work. This parameter is explicitly
referenced in the documentation of those operations it affects.
confBloomFilterAlloc
The Bloom filter size balances the performance of lookups against the
in-memory size of the table. If the Bloom filters are larger, they take
up more memory, but lookup operations are more efficient.
confFencePointerIndex
There are two types of fence-pointer indexes. Ordinary indexes are
designed to work with any key. Compact indexes are optimised for the
case where the keys in the database are uniformly distributed, e.g.,
when the keys are hashes.
confDiskCachePolicy
The disk cache policy determines if lookup operations use the OS page
cache. Caching may improve the performance of lookups if database access
follows certain patterns.
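Putting these parameters together, a configuration override might look roughly like the sketch below. The field names are the ones documented above; the existence of a defaultTableConfig value and the exact constructor spellings are assumptions for illustration, not a verbatim copy of the library's API.

```haskell
{- A hedged sketch, assuming the record fields documented above and some
   defaultTableConfig :: TableConfig are in scope (e.g. from Database.LSMTree).
   The constructor names follow the ones used in this documentation. -}
myConfig :: TableConfig
myConfig = defaultTableConfig
  { confWriteBufferAlloc  = AllocNumEntries 20000   -- larger write buffer, more memory
  , confBloomFilterAlloc  = AllocRequestFPR 1.0e-3  -- target roughly a 0.1% false-positive rate
  , confFencePointerIndex = CompactIndex            -- keys are uniformly distributed (hashes)
  , confDiskCachePolicy   = DiskCacheNone           -- access pattern is uniformly random
  }
```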
The configuration parameters confMergePolicy
, confSizeRatio
, and
confWriteBufferAlloc
affect how the table organises its data. To
understand what effect these parameters have, one must have a basic
understanding of how an LSM-tree stores its data. The physical entries in
an LSM-tree are key–operation pairs, which pair a key with an operation
such as an Insert
with a value or a Delete
. These key–operation
pairs are organised into runs, which are sequences of key–operation
pairs sorted by their key. Runs are organised into levels, which are
unordered sequences of runs. Levels are organised hierarchically. Level
0 is kept in memory, and is referred to as the write buffer. All
subsequent levels are stored on disk, with each run stored in its own
file. An example LSM-tree layout can be pictured with each run as a boxed sequence of keys and each level as a row.
The data in an LSM-tree is partially sorted: only the key–operation pairs within each run are sorted and deduplicated. As a rule of thumb, keeping more of the data sorted means lookup operations are faster but update operations are slower.
The configuration parameters confMergePolicy
, confSizeRatio
, and
confWriteBufferAlloc
directly affect a table's data layout. The
parameter confWriteBufferAlloc
determines the capacity of the write
buffer.
AllocNumEntries maxEntries
The write buffer can contain at most maxEntries entries. The constant $B$ is equal to maxEntries. Irrespective of this parameter, the write buffer size cannot exceed 4GiB.
The parameter confSizeRatio determines the ratio between the
capacities of successive levels. The constant $T$ is equal to confSizeRatio. For instance, if the size ratio is 4, then each level can hold four times as many entries as the level before it.
The merge policy confMergePolicy determines the number of runs per
level. In a tiering LSM-tree, each level contains up to $T$ runs. In a levelling LSM-tree, each level contains a single run. In a lazy levelling LSM-tree, the last level contains a single run, while each of the other levels contains up to $T$ runs.
Tiering favours the performance of updates. Levelling favours the performance of lookups. Lazy levelling strikes a middle ground between tiering and levelling. It favours the performance of lookup operations for the oldest data and enables more deduplication, without the impact that full levelling has on update operations.
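The level capacities implied by this layout can be written down directly; the helper below mirrors the sum $\sum_{i=1}^{l}BT^{i}$ used later in the DiskCacheLevelOneTo description. The concrete values of B and T in the comment are illustrative, not defaults taken from the library.

```haskell
-- Capacity of on-disk level i, with write buffer capacity b (the constant B)
-- and size ratio t (the constant T): level i holds up to b * t^i entries.
levelCapacity :: Integer -> Integer -> Int -> Integer
levelCapacity b t i = b * t ^ i

-- For example, with b = 20000 and t = 4:
--   levelCapacity 20000 4 1 == 80000
--   levelCapacity 20000 4 5 == 20480000
```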
The configuration parameter confMergeSchedule
affects the worst-case
performance of lookup and update operations and the structure of runs.
Regardless of the merge schedule, the amortised disk I/O complexity of
lookups and updates is logarithmic in the size of the table. When the
write buffer fills up, its contents are flushed to disk as a run and
added to level 1. When some level fills up, its contents are flushed
down to the next level. Eventually, as data is flushed down, runs must
be merged. This package supports two schedules for merging:
- Using the OneShot merge schedule, runs must always be kept fully sorted and deduplicated. However, flushing a run down to the next level may cause the next level to fill up, in which case it too must be flushed and merged further down. In the worst case, this can cascade down the entire table. Consequently, the worst-case disk I/O complexity of updates is linear in the size of the table. This is unsuitable for real-time systems and other use cases where unresponsiveness is unacceptable.
- Using the Incremental merge schedule, runs can be partially merged, which allows the merging work to be spread out evenly across all update operations. This aligns the worst-case and average-case disk I/O complexity of updates: both are logarithmic in the size of the table. The cost is a small constant overhead for both lookup and update operations.
The merge schedule does not affect the performance of table unions; instead, there are separate operations for one-shot and incremental unions. The amortised disk I/O complexity of one-shot unions is linear in the size of the tables. For incremental unions, it is up to the user to spread the merging work out evenly over time.
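As an illustration of what spreading that work out could look like, here is a hedged sketch. The helpers it assumes (incrementalUnion, remainingUnionDebt, supplyUnionCredits) are hypothetical names chosen for illustration, not necessarily the library's API.

```haskell
-- Hedged sketch: drive an incremental union forward a bounded amount at a time.
-- All LSM.* identifiers are assumed names, not a verbatim copy of the API.
spreadUnionWork :: LSM.Table k v -> LSM.Table k v -> IO (LSM.Table k v)
spreadUnionWork t1 t2 = do
    t <- LSM.incrementalUnion t1 t2      -- set up the union; merging is deferred
    loop t
    pure t
  where
    creditsPerStep = 1000 :: Int         -- bounded amount of work per step (arbitrary)
    loop t = do
      debt <- LSM.remainingUnionDebt t   -- merging work still outstanding
      if debt <= 0
        then pure ()
        else LSM.supplyUnionCredits t creditsPerStep >> loop t
```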
The configuration parameter confBloomFilterAlloc
affects the size of
the Bloom filters, which balances the performance of lookups against the
in-memory size of the table.
Tables maintain a Bloom filter in memory for each run on disk. These Bloom filters are probabilistic data structures that are used to track which keys are present in their corresponding run. Querying a Bloom filter returns either "maybe", meaning the key is possibly in the run, or "no", meaning the key is definitely not in the run. When a query returns "maybe" while the key is not in the run, this is referred to as a false positive. While the database executes a lookup operation, any Bloom filter query that returns a false positive causes the database to unnecessarily read a run from disk. The number of spurious reads per lookup follows a binomial distribution whose parameters are the number of runs in the table, which depends on the merge policy (e.g., LazyLevelling), and the false-positive rate of the Bloom filters.

The false-positive rate scales exponentially with the size of the Bloom filters in bits per entry.
False-positive rate (FPR) | Bits per entry (BPE) |
---|---|
The configuration parameter confBloomFilterAlloc
can be specified in
two ways:
AllocFixed bitsPerEntry
Allocate the requested number of bits per entry in the table.
The value must be strictly positive, but fractional values are permitted.
The recommended range is $[2, 24]$.
AllocRequestFPR falsePositiveRate
Allocate the required number of bits per entry to get the requested
false-positive rate.
The value must be in the range $(0, 1)$.
The total in-memory size of all Bloom filters scales linearly with the
number of physical entries in the table and is determined by the number
of physical entries multiplied by the number of bits per physical entry,
i.e., $n \cdot \mathrm{BPE}$ bits in total.
False-positive rate (FPR) | Bloom filter size | Expected spurious reads per lookup |
---|---|---|
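For intuition about this trade-off, the textbook sizing formula for an optimally-hashed Bloom filter can be used to relate a target false-positive rate to bits per entry and total memory. This is the standard formula, not necessarily the exact sizing the library uses internally:

```haskell
-- Standard Bloom filter sizing: bits per entry ≈ -ln fpr / (ln 2)^2.
bitsPerEntryForFPR :: Double -> Double
bitsPerEntryForFPR fpr = negate (log fpr) / (log 2 * log 2)

-- Total in-memory size of all filters, in MiB, for n physical entries.
totalFilterMiB :: Double -> Double -> Double
totalFilterMiB n fpr = n * bitsPerEntryForFPR fpr / 8 / 1024 ** 2

-- For example, totalFilterMiB 100e6 1.0e-3 ≈ 171 MiB
-- (about 14.4 bits per entry for a 0.1% false-positive rate).
```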
The configuration parameter confFencePointerIndex
affects the type and
size of the fence-pointer indexes. Tables maintain a fence-pointer index
in memory for each run on disk. These fence-pointer indexes store the
keys at the boundaries of each page of memory to ensure that each lookup
has to read at most one page of memory from each run. Tables support two
types of fence-pointer indexes:
OrdinaryIndex
Ordinary indexes are designed for any use case.
Ordinary indexes store one serialised key per page of memory. The total
in-memory size of all indexes is therefore roughly $\frac{n}{P}$ serialised keys, i.e., the number of pages multiplied by the average size of a serialised key.
CompactIndex
Compact indexes are designed for the use case where the keys in the
table are uniformly distributed, such as when using hashes.
Compact indexes store the 64 most significant bits of the minimum
serialised key of each page of memory. This requires that serialised
keys are at least 64 bits in size. Compact indexes store 1 additional
bit per page of memory to resolve collisions, 1 additional bit per page
of memory to mark entries that are larger than one page, and a
negligible amount of memory for tie breakers. The total in-memory size
of all indexes is therefore roughly $66$ bits per page of memory, i.e., approximately $\frac{66 \cdot n}{P}$ bits in total.
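The in-memory cost of the two index types described above can be estimated with a couple of lines. Here numPages stands for the number of pages on disk (roughly $\frac{n}{P}$); the constants come directly from the description, and any per-run overhead is ignored:

```haskell
-- Estimated in-memory size of ordinary indexes: one serialised key per page.
ordinaryIndexBytes :: Double -> Double -> Double
ordinaryIndexBytes numPages avgKeyBytes = numPages * avgKeyBytes

-- Estimated in-memory size of compact indexes: 64 key bits plus 2 extra bits
-- per page (collision bit and larger-than-page bit), ignoring tie breakers.
compactIndexBytes :: Double -> Double
compactIndexBytes numPages = numPages * 66 / 8
```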
The configuration parameter confDiskCachePolicy
determines how the
database uses the OS page cache. This may improve performance if the
database's access pattern has good temporal locality or good
spatial locality. The database's access pattern refers to the pattern
by which entries are accessed by lookup operations. An access pattern
has good temporal locality if it is likely to access entries that were
recently accessed or updated. An access pattern has good spatial
locality if it is likely to access entries that have nearby keys.
- Use the DiskCacheAll policy if the database's access pattern has either good spatial locality or both good spatial and temporal locality.
- Use the DiskCacheLevelOneTo l policy if the database's access pattern has good temporal locality for updates only. The variable l determines the number of levels that are cached. For a description of levels, see Merge Policy, Size Ratio, and Write Buffer Size. With this setting, the database can be expected to cache up to $\frac{k}{P}$ pages of memory, where $k$ refers to the number of entries that fit in levels $[1,l]$ and is defined as $\sum_{i=1}^{l}BT^{i}$ (a small calculation sketch follows this list).
- Use the DiskCacheNone policy if the database's access pattern does not have good spatial or temporal locality, for instance, if the access pattern is uniformly random.
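The expected number of cached pages mentioned for DiskCacheLevelOneTo can be computed directly from the sum given above:

```haskell
-- Pages expected to be cached under DiskCacheLevelOneTo l:
-- k = sum_{i=1..l} B * T^i entries, i.e. roughly k / P pages.
cachedPages :: Integer -> Integer -> Integer -> Integer -> Integer
cachedPages b t p l = sum [ b * t ^ i | i <- [1 .. l] ] `div` p

-- For example, cachedPages 20000 4 40 2 == (80000 + 320000) `div` 40 == 10000.
```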
The implementation of LSM-trees in this package draws inspiration from:
- Chris Okasaki. 1998. "Purely Functional Data Structures". doi:10.1017/CBO9780511530104
- Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. "Monkey: Optimal Navigable Key-Value Store". doi:10.1145/3035918.3064054
- Subhadeep Sarkar, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. 2021. "Constructing and analyzing the LSM compaction design space". doi:10.14778/3476249.3476274