Purpose/Abstract
I aim to reduce memory overhead and increase performance by optimizing sorted sets (ZSETs) in Valkey. I have implemented a proof-of-concept B+ tree variant to replace the existing skiplist; it saves about 24B per item and improves performance for most operations, some significantly: 7x faster rank lookup and 16-36x faster iteration.
Assumptions
When optimizing performance there are some important but potentially counterintuitive assumptions which we can make that are generally true for data structures in Valkey:
- Memory is cold when a command starts. Valkey is constantly reading and writing data to memory, so data is quickly pushed from cache between commands.
- Worry about fetch count, not instruction count. The CPU can execute 400-1200 instructions in the time it takes to complete a cold fetch from system RAM. Compared to a fetch, instructions are nearly free.
Background
A ZSET is an ordered set: a set of unique elements, each with a score (stored as a double) used to sort them. Elements with equal scores are sorted lexicographically. There are lexicographical commands that only work if all scores are equal; otherwise their output is undefined.
Valkey supports a variety of commands on ZSETs:
- Item Manipulation: add, remove, increment, get by score, get by rank
- Stats: item count, count by score range, count by lex range
- Set ops: intersection, union, diff
- Range queries: by rank, score, or lexicographically
- Bulk range deletion: by rank, score, or lexicographically
Common use cases
- Leaderboards
  - Commands: add, get rank, range by rank
  - Typical size: 1-100 sets, 100k+ elements per set, ~24B per item
- Priority queues
  - Commands: add, pop
  - Typical size: 1-20 sets, 1k+ elements per set, 50B-200B per item
- Time series data, activity feeds
  - Commands: add, range by score, delete range by rank
  - Typical size: ~10k sets, ~50k items per set, ~24B per item
- Secondary indexing
  - Commands: add, delete, range by score, range by lex
  - Typical size: 1-3 sets, 1M+ elements, ~24B per item
Out of Scope
I am only addressing the encoding for larger sets, currently a skiplist paired with a hashtable. The listpack encoding for small sets is out of scope. (Listpack is used for sets of <= 128 items whose elements are no larger than 64 bytes; both thresholds are configurable.)
Problem Summary
Larger ZSETs are encoded in a skiplist paired with a hashtable. The hashtable provides O(1) unordered lookups, and the skiplist provides O(log n) ordered lookups, such as looking up an item by rank or score, or finding the rank of a given element. The skiplist has O(n) space complexity.
Illustration and more info about skiplists: https://en.wikipedia.org/wiki/Skip_list
In Valkey, each skiplist element is a separate allocation containing the score, element data, and a variable number of skiplist pointers, depending on the node’s height in the skiplist structure. The height is incremented repeatedly with 25% probability of adding a new layer, 75% chance of exiting the loop.
Every node has an 8B back pointer and an 8B pointer to the element data – 16B. The expected number of levels is 1 + 0.25 + 0.0625 + … = 1.33, and each level stores an 8B forward pointer and an 8B span indicating how many elements the jump will skip. That works out to 16 + 1.33 × 16, or about 37.3B of overhead per item on average. Broadly speaking, this is a lot of pointers and we can do better.
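For reference, the skiplist node described above looks roughly like this in Valkey (a simplified sketch; field names are close to, but not guaranteed identical to, the real definition):

typedef char *sds;  /* Valkey's dynamic string type (pointer-sized) */

typedef struct zskiplistNode {
    sds ele;                            /* pointer to the element bytes (8B) */
    double score;                       /* sort score (8B) */
    struct zskiplistNode *backward;     /* back pointer (8B) */
    struct zskiplistLevel {
        struct zskiplistNode *forward;  /* forward pointer (8B per level) */
        unsigned long span;             /* elements skipped by this link (8B per level) */
    } level[];                          /* flexible array, ~1.33 entries on average */
} zskiplistNode;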
Requirements
- Improve memory efficiency
- Do not regress performance, improve it if possible
Memory efficiency and performance are key features of Valkey that customers care about. Memory efficiency directly improves storage capacity, which is the driving constraint for customers much more often than performance. In addition, there is currently a global RAM shortage driving memory prices up by as much as 500%, which amplifies the financial impact of memory savings.
Recommendation: Replace skiplist with fbtree
After a review of recent computer science research, I chose to implement a simplified version of the FB+-tree (Feature B+ Tree) described in detail by Chen et al.: https://arxiv.org/abs/2503.23397
The FB+-tree described in the paper dedicates much effort and complexity towards supporting high performance concurrency. In Valkey we don’t need concurrency support because only the main thread executes commands and manipulates stored data. Removing concurrency support is the only major modification I made to the paper’s design.
My FB+-tree, which I’ll call fbtree going forward, can be explained as a series of incremental optimizations applied to a conventional B+ tree:
- ZSET’s score+element design complicates comparison between values. I would like to be able to interpret it as a single block of data to be ordered lexicographically without having to make exceptions. To do this I store a normalized version of score prepended to the element bytes, such that sorting lexicographically achieves the same score+element ordering. (See Appendix A for more details/code.) Going forward we no longer worry about any score/element distinction or boundary.
- The leaf node targets a 512B size (see the layout sketch after this list).
  - The exact size of a jemalloc size class, meaning no wasted memory
  - 8 cache lines long. (Memory is always fetched in cache line chunks, typically 64B.) Fetching multiple cache lines is not dramatically slower than fetching a single cache line. Hardware prefetchers are very fast on this kind of linear access.
  - 61 items per node. 61 being close to 64 is convenient for efficient SIMD work.
- We don’t store anchor (separator) values in inner nodes. Leaf nodes own all values, and inner nodes store non-owning pointers to their anchor values. This saves memory.
- During lookup we search among each node’s anchor values to identify the correct next node as we traverse down the tree. This binary search causes O(log n) memory fetches. To reduce this, we store a 4-byte “feature” from each anchor value in the tree node. We use this data first to narrow the range of children to consider with no additional memory fetches, before performing a binary search among the remaining possibilities. Depending on the data distribution, this can eliminate binary searches and O(log n) memory fetches.
- We use prefix compression on our feature data. We separately store the prefix shared by all anchors, and the feature data holds the 4 bytes of each anchor that follow that prefix. This maximizes how much we can narrow down candidates using only feature data with no extra memory fetches. We tune the size of the stored prefix so the inner node exactly matches a jemalloc size class.
- By storing feature data in a SIMD-friendly shape, we can quickly compare each of the 4 bytes against all of the stored features in parallel.
- As we traverse down the tree during a lookup we track already-compared prefix bytes and skip comparing them again; a subtree's common prefix is always at least as long as its parent's.
- To support rank lookup, an inner node stores the size of each child's subtree, instead of storing it in the child node. This way when performing rank lookups, we know exactly which child node to traverse to with no additional fetches.
  - The sizes are stored as uint32_t to save memory, but this effectively limits fbtree size to ~4.2 billion items, rather than 18 quintillion if uint64_t were used.
- For fast append/prepend inserts, we have multiple optimizations:
  - Asymmetric splits: when the new item is before or after all children in a full node, we simply create a new node which contains only that new item. This avoids node updates and memory moves. For sequential adds this leaves 100% full nodes behind instead of 50% full nodes from 50/50 splits.
  - We keep shortcut pointers to the first and last leaf nodes. We can use these to quickly check for append/prepend, and if the target leaf has free space we simply add the new item to it while skipping much of the tree traversal and update logic. If it is an append/prepend but a split is needed, we pass a hint to the insert logic enabling it to skip search logic and perform an asymmetric split.
- To optimize iteration, we link the leaf nodes together in a doubly linked list. This provides fast forward and reverse iteration with no tree walk required.
- In leaf nodes we match the pointer value itself with a linear search in situations where the element is known to exist, such as rank calculation or deletion. This avoids fetching any value data other than the element we're searching for. Because we already have a paired hashtable, we can check membership and get the pointer value in O(1) time.
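To make the above more concrete, here is a sketch of node layouts consistent with the numbers in this list. These are illustrative only: the field names, exact packing, and padding sizes are my assumptions here, not the proof-of-concept's actual definitions.

#include <stdint.h>

#define FBT_FANOUT        61   /* items per leaf / children per inner node */
#define FBT_FEATURE_BYTES 4    /* feature bytes stored per anchor */

/* 512B leaf (a jemalloc size class): 61 item pointers plus metadata. */
typedef struct fbtLeaf {
    struct fbtLeaf *prev, *next;          /* doubly linked leaf list for iteration */
    uint16_t count;                       /* used slots */
    uint16_t pad[3];
    unsigned char *items[FBT_FANOUT];     /* pointers to normalized score+element blobs */
} fbtLeaf;                                /* 16 + 8 + 488 = 512B */

/* 1536B inner node (also a jemalloc size class). The prefix length is whatever
 * pads the struct to exactly 1536B; 53 is just the number that fits this sketch. */
typedef struct fbtInner {
    uint16_t count;                                 /* children in use */
    uint8_t  prefixLen;                             /* length of shared anchor prefix */
    uint8_t  prefix[53];                            /* prefix shared by all anchors */
    uint8_t  feature[FBT_FEATURE_BYTES][64];        /* column-wise feature bytes, padded to 64 for SIMD */
    const unsigned char *anchors[FBT_FANOUT];       /* non-owning pointers into leaf-owned values */
    uint32_t subtreeSize[FBT_FANOUT];               /* items under each child, for rank lookups */
    void    *child[FBT_FANOUT];                     /* child inner nodes or leaves */
} fbtInner;

/* Tree handle with the first/last-leaf shortcuts used for fast prepend/append. */
typedef struct fbtree {
    void    *root;
    fbtLeaf *head, *tail;
    uint32_t size;                                  /* uint32_t: ~4.2 billion item limit */
} fbtree;

And a simplified illustration of the SIMD feature comparison: with feature bytes stored column-wise, one query byte can be compared against all children at once. (The real search narrows a candidate range rather than just testing equality, but the mechanics are the same.)

#include <immintrin.h>  /* AVX2; an ARM/NEON equivalent is listed under remaining work */

/* Return a bitmask of children whose feature byte in this column equals the
 * query's byte. Two 32-byte compares cover all 64 (padded) slots. */
static inline uint64_t featureEqualMask(const uint8_t column[64], uint8_t queryByte) {
    __m256i q  = _mm256_set1_epi8((char)queryByte);
    __m256i lo = _mm256_loadu_si256((const __m256i *)column);
    __m256i hi = _mm256_loadu_si256((const __m256i *)(column + 32));
    uint32_t mlo = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(lo, q));
    uint32_t mhi = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(hi, q));
    return ((uint64_t)mhi << 32) | mlo;
}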
Let’s briefly analyze space complexity, assuming 100% load factor to start with. A leaf node is 512B but serves 61 items, so 8.4B per item. The level above is a 1536B inner node, but it serves 61*61 or 3721 items, so 0.41B per item. The next level serves 226981 items with 0.007B per item. Intuitively we can see that it is quickly approaching a limit and indeed it is a converging geometric series: 8.8B per item at infinite size.
This is at 100% load factor, which we can achieve when the dataset is populated sequentially. For data populated in a random order, we expect an average load factor of ln(2), or 69.3%, resulting in about 12.7B of overhead per item. For a mix of random inserts and deletes the load factor will be a little lower, depending on the merge policy and the exact insert:delete ratio. For example, if we free-at-empty instead of merge-at-half we can expect around a 39% load factor, resulting in 22.6B per item.
- ln (2) load factor: https://cs.uwaterloo.ca/research/tr/1982/CS-82-17.pdf
- 39% free-at-empty load factor: https://www.sciencedirect.com/science/article/pii/002200009390020W
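For reference, a small back-of-the-envelope helper (purely illustrative) that reproduces the figures above from the 512B/61-slot leaf and 1536B/61-way inner node parameters:

#include <stdio.h>

/* Structural bytes per item as a function of load factor. The inner-node
 * levels form a geometric series that converges almost immediately. */
static double overheadPerItem(double loadFactor) {
    double perItem = 512.0 / 61.0;              /* leaf level */
    double itemsServed = 61.0;
    for (int level = 0; level < 10; level++) {  /* inner levels */
        itemsServed *= 61.0;
        perItem += 1536.0 / itemsServed;
    }
    return perItem / loadFactor;
}

int main(void) {
    printf("100%% load: %.1fB  ln(2) load: %.1fB  free-at-empty: %.1fB\n",
           overheadPerItem(1.0), overheadPerItem(0.693), overheadPerItem(0.39));
    return 0;   /* prints roughly 8.8B, 12.7B, 22.6B */
}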
Tradeoffs and Uncertainty
- Slightly slower in a few hot-memory situations
- Number of allocations increases by about 1.7%
- More complex than skiplist, higher surface area for bugs
- Range deletion hasn't been implemented or tested yet. Optimizing it is somewhat complex and its final performance is difficult to predict.
Results
These tests were performed with micro benchmarks using the Google Benchmark library, testing skiplist and fbtree in isolation without the rest of Valkey involved. This means that these results are not representative of overall Valkey performance, but they do provide a much more direct and accurate comparison of the data structures. Keep in mind that the percentage gains will be less dramatic when the rest of Valkey is included in the results.
Memory efficiency is better than skiplist at any size
In the plot below the bump in the fbtree line occurs at the first node split when it shifts from 1 leaf node to 2 leaf nodes and 1 inner node. Later splits are increasingly amortized and aren’t noticeable on the plot. The vertical line at 128 denotes the current threshold when Valkey switches from listpack encoding to skiplist/fbtree.
Performance is better than skiplist in all operations
This is especially true when data is cold – not in CPU cache. There is one exception where skiplist slightly beats fbtree for lookup by score when data is hot in sets with roughly 8,000 elements or less.
This set of tests was done on an x86 platform with AVX2 enabled. Multiple sets were used so that the total dataset size was 2x the LLC (last-level cache), which was 46 MB on my dev machine, and requests were distributed round-robin in an effort to make memory fetches cold.
The following set of tests was done with single sets, meaning that set sizes smaller than the LLC fit completely within CPU cache, such that nearly all memory fetches for these tests were hot. A vertical line has been drawn to mark the size of the LLC on the test machine for reference. As size increases, fetches increasingly must descend to lower levels of cache or to system RAM, but there isn't a clear cutoff.
Remaining Work
I estimate the remaining dev work and testing can be finished in 10 weeks. I confess I am not the best at estimating, but I’ve made an effort to avoid being overly optimistic.
- 4 weeks - I need to implement node merging and deletion, as well as range deletion.
- 4 weeks - fbtree is not yet integrated with ZSET, which will be needed for integration tests with Valkey. I wrote an interface to abstract the ordered index used by ZSET which I plan to keep. This componentization facilitates unit testing, other uses of fbtree, and potential future replacements of the ordered index data structure. This includes defrag support, memory dismiss (fork/snapshot optimization), lazy free, etc.
- 3 days - I need to implement SIMD optimizations for ARM architectures.
- 2 days - Minor work to support both big endian and little endian
- 1 week - Experiment with other node sizes and feature sizes. I have not tried alternatives yet, but as these are centrally defined parameters it should be relatively trivial to do so.
Future Enhancement Ideas
- Items are often small (perhaps 24B) and of uniform size. Could we make an embedded-value leaf node that avoids storing the 61*8B pointers (see the rough sketch after this list)? Perhaps we could detect uniform-size elements? This is nontrivial because the companion hashtable needs to be kept up to date.
- Most 64-bit platforms only use 6B per pointer in practice, though there are exceptions. Could we compress pointers to save memory or increase node fanout? Would this be a compile-time option, a config option, or determined dynamically at startup by querying Linux to determine 6B pointer compatibility?
- Prefetching? The data structures and access patterns are already designed with hardware prefetching and SIMD in mind, so we may not gain anything.
- When there are only a few leaf nodes, we can probably omit inner nodes and just use a doubly linked list of leaf nodes. This would improve memory efficiency for medium size sets with performance likely a compromise between listpack and proper fbtree, with no reallocation of leaf nodes needed for conversion to a tree.
- Combined with embedded leaf nodes, this could potentially replace listpack in this use case. If we think of it as a doubly linked list of listpacks, we might expect similar memory efficiency but skiplist-like performance when we can quickly skip through buckets. If listpack were replaced we would eliminate the listpack-to-fbtree conversion latency.
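As a purely hypothetical sketch of the embedded-value idea above (the name, field layout, and numbers are mine, not a design): with uniform 24B items, a 512B leaf could hold about 20 items inline instead of 61 pointers.

/* Hypothetical embedded-value leaf for uniform-size items; illustrative only. */
typedef struct fbtEmbeddedLeaf {
    struct fbtEmbeddedLeaf *prev, *next;  /* leaf list, as in the pointer-based leaf */
    uint16_t count;                       /* used slots */
    uint16_t itemSize;                    /* uniform size of each embedded item */
    uint32_t pad;
    unsigned char data[488];              /* count * itemSize bytes stored inline */
} fbtEmbeddedLeaf;                        /* still 512B; ~20 slots at 24B per item */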
Appendix A: Normalizing Doubles
IEEE 754 doubles don't sort correctly with memcmp because the sign bit is the most significant bit, and negative numbers use sign+magnitude instead of two’s complement.
To normalize doubles to a memcmp-sortable form:
- Negative (sign bit set): flip all bits → makes negatives sort before positives, and more-negative before less-negative
- Positive (sign bit clear): flip only sign bit → shifts positives above negatives in unsigned comparison
- Then swap bytes on little-endian architectures so memcmp compares MSB first.
#include <stdint.h>
#include <string.h>

/* Encode a double so that memcmp on the 8 output bytes orders values the same
 * way as comparing the original doubles. Assumes a little-endian host; big
 * endian support is listed under remaining work. */
static inline void encodeScoreToBytes(double score, unsigned char out[8]) {
    uint64_t bits;
    memcpy(&bits, &score, sizeof(bits));
    if (bits & (1ULL << 63))
        bits = ~bits;               /* negative: flip all bits */
    else
        bits ^= (1ULL << 63);       /* positive: flip only the sign bit */
    bits = __builtin_bswap64(bits); /* byte-swap so memcmp compares the MSB first */
    memcpy(out, &bits, sizeof(bits));
}
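To show how this ties back to the normalized keys described in the recommendation: the 8 encoded score bytes are prepended to the element bytes, and the combined blob is compared with plain memcmp. The helpers below are illustrative only; their names and the caller-provided buffer are not part of the proof-of-concept's API.

/* Illustrative: build a memcmp-sortable key of normalized score followed by element bytes. */
static size_t buildNormalizedKey(double score, const unsigned char *ele,
                                 size_t elen, unsigned char *out) {
    encodeScoreToBytes(score, out);   /* bytes 0..7: normalized score */
    memcpy(out + 8, ele, elen);       /* bytes 8.. : element data */
    return 8 + elen;
}

/* Illustrative ordering check: memcmp on the shared prefix decides; when one
 * key is a prefix of the other, the shorter key sorts first. */
static int normalizedKeyCompare(const unsigned char *a, size_t alen,
                                const unsigned char *b, size_t blen) {
    size_t n = alen < blen ? alen : blen;
    int cmp = memcmp(a, b, n);
    if (cmp != 0) return cmp;
    return (alen > blen) - (alen < blen);
}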
Appendix B: Proof of Concept Code and Benchmarks
proof of concept branch: https://github.com/rainsupreme/valkey/tree/ordered-index
It is not integrated, but you can exercise it through the included unit tests.
Benchmarks implemented using Google Benchmark (which must be separately installed): https://github.com/rainsupreme/valkey/tree/ordered-index-benchmarks
Build the benchmarks with "make valkey-microbench" and run the binary at "./src/valkey-microbench". See the Google Benchmark documentation for filter arguments and other options. Some Python scripts are included to run the benchmarks and generate the plots seen in this report; plot generation may require installing additional Python libraries.