This is an implementation of the EDBT 2023 best paper bloomRF. BloomRF is a probabilistic filter that supports range queries, in addition to point queries. Traditional bloom filters support only point queries.
Probabilistic filters have many applications. One well known application is to LSM-trees. An LSM-tree stores sorted runs of keys in files on disk. A probabilistic filter allows us to ask "are we storing this key on disk?" without requiring us to make the trek to disk (because the compact filter can fit in memory). This is a valuable opimization because disk is slow! EBS gp3 volumes have been measured as having an average latency of roughly 5ms (see here). This is an eternity compared to processor speeds and the speed of memory. BloomRF allows us to ask "are we storing any key in the range 42 to 4711?" and therefore can improve the performance of LSM-Trees for serving range queries. A good overview of LSM-trees can be found here.
This project is a work in progress. In particular, the memory management optimization described in section 7, and also the tuning advisor are not yet implemented.
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build
That builds unit tests, experiments, and the standalone bloomRF library.
To run the unit tests, run ./build/src/test/test_bloomrf
. To run the experiments, run ./build/src/test/experiments
.
BloomRF
supports floats and integers. The interface that BloomRF
supports is BloomRF<T>::add(T key)
,
BloomRF<T>::find(T key)
, and BloomRF<T>::findRange(T low, T high)
.
#include "bloomrf.h"
int main() {
// TODO: simplify parameter tuning. For now, these are reasonable defaults.
// The first parameter is the memory allotment in bytes. For 2000000 keys, this is
// 16-bits per key.
BloomRF<uint64_t> ubf{BloomFilterRFParameters{4000000, 0, {7, 7, 7, 4, 4, 2, 2, 2}}};
// Insert random keys.
for (int i = 0; i < 2000000; ++i) {
// rand is a bad random number generator.
ubf.add(rand());
}
for (int i = 0; i < 100000000; ++i) {
// Do random range queries of size 1e8.
uint64_t low = rand();
uint64_t high = low + 1e8;
bool stored = ubf.findRange(low, high);
if (stored) {
std::cout << "There may be a key in range [" << low << "," << high << "]" << std::endl;
} else {
std::cout << "We have definitely not stored a key in range [" << low << "," << high << "]" << std::endl;
}
}
return 0;
}