In one experiment, we obtained about a 30% speedup (on a benchmark of larger-than-cache Bloom filters) from using memory prefetching primitives.
This task is to build a production version, rather than just an experiment. It should include functional tests, and should extend the existing macro-benchmark for the Bloom filter to measure any speedup.
In Database.LSMTree.Internal.Lookup there's a bloomQueriesDefault (which uses bloomQueries), which is the simple non-batched version of what we want:
- given a vector of bloom filters;
- given a vector of keys;
- each key is looked up in each filter (so N * M);
- the result is the lookup results where the key is (probably) present.
The result is represented by a vector of the pairs of the filter index and key index from the input vectors. This is a sparse representation of the N * M bit array of results.
We use a sparse representation based on the assumption (for LSMs) that most keys are not present in most filters. The result vector size is not known in advance for certain, but we expect the result size to be proportional to the number of keys. We think that for typical use cases 2*N would be enough to avoid having to grow (and copy) the array. For correctness, however, the array needs to be able to grow.
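To pin down the intended semantics, here is a minimal model of the batched lookup. The types and the `member` function are deliberately simplified stand-ins (not the library's real API): a filter is a single `Word64` bitmask and a key hashes to one bit, which is enough to show the N * M loop and the sparse `(filterIx, keyIx)` result.

```haskell
import Data.Bits (testBit)
import Data.Word (Word64)

-- Simplified stand-ins: a real Bloom filter has k hash functions over a
-- large bit array; here a filter is one Word64 and a key maps to one bit.
type Filter = Word64
type Key    = Int

member :: Key -> Filter -> Bool
member k f = testBit f (k `mod` 64)

-- The simple N*M batched lookup: every key is tested against every
-- filter, and the result is the sparse list of (filterIx, keyIx) pairs
-- for which the key is (probably) present in the filter.
bloomQueriesModel :: [Filter] -> [Key] -> [(Int, Int)]
bloomQueriesModel filters keys =
  [ (fIx, kIx)
  | (fIx, f) <- zip [0 ..] filters
  , (kIx, k) <- zip [0 ..] keys
  , member k f
  ]

main :: IO ()
main = print (bloomQueriesModel [0x1, 0x2] [0, 1])
-- key 0 hits bit 0 (set in filter 0), key 1 hits bit 1 (set in filter 1):
-- [(0,0),(1,1)]
```

In the real implementation the output would be a growable unboxed vector rather than a list, pre-allocated at roughly 2*N pairs as discussed above.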
The initial experiment using prefetching is in branch dcoutts/bloomfilter-optimisation-hacking. This adds optimised versions of bulk lookup, and extends the macrobenchmark. This experimental version uses a single key with a sequence of filters, rather than a sequence of keys with a sequence of filters (as in the current task).
The fastest version in the experiment (with the 30% speedup) works by iterating over the hash number as the outer loop, and iterating over the (remaining) keys as the inner loop. The version in the experiment also uses a dense rather than sparse representation for the result. As a setup phase it calculates the probe index for hash 0 of each key and prefetches the location. On each iteration of the main loop it iterates over two arrays: the array of probe indexes, reading each one, and creating the array of probe indexes for the next iteration.
A version that works with many keys and many filters could work in a similar way, with the outer loop being the hash number. It will have the challenge that for larger lookup batch sizes (e.g. 100+) it will run out of L1 cache, because the "distance" between the memory prefetch and the memory read will be too large, e.g. 100 keys * 20 filters = 2,000 items "in flight" in the CPU caches. It would be better to find an algorithm that has a "prefetch distance" that is independent of the number of keys and can be tuned to typical CPUs, e.g. on the order of 10 -- 100 items in flight.
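The bounded prefetch distance could be structured as a software-pipelined loop: before reading probe i, issue the prefetch for probe i + window, so at most `window` cache lines are in flight regardless of batch size. This is a structural sketch only, with all names illustrative; `prefetch` here is a no-op stand-in for GHC's real prefetch primitives (the `prefetchByteArray0#` family in `GHC.Prim`), and `readProbe` simulates testing a filter bit.

```haskell
-- Illustrative stand-in for issuing a CPU cache prefetch of the memory
-- holding a probe's bit; a real version would use prefetchByteArray0#.
prefetch :: Int -> IO ()
prefetch _ = pure ()

-- Illustrative stand-in for reading the filter bit at a probe index.
readProbe :: Int -> IO Bool
readProbe ix = pure (even ix)

-- Process probes with a fixed prefetch distance: first warm up by
-- prefetching the initial `window` probes, then on each step prefetch
-- the probe `window` positions ahead before reading the current one.
runProbes :: Int -> [Int] -> IO [Bool]
runProbes window probes = do
  mapM_ prefetch (take window probes)  -- warm-up: fill the window
  sequence
    [ do maybe (pure ()) prefetch next -- keep the window full
         readProbe p                   -- this read should now hit cache
    | (p, next) <- zip probes ahead
    ]
  where
    -- pair each probe with the probe `window` ahead of it, if any
    ahead = map Just (drop window probes) ++ repeat Nothing

main :: IO ()
main = runProbes 4 [0 .. 9] >>= print
-- [True,False,True,False,True,False,True,False,True,False]
```

The key property is that the number of outstanding prefetches is bounded by `window` (here 4), which could be tuned to the 10 -- 100 range suggested above, independent of how many keys and filters are in the batch.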