Skip to content

Bloom filters using a flawed double hashing scheme #4120

@pdillinger

Description

@pdillinger

Expected behavior

I expected the Bloom filter implementations to have behavior reasonably close to the accuracy predicted by formulas.

Actual behavior

With more thorough testing, the Bloom filters used for summarizing keys in an SST file show compromised accuracy. Fixing some implementation flaws can result in faster and more accurate Bloom filters.

Introduction

From my investigation, RocksDB appears to have three Bloom filter implementations:

  • IMPL1: BloomFilterPolicy::CreateFilter/KeyMayMatch (essentially unchanged from leveldb, for backward compatibility? still used?)
  • IMPL2: FullFilterBitsBuilder/FullFilterBitsReader (modified from that, mostly for CPU cache locality)
  • IMPL3: DynamicBloom (for in-memory only? lacking comments on purpose of yet another implementation)

All of these derive the requisite Bloom filter indices (often 6 of them for ~1% FP @ 10 bits/key) from a single 32-bit hash value using variations on double hashing. The original leveldb implementation cites [Kirsch,Mitzenmacher 2006] ("Less Hashing, Same Performance: Building a Better Bloom Filter") as justification, but that paper is about an asymptotic result, not about real-world performance on Bloom filters that might be medium-sized or small. My publications "Fast and Accurate Bitstate Verification for SPIN" and "Bloom Filters in Probabilistic Verification" established the idea underlying the Kirsch/Mitzenmacher work, along with many practical issues, but my dissertation has the best information/advice for implementing schemes like this. http://peterd.org/pcd-diss.pdf

I investigated the first two implementations above because the third appears to be most careful in converting hash information into indices. Both IMPL1 (BloomFilterPolicy::CreateFilter/KeyMayMatch) and IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader) suffer from two issues I present below, with relatively simple solutions.

By the way, I believe the technique in IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader) to ensure CPU cache locality was first described in "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al. (PDF available online), in what they call "Blocked Bloom Filters" (which is different from what RocksDB calls "Block-based Bloom filter"). I recommend removing the basic mathematical analysis on the "RocksDB Bloom Filter" page of the wiki and just referencing this paper. Because the number of shards can be very high, the variance in number of keys landing in each shard is key to understanding the accuracy. Your implementations show the dramatic speed improvement that is possible, but it should be noted that in a good implementation, the accuracy hit is not lost in noise (as exibited below in detailed results).

First Issue

The first issue is "Issue 1" in section 6.5.1 (printed page 94 / pdf page 116) of http://peterd.org/pcd-diss.pdf, which is a standard problem for double hashing in open-addressed hash tables. Specifically, certain values for 'delta', such as zero, cause only one or few unique indices to be probed repeatedly. In open-addressed hash tables this could lead to infinite loops, but in Bloom filters such cases have a more subtle effect in the false positive rate. Standard solutions ensure that 'delta' is non-zero and relatively prime to the modulus.

If IMPL1 (BloomFilterPolicy::CreateFilter/KeyMayMatch) is to be fixed (backward compatibility?), I recommend switching it to enhanced double hashing to dodge entirely the complications of ensuring relative primality. The change (call it ENH) is simple (applied in two places):

-      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+      uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
       for (size_t j = 0; j < num_probes_; j++) {
+        delta += j; // Enhanced double hashing
         const uint32_t bitpos = h % bits;
         array[bitpos/8] |= (1 << (bitpos % 8));
         h += delta;
       }

This carries a small hit in execution speed (~3% longer running relevant Bloom filter unit test) but has substantial accuracy benefit: just 1.5% of tested Bloom filters having a "mediocre" FP rate (IMPL1+ENH) rather than 31% (IMPL baseline) (based on my enhancements to the unit tests - see below). (More results and justification of speed trade-offs later.)

We can use the ENH change similarly for IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader), but the accuracy benefit is not as significant (1 mediocre FP case for IMPL2+ENH vs. 3 mediocre for IMPL2 baseline). But here we can get a similar accuracy boost with almost no runtime overhead, because the effective modulus (CPU cache line size) is a power of 2. Ensuring 'delta' is odd suffices to ensure unique indices. Call this change ODD (applied in Builder and Reader):

-  const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+  const uint32_t delta = (h >> 17) | (h << 15) | 1;  // Rotate right 17 bits, ensure odd

IMPL2+ODD yields 2 mediocre FP cases.

Second Issue

These implementations only work with 32 bits of hash information and if bits of that data are reused with virtually no mixing, subtle associations between starting index and delta can affect accuracy. This appears to be fixable by throwing in one or two XOR operations between the existing modulus operations. IMPL1+XOR (I didn't put any engineering into the specific value used here):

       uint32_t h = hash_func_(keys[i]);
       const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+      h ^= 0x6740bca3; // random mixing
       for (size_t j = 0; j < num_probes_; j++) {
         const uint32_t bitpos = h % bits;

This carries a minimal speed hit (< 1%) but is itself sufficient to lower mediocre Bloom filters in the unit test from 31% (IMPL1 baseline) to 9% (IMPL1+XOR). Combined with the previous change we get no mediocre Bloom filters (0% for IMPL1+ENH+XOR).

IMPL2+XOR inserts two operations, because parts of the hash value are used three times:

   uint32_t h = hash;
   const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+  h ^= 0x6740bca3; // random mixing
   uint32_t b = (h % num_lines) * (cache_line_size * 8);
+  h ^= 0x41fc0926; // random mixing
...
     const uint32_t bitpos = b + (h % (cache_line_size * 8));

Obviously we could get the best accuracy with more significant mixing and/or with (re-)distributing parts of the hash over subsequent indices (like IMPL3), but I'm trying to close the accuracy gap with minimal additional running time.

IMPL2+XOR carries a modest speed hit (~3%) with an accuracy benefit similar to IMPL2+ENH, but combining IMPL2+ODD+XOR works better, with no mediocre FP Bloom filters and the same modest speed hit (~3%). IMPL2+ENH+XOR is slightly more accurate (results below) with about a 5% speed hit vs. IMPL2 baseline.

Accuracy vs. speed

You might say "ok, an improvement in accuracy is nice, but not if it's slower in unit tests." First of all, the purpose of these Bloom filters is to skip costly disk reads, so depeding on a specific deployment (I don't have a typical deployment handy, just an SSD laptop) a small improvement in Bloom filter accuracy could pay for a small regression in speed.

But even more generally, Bloom filters have a built-in way of trading accuracy for speed: the number of bits set per index. Manipulating this parameter might enable finding a configuration incorporating a new change that is at least as good in all dimensions (speed, accuracy, memory use) as the old configuration and better in at least one dimension.

So I tested the new configurations, IMPL1+ENH+XOR, IMPL2+ODD+XOR, and IMPL2+ENH+XOR with setting 5 bits per index instead of 6, keeping the same memory setting. All three modified implementations beat the baseline in accuracy with the same or better speed. (Details below.) Thus, the accuracy boosts are worth the runtime overhead, as you can negate the runtime overhead with remaining net accuracy benefit by setting fewer indices, at least in the common 5 vs. 6 case.

Detailed results

Working from commit 926f3a7, I extended bloom_test.cc to do more thorough testing. In short, Bloom filters are tested up to 10M keys, with more independence between testing of each filter, and more statistics gathered and printed.

The existing IMPL1 fares poorly with this more thorough testing, and IMPL2 fares not super well. (For some tests to finish, I also had to disable the limit of 2% FP rate.)

Running times are shortest of 2-3 runs for VaryingLength unit test. Mediocre filters are any with FP rate over 1.25%, out of 64 differently-sized Bloom filters. Average FP rate excludes lengths < 1000 because they sometimes have much more than 10 bits / key and are not strong samples.

Configuration  | Time (% vs base) | Mediocre | Avg FP rate | Max FP rate
-------------------------------------------------------------------
IMPL1 base     | 32283 ms         | 20       | 1.16%       | 1.97%
IMPL1+ENH      | 33155 ms (+2.7%) | 1        | 1.06%       | 1.35%
IMPL1+XOR      | 32459 ms (+0.5%) | 6        | 0.85%       | 2.06%
IMPL1+ENH+XOR  | 33239 ms (+2.9%) | 0        | 0.86%       | 1.00%
IMPL1+ENH+XOR@5| 30256 ms (-6.3%) | 0        | 0.96%       | 1.08%
Theoretical    |                  |          | 0.84%       |
Configuration  | Time (% vs base) | Mediocre | Avg FP rate | Max FP rate
-------------------------------------------------------------------
IMPL2 base     | 16542 ms         | 3        | 1.13%       | 1.51%
IMPL2+ODD      | 16642 ms (+0.6%) | 2        | 1.07%       | 1.42%
IMPL2+ENH      | 17268 ms (+4.3%) | 1        | 0.96%       | 1.27%
IMPL2+XOR      | 16996 ms (+2.7%) | 2        | 1.14%       | 1.26%
IMPL2+ODD+XOR  | 16830 ms (+1.7%) | 0        | 1.06%       | 1.24%
IMPL2+ENH+XOR  | 17445 ms (+5.4%) | 0        | 0.95%       | 1.05%
IMPL2+ODD+XOR@5| 16386 ms (-0.9%) | 1        | 1.10%       | 1.28%
IMPL2+ENH+XOR@5| 16522 ms (-0.1%) | 0        | 1.03%       | 1.17%

The theoretical entry comes from standard FP rate formulas for Bloom filters, assuming uniform random hash functions. The same theoretical ideal FP applies to IMPL2 except that the sharding for CPU cache locality affects it somewhat. I don't have a simple formula handy to compute a theoretical prediction with that (assuming quality hashing and index computation).

Tests were run on a Debian 8 laptop with a Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz and 8GB ram.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions