BBHash is a simple library for building minimal perfect hash function. It is designed to handle large scale datasets. The function is just a little bit larger than other state-of-the-art libraries, it takes approximately 3 bits / elements (compared to 2.62 bits/elem for the emphf lib), but construction is faster and does not require additional memory.
It is easy to include in other projects (just include a single .h file) and has no dependencies.
A. Limasset, G. Rizk, R. Chikhi, P. Peterlongo, Fast and Scalable Minimal Perfect Hashing for Massive Key Sets, SEA 2017: http://drops.dagstuhl.de/opus/volltexte/2017/7619/pdf/LIPIcs-SEA-2017-25.pdf
@InProceedings{bbhash,
author ={Antoine Limasset and Guillaume Rizk and Rayan Chikhi and Pierre Peterlongo},
title ={{Fast and Scalable Minimal Perfect Hashing for Massive Key Sets}},
booktitle ={16th International Symposium on Experimental Algorithms (SEA 2017)},
pages ={25:1--25:16},
series ={Leibniz International Proceedings in Informatics (LIPIcs)},
year ={2017},
volume ={75},
ISSN ={1868-8969},
publisher ={Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
address ={Dagstuhl, Germany},
URL ={http://drops.dagstuhl.de/opus/volltexte/2017/7619},
doi ={10.4230/LIPIcs.SEA.2017.25}
}
Here is a simple example showing how to build and query a mphf with input keys in a std::vector<uint64_t> . BBHash is mainly designed for de-duplicated input. Keys can be read from a disk file, or from some user-defined iterator.
#include "BooPHF.h"
//tells bbhash to use included hash function working on uint64_t input keys :
typedef boomphf::SingleHashFunctor<uint64_t> hasher_t;
typedef boomphf::mphf< uint64_t, hasher_t > boophf_t;
std::vector<uint64_t> input_keys;
//
... fill the input_keys vector
//build the mphf
boophf_t * bphf = new boomphf::mphf<uint64_t,hasher_t>(input_keys.size(),input_keys,nthreads);
//query the mphf :
uint64_t idx = bphf->lookup(input_keys[0]);
The master branch works with Plain Old Data types only (POD). To work with other types, use the "alltypes" branch (it runs slighlty slower). The alltypes branch includes a sample code with strings. The "internal_hash" branch allows to work with types that do not support copy or assignment operators, at the expense of using 128bits/key in I/O operations regardless of the actual key size. Thus, if your keys are 64 bits integers, "internal_hash" will do twice more I/Os. But if your keys are longer than 128 bits, then "internal_hash" branch will be faster than the master branch.
A sample usage is provided in file example.cpp, compile and run with: ( params are nb_elements nb_threads)
make
./example 100000000 1
File Bootest.cpp contains more options, use ./Bootest with -check to check correctness of the hash function, and -bench to benchmark lookup performance.
Here is a sample output :
./Bootest 100000000 8 1.0 -check -bench
key file generated
Construct a BooPHF with 100000000 elements
for info, total work write each : 3.718 total work inram from level 8 : 9.408 total work raw : 25.000
[Building BooPHF] 100 % elapsed: 0 min 10 sec remaining: 0 min 0 sec
BooPHF constructed perfect hash for 100000000 keys in 10.25s
Bitarray 305808384 bits (100.00 %) (array + ranks )
Last level hash 0 bits (0.00 %) (nb in last level hash 0)
boophf bits/elem : 3.058084
--- boophf working correctly ---
bench lookups sample size 10000000
BBhash bench lookups average 243.84 ns +- stddev 13.01 % (fingerprint 4999580507664480)
If you query a key that is not in the input set of keys, then one of the two following events will happen:
-
BBHash tells you that the key is not in the set (by returning
ULLONG_MAX
), or -
BBHash returns the index of a completely different key.
This is actually the behavior of any minimal perfect hash function, not just BBHash. It is just how they work. Need to increase the probability of situation 1 happening? Increase the gamma
parameter (at the expense of memory usage), or store a signature of your key at the position given by the index. Or to completely prevent situation 2, store the whole key at its index, but this somewhat defeats the purpose of having a minimal perfect hash.
Guillaume Rizk, Antoine Limasset, Rayan Chikhi