diff --git a/README.md b/README.md
index 7ed54a7..8c9fabd 100644
--- a/README.md
+++ b/README.md
@@ -4,73 +4,45 @@
 
 ## Overview
 
-Cuckoo Index (CI), formerly Cuckoo Lookup Table (CLT), is a lightweight
-secondary index structure that represents the many-to-many relationship between
-keys and stripes (chunks of columns) in a highly space-efficient way. At its
-core, CI associates variable-sized fingerprints in a Cuckoo filter [1] with
-compressed bitmaps indicating qualifying stripes.
+[Cuckoo Index](https://www.vldb.org/pvldb/vol13/p3559-kipf.pdf) (CI) is a lightweight secondary index structure that represents the many-to-many relationship between keys and partitions of columns in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a [Cuckoo filter](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) with compressed bitmaps indicating qualifying partitions.
 
 ## What Problem Does It Solve?
 
-The problem of finding all stripes that possibly contain a given lookup key is
-traditionally solved by maintaining one filter (e.g., a Bloom filter) per stripe
-that indexes all unique key values contained in this stripe:
+The problem of finding all partitions that possibly contain a given lookup key is traditionally solved by maintaining one filter (e.g., a Bloom filter) per partition that indexes all unique key values contained in this partition:
 
 ```
-Stripe 0:
-Keys: A, B => Bloom filter 0 (1% false positive rate)
-
-Stripe 1:
-Keys: B, C => Bloom filter 1 (1% false positive rate)
+Partition 0:
+A, B => Bloom filter 0
+Partition 1:
+B, C => Bloom filter 1
 ...
 ```
 
-To identify all stripes that contain a key, we probe all per-stripe filters
-(which could be many!) to derive a bitmap of qualifying stripes. Since a Bloom
-filter may return false positives, there is a chance (of e.g. 1%) that we
-accidentally identify a stripe as a false positive. In the above example, a
-lookup for key A may return Stripe 0 (true positive) and 1 (false positive).
-Depending on the storage medium, a false positive stripe can be very expensive
-(e.g., many milliseconds on disk).
+To identify all partitions containing a key, we need to probe all per-partition filters (which could be many). Since a Bloom filter may return false positives, there is a chance (of e.g. 1%) that we accidentally identify a negative partition as positive. In the above example, a lookup for key A may return Partition 0 (true positive) and 1 (false positive). Depending on the storage medium, a false positive partition can be very expensive (e.g., many milliseconds on disk).
 
-Besides this problem of false positive stripes (even for occurring keys such as
-A!), secondary columns typically contain many duplicates (even across stripes).
-With the per-stripe filter design, these duplicates may be indexed in multiple
-filters (in the worst case, in all filters!). In the above example, the key B is
-redundantly indexed in Bloom filter 0 and 1.
+Furthermore, secondary columns typically contain many duplicates (also across partitions). With the per-partition filter design, these duplicates may be indexed in multiple filters (in the worst case, in all filters). In the above example, the key B is redundantly indexed in Bloom filter 0 and 1.
 
-Cuckoo Index addresses both of these drawbacks of per-stripe filters.
+Cuckoo Index addresses both of these drawbacks of per-partition filters.
 
 ## Features
 
-* 100% correct results for lookups with occurring keys (as opposed to
-  traditional per-stripe filters)
-* Configurable scan rate (ratio of false positive stripes) for lookups with
-  non-occurring keys
-* Much smaller footprint size than full-fledged indexes that store full-sized
-  keys at the cost of false positive stripes for lookups with non-occurring
-  keys
-* Smaller footprint size than per-stripe filters for low-to-medium cardinality
-  columns
+* 100% correct results for lookups with occurring keys (as opposed to per-partition filters).
+* Configurable scan rate (ratio of false positive partitions) for lookups with non-occurring keys.
+* Much smaller footprint size than full-fledged indexes that store full-sized keys.
+* Smaller footprint size than per-partition filters for low-to-medium cardinality columns.
 
 ## Limitations
 
-* Requires access to all keys at build time
-* Relatively high build time (in O(n) but with a high constant factor)
-  compared to e.g. per-stripe Bloom filters
-* Once built, CI is immutable and will be fast to query (the current
-  implementation lacks a rank support structure [2] that is required for
-  efficient lookups)
+* Requires access to all keys at build time.
+* Relatively high build time (in O(n) but with a high constant factor) compared to e.g. per-partition Bloom filters.
+* Once built, CI is immutable but fast to query (it uses a [rank support structure](https://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf) for efficient rank calls).
 
-## Running experiments
+## Running Experiments
 
-Prepare a data set in a CSV format that you are going to use. One of the data
-sets we used was the DMV
-[Vehicle, Snowmobile, and Boat Registrations](https://catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations).
+Prepare a dataset in a CSV format that you are going to use. One of the datasets we used was DMV [Vehicle, Snowmobile, and Boat Registrations](https://catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations).
 
-For footprint experiments, run the following command, specifying the path to the
-data file, columns to test and the tests to run.
+For footprint experiments, run the following command, specifying the path to the data file, columns to test, and the tests to run.
 
 ```
 bazel run -c opt --cxxopt="-std=c++17" :evaluate -- \
@@ -80,11 +52,9 @@ bazel run -c opt --cxxopt="-std=c++17" :evaluate -- \
   --output_csv_path="results.csv"
 ```
 
-For lookup performance experiments, run the following command, specifying the
-path to the the data file and columns to test.
+For lookup performance experiments, run the following command, specifying the path to the data file, and columns to test.
 
-**NOTE** You might want to use fewer rows for lookup experiments as the
-benchmarks are quite time-consuming.
+**NOTE** You might want to use fewer rows for lookup experiments as the benchmarks are quite time-consuming.
 
 ```
 bazel run -c opt --cxxopt='-std=c++17' --dynamic_mode=off :lookup_benchmark -- \
@@ -96,53 +66,38 @@ bazel run -c opt --cxxopt='-std=c++17' --dynamic_mode=off :lookup_benchmark -- \
 
 #### Evaluation Framework
 
-* Evaluate (evaluate.h)
-
-  Entry point (binary) into our evaluation framework with instantiations of
-  all indexes
-
-* Evaluator (evaluator.h)
-
-  Evaluation framework
-
-* Table/Column (data.h)
-
-  Integer columns that we run the benchmarks on (string columns are
-  dict-encoded)
-
-* IndexStructure (index_structure.h)
-
-  Interface shared among all indexes
+* Evaluate (evaluate.h): *Entry point (binary) into our evaluation framework with instantiations of all indexes.*
+* Evaluator (evaluator.h): *Evaluation framework.*
+* Table/Column (data.h): *Integer columns that we run the benchmarks on (string columns are dict-encoded).*
+* IndexStructure (index_structure.h): *Interface shared among all indexes.*
 
 #### Cuckoo Index
 
-* CuckooIndex (cuckoo_index.h)
-
-  Main class of Cuckoo Index
-
-* CuckooKicker (cuckoo_kicker.h)
+* CuckooIndex (cuckoo_index.h): *Main class of Cuckoo Index.*
+* CuckooKicker (cuckoo_kicker.h): *A heuristic that finds a close-to-optimal assignment of keys to buckets (in terms of the ratio of items residing in primary buckets).*
+* FingerprintStore (fingerprint_store.h): *Stores variable-sized fingerprints in bitpacket format.*
+* RleBitmap (rle_bitmap.h): *An RLE-based (bitwise, unaligned) bitmap representation (for sparse bitmaps we use position lists).*
+* BitPackedReader (bit_packing.h): *A helper class for storing & retrieving bitpacked data.*
 
-  A heuristic that finds a close-to-optimal assignment of keys to buckets (in
-  terms of the ratio of items residing in primary buckets)
+## Cite
 
-* FingerprintStore (fingerprint_store.h)
+Please cite our [VLDB 2020 paper](https://www.vldb.org/pvldb/vol13/p3559-kipf.pdf) if you use this code in your own work:
 
-  Stores variable-sized fingerprints in bitpacket format
-
-* RleBitmap (rle_bitmap.h)
-
-  An RLE-based (bitwise, unaligned) bitmap representation (for sparse bitmaps
-  we use position lists)
-
-* BitPackedReader (bit_packing.h)
-
-  A helper class for storing & retrieving bitpacked data
-
-## References
-
-[1]
-[Fan et al., Cuckoo Filter: Practically Better Than Bloom, 2014](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf)
-
-[2] [Zhou et al., Space-Efficient, High-Performance Rank & Select Structures on
-Uncompressed Bit Sequences,
-2013](https://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf)
+```
+@article{cuckoo-index,
+author = {Kipf, Andreas and Chromejko, Damian and Hall, Alexander and Boncz, Peter and Andersen, David},
+title = {Cuckoo Index: A Lightweight Secondary Index Structure},
+year = {2020},
+issue_date = {September 2020},
+publisher = {VLDB Endowment},
+volume = {13},
+number = {13},
+issn = {2150-8097},
+url = {https://doi.org/10.14778/3424573.3424577},
+doi = {10.14778/3424573.3424577},
+journal = {Proc. VLDB Endow.},
+month = sep,
+pages = {3559-3572},
+numpages = {14}
+}
+```
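
For readers of this patch, the "What Problem Does It Solve?" text in the README can be illustrated with a small sketch. It is not code from this repository: `ToyBloomFilter`, `BuildPerPartitionFilters`, `ProbeAllFilters`, and `BuildKeyToBitmap` are hypothetical names, and the filter size and hash count are arbitrary assumptions. The first half mimics the per-partition filter baseline (every lookup probes every filter, and the duplicate key B is indexed twice). The second half shows only the logical key-to-bitmap mapping that the README says CI represents, using full keys and uncompressed bitmaps rather than fingerprints in a Cuckoo filter with compressed bitmaps.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// One entry per partition, each holding that partition's keys.
using Partitions = std::vector<std::vector<std::string>>;

// Hypothetical toy Bloom filter, for illustration only.
class ToyBloomFilter {
 public:
  ToyBloomFilter(std::size_t num_bits, std::size_t num_hashes)
      : bits_(num_bits, false), num_hashes_(num_hashes) {
    assert(num_bits > 0 && num_hashes > 0);
  }

  void Insert(const std::string& key) {
    for (std::size_t i = 0; i < num_hashes_; ++i) bits_[Position(key, i)] = true;
  }

  // May return true for a key that was never inserted (false positive),
  // but never returns false for an inserted key.
  bool MayContain(const std::string& key) const {
    for (std::size_t i = 0; i < num_hashes_; ++i) {
      if (!bits_[Position(key, i)]) return false;
    }
    return true;
  }

 private:
  // Double hashing: position_i = (h1 + i * h2) mod num_bits.
  std::size_t Position(const std::string& key, std::size_t i) const {
    const std::size_t h1 = std::hash<std::string>{}(key);
    const std::size_t h2 = std::hash<std::string>{}(key + "#salt") | 1;
    return (h1 + i * h2) % bits_.size();
  }

  std::vector<bool> bits_;
  std::size_t num_hashes_;
};

// Baseline: one filter per partition. A key occurring in several
// partitions is inserted into several filters.
std::vector<ToyBloomFilter> BuildPerPartitionFilters(const Partitions& partitions) {
  std::vector<ToyBloomFilter> filters;
  for (const auto& keys : partitions) {
    ToyBloomFilter filter(/*num_bits=*/64, /*num_hashes=*/3);  // Assumed sizes.
    for (const auto& key : keys) filter.Insert(key);
    filters.push_back(filter);
  }
  return filters;
}

// A lookup has to probe every filter to derive the bitmap of
// (possibly false positive) qualifying partitions.
std::vector<bool> ProbeAllFilters(const std::vector<ToyBloomFilter>& filters,
                                  const std::string& key) {
  std::vector<bool> qualifying;
  for (const auto& filter : filters) qualifying.push_back(filter.MayContain(key));
  return qualifying;
}

// Logical view only: each distinct key is stored once and maps to a bitmap
// with one bit per partition. Full keys and plain bitmaps are a
// simplification of what the README describes.
std::unordered_map<std::string, std::vector<bool>> BuildKeyToBitmap(
    const Partitions& partitions) {
  std::unordered_map<std::string, std::vector<bool>> index;
  for (std::size_t p = 0; p < partitions.size(); ++p) {
    for (const auto& key : partitions[p]) {
      auto it = index.try_emplace(key, partitions.size(), false).first;
      it->second[p] = true;
    }
  }
  return index;
}
```

With the README's example (`{{"A", "B"}, {"B", "C"}}`), the baseline inserts B into both filters, while the key-to-bitmap map holds B once with both partition bits set. Because full keys are stored here, a key that occurs nowhere is simply absent from the map; the real structure trades that exactness for space by storing fingerprints.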