The B-field is a novel, probabilistic data structure for storing key-value pairs (or, said differently, it is a probabilistic associative array or map). B-fields support insertion (insert
) and lookup (get
) operations, and share a number of mathematical and performance properties with the well-known Bloom filter. A B-field lookup will always return the correct value for any inserted key,1 but may return a false positive for keys not in the B-field.
At One Codex, we use the rust-bfield
crate in bioinformatics applications to efficiently store associations between billions of (kmer, value)
pair for up to 100,000 unique taxonomic IDs (distinct values) and a 0.1% error rate. We hope others are able to use this library (or implementations in other languages) for applications in bioinformatics and beyond.
Note: In the Implementation Details section below, we detail the use of this B-field implementation in Rust and use
code
formatting and English parameter names (e.g., we discuss the B-field being a data structure for storing(key, value)
pairs). In the following Formal Data Structure Details section, we detail the design and mechanics of the B-field using mathematical notation (i.e., we discuss it as an associate array mapping a set of$(x, y)$ _pairs). The generated Rust documentation includes both notations for ease of reference.1 The lookup of an inserted key may also return an indeterminate value depending on the B-field configuration, but this is tunable to a ~0 error rate. See below for more details.
A B-field stores a (key, value)
mapping without directly storing either the keys or the values. This permits highly space efficient association of (key, value)
mappings, but comes at the cost of two classes of errors:
- False positives (which we later describe as occurring at a tunable rate of
$\alpha$ ) in which theget
operation for akey
that was not inserted into the B-field returns an incorrectvalue
; and - Indeterminacy errors (later described as occurring at a frequency of
$\beta$ ) in which theget
operation for akey
returns anIndeterminate
instead of a specificvalue
. Note that this error can occur for bothkeys
that were inserted into the B-field as well as those that were not.
Both error categories can be minimized with appropriate parameter selection, and it is trivial to achieve a zero or near-zero indeterminacy error rate (i.e.,
While the space requirements for a B-field depends on the number of discrete values and the desired error rates, the following examples are illustrative use cases:
-
Store 1 billion web URLs and...
-
Assign each of them one of a small number of categories values (n=8) in 2.22Gb (params include ν=8, κ=1, 𝛼=0.1%; 19 bits per element)
-
Store their IPv4 prefix (n=32) in 3.16Gb (ν=24, κ=2, 𝛼=0.1%; 27 bits per element)
-
Store their IP in an
0.0.0.0/8
address block (n=16,777,216) in 8.9Gb (ν=51, κ=6, 𝛼=0.1%; 76 bits per element) -
Determine if they are part of the set (a Bloom filter!) in 1.75Gb (params include ν=1, κ=1, 𝛼=0.1%; 15 bits per element)
We don't estimate space savings over alternative data structures, but any structure storing URLs averaging ~40 bytes per key will several-fold larger than even a B-field associating keys with ~16M distinct values.
-
-
Store 1 billion DNA or RNA k-mers ("ACGTA...") and...
-
Associate them with any of the ~500k bacterial IDs current described by NCBI in 6.93Gb (ν=62, κ=4, 𝛼=0.1%; 59 bits per element)
-
Store which of 1000 datasets they were most frequently observed in using 3.65Gb (ν=46, κ=2, 𝛼=0.1%; 31 bits per element)
-
Quantify their abundance in a dataset(s) (say with ~100 quantized values) in 2.91Gb (ν=15, κ=2, 𝛼=0.1%; 25 bits per element)
Compare this to a naive implementation storing 2-bit encoded 32-mers (or 8 amino acid character) and an associated 32 bit value, requiring 96 bits per element without any indexing or data structure overhead.
-
This implementation provides the following core functionality:
- A B-field can be created with the
create
function (see parameter selection for additional details) and then by iteratively inserting(key, value)
pairs with theinsert
function:
use bfield::BField;
let tmp_dir = tempfile::tempdir().unwrap();
// Create a B-field in a temporary directory
// with a 1M bit array, 10 hash functions (k), a marker
// width of 39 (ν), weight of 4 (κ), 4 secondary arrays, and
// an uncorrected β error of 0.1
let mut bfield: BField<String> = BField::create(
"/tmp/", // directory
"bfield", // filename (used as base filename)
1_000_000, // size (in bits)
10, // n_hashes (k)
39, // marker_width (ν)
4, // n_marker_bits (weight / κ)
0.1, // secondary_scaledown (uncorrected β)
0.025, // max_scaledown
4, // n_secondaries
false, // in_memory
String::new() // other_params
).expect("Failed to build B-field");
// Insert integers 0-10,000 as key-value pairs (10k keys, 10k distinct values)
for p in 0..4u32 {
for i in 0..10_000u32 {
bfield.insert(&i.to_be_bytes().to_vec(), i, p as usize);
}
}
- After creation, a B-field can optionally be loaded from a directory containing the produced
mmap
and related files with theload
function. And once created or loaded, a B-field can be directly queried using theget
function, which will either returnNone
,Indeterminate
, orSome(BFieldValue)
(which is currently an alias forSome(u32)
see limitations below for more details):
use bfield::BField;
// Load based on filename of the first array ".0.bfd"
let bfield: BField<String> = BField::load("/tmp/bfield.0.bfd", true).expect("Failed to load B-field");
for i in 0..10_000u32 {
let value = bfield.get(&i.to_be_bytes().to_vec());
}
Additional documentation can be generated using cargo docs
and is hosted on docs.rs for the latest rust-bfield
release.
This implementation has several current limitations:
-
u32
Values: Currently, this implementation only permits storingu32
values, though those can trivially be mapped to any other arbitrary values, e.g., by using them as indices for an array of mapped values ([value1, value2, value3, ...]
). -
No Parameter Selection Assistance: Currently, the
create
function requires manually specifying all of the B-field parameters. A future interface might automatically (and deterministically) select optimal parameters based on input information about the number of discretevalues
($\theta$ below) and desired false positive and indeterminacy error rates ($\alpha$ and$\beta$ below, respectively). -
No Higher-Level Insertion Management: Because creation of a B-field with no indeterminacy error
$(\beta\approx0)$ requires settingn_secondaries
number of inserts (e.g., ~4), it is necessary to iterate through all inserted elementsn_secondaries
times (see benchmark.rs for a crude example). A higher-level insertion function would take anIterable
data structure and manage performing the proper number of repeated insertions for the end-user.
In contrast to Bloom filters, which only support set membership queries, B-fields enable efficiently storing a set of
B-fields have a number of advantages of other probabilistic and deterministic data structures for associating keys and values. Specifically, B-fields exhibit a unique combination of properties that makes them ideal for storing very large sets of keys mapped to a modest number of discrete values:
-
Space Efficiency: B-fields probabilistically store key-value pairs in a highly space-efficient manner optimized for in-memory use. For many common use cases or configurations, it is possible to store many billions of elements using only tens of gigabytes of storage (or a few bytes per key-value pair). These space requirements scale linearly with the number of elements in the B-field
$n$ (i.e., B-fields have$O(n)$ space complexity). -
Speed: B-fields are optimized for in-memory storage of key-value pairs and require little computational overhead and few memory accesses (implementation speed should resemble that of a Bloom filter). All B-field insert and lookup operations have
$O(1)$ time complexity. -
Well-Defined and Bounded Errors: Three major types of errors are possible with probabilistic associative arrays:
-
False Positives: A false positive occurs when a data structure returns a value for
$y$ when$x\notin S$ (vs. returningNone
or equivalent). B-fields do exhibit false positives at an error rate of$\alpha$ . At the cost of reduced space efficiency, it is possible to select parameters that result in$\alpha$ being arbitrarily small. False positives are a common feature of probabilistic data structures, but a disadvantage of the B-field versus deterministic data structures (e.g., a hash table). -
Indeterminacy: An indeterminacy error, denoted
$\beta$ , is the rate at which a data structure returns two or more possible values$y$ for an element$x \in S$ . This can take the form of a vector of possible$y$ values or an error type (e.g., this implementation returnsIndeterminate
). While the internal arrays that together comprise a B-field each independently exhibit substantial$\beta$ errors, appropriate B-field parameter selection allows$\beta = 0$ or$\beta \approx 0$ . -
Erroneous Retrieval / Mismapping: An erroneous retrieval or mismapping error
$\gamma$ occurs when a data structure is queries for an$x \in S$ and returns the wrong value of$y$ such that$f(x) \ne y$ . Erroneous retrievals are not possible with B-fields.
-
False Positives: A false positive occurs when a data structure returns a value for
⚠️ Warning: This section provides a detailed overview of the math underlying the B-field data structure and is intended for those interested in those mechanics or those implementing a new B-field library. Many readers may prefer to skip ahead to the Summarizing B-field Properties subsection.
The B-field works by encoding
As the first step, the user needs to select several configuration options, which will determinate how the initial B-field is constructed. These key configuration variables are:
(a) The desired maximum false positive rate
(b) The expected number of values to be stored
(c) The maximum value
(d) The value
As the B-field encodes the values
Next, it is necessary to select a size for the primary bit array used in the B-field and select an appropriate number of hash functions
With regards to the selection of the hash function itself, any hash function with good pseudorandomness and distribution qualities will suffice for the B-field. While the prior equations assume a perfectly random hash function, the B-field performs comparably with non-cryptographic hash functions (such as MurmurHash3 or similar hash functions). Again, while the set of hash functions should in theory be fully independent, in practice it is possible (and preferable for performance reasons) to simply use 2 independent hash functions as seeds and then relax the independence requirements for the remaining hash functions, i.e., two seed hash functions
Now having calculated
Next, the keys of the B-field
(a) 0001
if
Figure 1: Sample
(b) 00101
into the
Figure 2: Sample
In order to describe the construction of subsequent arrays
The
A
-
A bit string with a Hamming weight
$\lt \kappa$ indicates$x \notin S$ . This is due to the fact that theINSERT
operation detailed in 4(b) inserts$\kappa$ 1s with a bitwise OR, guaranteeing that at least$\kappa$ 1s will be returned via any subsequent$\mathtt{LOOKUP}$ of the same element. In this case, a B-field implementation should returnNone
or a similar value that represents an$x_{i} \notin S$ . -
A bit string with a Hamming weight
$=\kappa$ . In this case, we$\mathtt{DECODE}$ the resulting bit string (simply the inverse of the$\mathtt{ENCODE}$ operation detailed in Figure 1) and the value$y_{i}$ mapping to the key$x_{i}$ is returned. This operation will erroneously return a$y_{i}$ for an$x \notin S$ at the false positive rate of$\alpha$ (derived below in 5(b)). -
Finally, a bit string with a Hamming weight of
$\gt \kappa$ will be retrieved with a probability of$\beta$ (also derived below in 5(b)). This represents anIndeterminate
result with more than$\kappa$ 1s. The dramatic reduction of$\beta$ is achieved via the construction of subsequent B-field arrays$\mathtt{Array_{1}}...\mathtt{Array_{a-1}}$ .
Figure 3: Sample
As noted above, the
First, we need to determine the probability that any single bit in the
We can then cancel out
Thus, the probability that a single bit is not set for any of the
After
And the probability that it is set to 1 is:
After each of the
Substituting the formulas for optimal
Where
Using
And, correspondingly,
Given these error rates
This secondary array
Further arrays
To summarize, this cascade of secondary arrays beyond
The sum of which is simply:
Thus, the total space requirements for a B-field with a primary
To briefly summarize, a B-field is a probabilistic associative array or map with the following unique set of properties:
-
Creation & Insertions:
$O(n)$ total creation time and$O(1)$ insertions, requiring an average of$\frac{\kappa}{1-\beta}$ hash computations and memory accesses, and$a\kappa$ hash computations and memory accesses in the worst case. A common value of$a$ would be$\approx 4$ . -
Lookups:
$O(1)$ lookups, requiring the same number of hash computations and memory accesses as an insertion. -
Space Complexity:
$O(n)$ space complexity, requiring$\frac{m\kappa}{1-\beta}$ bits where$m$ ,$\kappa$ , and$\beta$ are defined above ($\mathtt{Array_{0}}$ requires$m\kappa$ bits, while the entire B-field requires an additional factor of$1 - \frac{1}{1-\beta}$ ).
An efficient B-field requires optimal selection of
A number of additional extensions to the B-field design are possible, but not implemented here. Several are outlined below:
-
Mutability: While this implementation is designed to support immutable B-fields, it is possible to implement a mutable B-field. For example,
$\mathtt{INSERT}$ operations may be saved and replayed in a continuous or batch fashion to ensure that all elements$x \in S$ are properly inserted into a B-field's primary and secondary arrays. -
Cache Locality: Optimize the speed of memory accesses by setting a subset of the hash functions
$h_{1}(x)...h_{k}(x)$ to perform lookups at a small fixed offset from a prior memory access (i.e., set$h_{2}(x)$ to select a random location within a fixed window after$h_{1}(x)$ in order to take advantage of cache locality). -
Alternative encoding schemes: While the above simple encoding scheme is well suited to relatively well distributed
$y$ values across a domain$\theta$ , it is possible to change the encoding scheme to, for example, (i) use an error-correcting code or (ii) use different weights for the subset of$y$ values that occur more frequently (to have the effect of minimizing the "fill" of the B-field arrays) amongst other possibilities.
- An associative array or map (e.g., a simple hash table) is likely a better choice when storing
(x, y)
pairs with many distincty
values (e.g., storing 1M keys with 800,000 distinct values). See Formal Data Structure Overview and Parameter Selection for further details on optimal use cases for a B-field. - A [minimal] perfect hash function (possibly paired with a Bloom filter or other data structure supporting set membership queries) is a better choice for any injective function mappings, where there is one unique
$y$ value for each$x$ (e.g., de Bruijn graph implementations) - Despite reducing to a Bloom filter when configured with the appropriate parameters, a Bloom filter (or perhaps xor filter) is a better choice than a B-field for supporting simple set membership queries
The B-field data structure was developed by Nick Greenfield (@boydgreenfield) in collaboration with Nik Krumm (@nkrumm) in 2013 as part of a metagenomics classifier developed under DTRA's 2013 Algorithm Challenge (some details on on Jonathan Eisen's blog here, primary content no longer online). After initial proof-of-concept implementations in Python, Cython, and C, a Nim implementation was developed in 2014 by @boydgreenfield and used in production for ~4 years as part of One Codex's core metagenomics classifier (first described here). Dominik Picheta (@dom96) and Alex Bowe (@alexbowe) contributed additional enhancements and ideas to the nim-bfield
implementation in 2015.
In 2017, Roderick Bovee (@bovee) ported the nim-bfield
library to this implementation in Rust, which continues to be used in the second-generation metagenomics classifier backing the One Codex platform. Vincent Prouillet (@Keats) and Gerrit Gerritsen (@GSGerritsen) contributed additional work (and documentation!) since 2020, and maintains the rust-bfield
crate.
In February 2021, One Codex was acquired by Invitae, where this library continued to be used as part of its metagenomics classification platform. In September 2022, the One Codex team spun the company back out of Invitae and decided to open source and make the B-field library freely available. We have done our best to document it here, and hope you find it useful! ❤️
If you find this crate helpful, we'd love to hear from you and document your use case here. Please feel free to open a PR against this section of the README or drop us an email! 👋
This source code is made available under the Apache 2 license with a patent grant (in order to ensure unencumbered use of the rust-bfield
crate given a 2014 patent).