Three probabilistic data structures, built from scratch — massive data, tiny memory.
When you can't afford to store everything, you trade a little accuracy for enormous savings in space. These are the structures behind real databases, caches, and analytics systems — implemented cleanly, with their math, and validated against their theoretical error bounds in the tests.
Zero dependencies. Pure Python.
Membership test with no false negatives and a tunable false-positive rate. Stores nothing but bits.
from sketchbox import BloomFilter
seen = BloomFilter(capacity=1_000_000, error_rate=0.01)
seen.add("user@example.com")
"user@example.com" in seen # True
"nope@example.com" in seen # almost always FalseSizing is derived from the math: m = -n·ln(p)/(ln2)² bits, k = (m/n)·ln2 hashes.
Frequency estimation for a stream in sub-linear space. Never underestimates; overestimates
are bounded by ε·N with probability 1−δ.
from sketchbox import CountMinSketch
cms = CountMinSketch(epsilon=0.001, delta=1e-5)
for word in stream:
cms.add(word)
cms.estimate("the") # ~ true count of "the"Count unique items in ~16 KB regardless of cardinality — millions of uniques, kilobytes of
memory, ~1% error. The algorithm behind COUNT(DISTINCT) at scale.
from sketchbox import HyperLogLog
hll = HyperLogLog(precision=14) # 2^14 registers ≈ 16 KB
for visitor in billions_of_events:
hll.add(visitor)
len(hll) # estimated unique visitors, ~0.8% standard error- Each structure is an explicit space ↔ accuracy trade-off — and the README/tests show the exact bound, not just "it's approximate."
- HyperLogLog uses leading-zero rank + harmonic mean + small-range (linear counting) correction — the parts people gloss over.
- Hashes are derived by seeding a single fast hash (BLAKE2b) — no need for
kindependent hash functions; double hashing and per-row seeds do the job.
pip install sketchbox
python -m pytest # asserts the error bounds actually hold
python examples/demo.py # Bloom vs set memory, CMS heavy hitters, HLL on 1M itemsMIT