Bloom Filter in Python

Introduction

A Bloom Filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It allows false positives but never false negatives. This makes it ideal for applications where memory efficiency is crucial, such as caching, networking, and databases.

This implementation is optimized for configurability and performance, allowing users to specify the expected number of elements and the desired false positive probability.

Theory

A Bloom Filter consists of a bit array of size m and uses k different hash functions. When an element is inserted, all k hash functions generate indices in the bit array, and the corresponding bits are set to 1.

To check if an element is present:

Compute its k hash values.
If all bits at those positions are 1, the element may be present (with a certain probability of false positives).
If at least one bit is 0, the element is definitely not present.
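The insert and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the bloomfilter-lite implementation: the class name `SimpleBloomFilter`, the use of SHA-256, and the double-hashing scheme for deriving the k indices are all assumptions made for the example.

```python
import hashlib


class SimpleBloomFilter:
    """Illustrative Bloom filter (hypothetical; not the package's own class)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                           # number of bits in the array
        self.k = k                           # number of hash functions
        self.bits = bytearray((m + 7) // 8)  # bit array, all zeros initially

    def _indices(self, item: str) -> list[int]:
        # Assumption of this sketch: derive k indices from one SHA-256 digest
        # using double hashing, index_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # make the step odd, hence nonzero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Set the bit at each of the k positions.
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def check(self, item: str) -> bool:
        # True only if every one of the k bits is set.
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(item))


bf = SimpleBloomFilter(m=9586, k=7)
bf.add("hello")
print(bf.check("hello"))  # True: an added element is never reported absent
```

A lookup for an element that was never added usually returns False, but it can return True when other insertions happen to have set all k of its bits. That is the false positive case.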

Mathematical Formulas

Optimal bit array size:
$$m = - \frac{n \ln p}{(\ln 2)^2}$$
where $n$ is the expected number of elements, $p$ is the target false positive rate, and $\ln$ is the natural logarithm.
Optimal number of hash functions:
$$k = \frac{m}{n} \ln 2$$
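As a worked example, the two formulas can be evaluated directly. The helper name `optimal_parameters` is hypothetical, and rounding m up and k to the nearest integer is an assumption of this sketch; the package may round differently.

```python
import math


def optimal_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k) for n expected elements and false positive rate p."""
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest integer (at least 1)
    k = max(1, round((m / n) * math.log(2)))
    return m, k


print(optimal_parameters(1000, 0.01))  # (9586, 7)
```

For 1,000 elements at a 1% false positive rate this gives about 9.6 bits per element and 7 hash functions.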

Installation

To install the Bloom Filter package, run:

pip install bloomfilter-lite

Installation from source

Clone this repository:

git clone https://github.com/lorenzomaiuri-dev/bloomfilter-py.git
cd bloomfilter-py

Create a virtual environment (optional but recommended):

python -m venv venv

source venv/bin/activate  # On Linux/Mac
venv\Scripts\activate  # On Windows

Install the dependencies:
```
pip install -r requirements.txt
```

Usage

Basic Example

from bloomfilter_lite import BloomFilter

# Create a Bloom Filter for 1000 expected elements with a 1% false positive rate
bf = BloomFilter(expected_items=1000, false_positive_rate=0.01)

# Add elements
bf.add("hello")
bf.add("world")

# Check for membership
print(bf.check("hello"))  # True (probably)
print(bf.check("python")) # False (definitely not present)

Benchmark

Performance testing for different dataset sizes:

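| Elements | False Positive Rate | Bit Array Size | Time per Insert (ms) | Time per Lookup (ms) |
|---|---|---|---|---|
| 1,000 | 1% | ~9.6 kbit (~1.2 KB) | 0.01 | 0.008 |
| 10,000 | 1% | ~96 kbit (~12 KB) | 0.015 | 0.010 |
| 100,000 | 1% | ~960 kbit (~120 KB) | 0.020 | 0.012 |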

Reproducing Benchmarks

To verify the benchmarks, run the following script:

python benchmarks/run_benchmark.py

This script tests insertions and lookups for varying dataset sizes and prints the execution time and memory usage.
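For readers who want a feel for how per-operation timings like these can be measured, here is a rough sketch using `time.perf_counter`. It is not the contents of `benchmarks/run_benchmark.py`: the function `time_operations` and the class `SetBackedStandIn` are hypothetical, and the stand-in only mimics the `add()`/`check()` interface so that the example runs without the package installed.

```python
import time


def time_operations(filter_obj, items):
    """Return average seconds per add() and per check() over the given items."""
    start = time.perf_counter()
    for item in items:
        filter_obj.add(item)
    insert_time = (time.perf_counter() - start) / len(items)

    start = time.perf_counter()
    for item in items:
        filter_obj.check(item)
    lookup_time = (time.perf_counter() - start) / len(items)
    return insert_time, lookup_time


class SetBackedStandIn:
    """Hypothetical stand-in with the same add()/check() interface as the filter."""

    def __init__(self):
        self._items = set()

    def add(self, item):
        self._items.add(item)

    def check(self, item):
        return item in self._items


items = [f"item-{i}" for i in range(10_000)]
insert_s, lookup_s = time_operations(SetBackedStandIn(), items)
print(f"insert: {insert_s * 1e3:.6f} ms, lookup: {lookup_s * 1e3:.6f} ms")
```

Passing a `BloomFilter` instance in place of the stand-in should work the same way, since the harness only relies on the `add` and `check` methods shown in the Basic Example. Absolute numbers will vary with hardware and Python version.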

Running Tests

To run the unit tests using pytest:

pytest tests/

Contributing

Contributions are welcome! If you'd like to contribute to this project, please follow these steps:

Fork the repository
Create a new branch (git checkout -b feature/your-feature)
Commit your changes (git commit -am 'Add new feature')
Push the branch (git push origin feature/your-feature)
Open a Pull Request

Please ensure all pull requests pass the existing tests and include new tests for any added functionality.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

