This project demonstrates how to estimate the cardinality (i.e., the number of unique elements) of dataset columns using the HyperLogLog (HLL) algorithm with parallel processing for improved performance.

📄 Read the full research documentation here: Parallelizing HyperLogLog for Scalable Cardinality Estimation
The example dataset is a transactional dataset containing 541,909 entries across 8 columns, including `InvoiceNo`, `StockCode`, `Quantity`, `InvoiceDate`, `UnitPrice`, `CustomerID`, and `Country`. The dataset is first cleaned by removing duplicate and missing rows, and numeric columns such as `Quantity` and `UnitPrice` are standardized using z-score normalization.
The script performs exploratory data analysis by visualizing the most frequent countries and aggregated customer purchase behaviors. It then estimates the number of unique elements in selected columns using HyperLogLog sketches, which are computed in parallel with Python's `multiprocessing` module for faster execution.
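The core idea can be sketched as follows. This is a minimal illustration rather than the repository's exact code: it assumes the `hyperloglog` package's `HyperLogLog(error_rate)` / `add()` / `update()` interface, and the `estimate_cardinality()` helper is a hypothetical name introduced here.

```python
import multiprocessing as mp

import hyperloglog
import numpy as np
import pandas as pd

def build_sketch(values):
    """Build a partial HyperLogLog sketch from one partition of a column."""
    hll = hyperloglog.HyperLogLog(0.01)  # target ~1% relative error
    for v in values:
        hll.add(str(v))
    return hll

def estimate_cardinality(column, n_processes=4):
    """Partition a column, sketch each partition in parallel, merge, and return the estimate."""
    chunks = np.array_split(column.to_numpy(), n_processes)
    with mp.Pool(n_processes) as pool:
        partials = pool.map(build_sketch, chunks)
    merged = partials[0]
    for sketch in partials[1:]:
        merged.update(sketch)  # HLL merge: register-wise maximum
    return len(merged)

if __name__ == "__main__":
    df = pd.read_csv("data.csv").drop_duplicates().dropna()
    for col in ["InvoiceNo", "StockCode", "CustomerID", "Country"]:
        print(col, "estimated:", estimate_cardinality(df[col]), "actual:", df[col].nunique())
```

Because merging HLL registers by taking their maxima gives the same sketch as processing the whole column at once, each worker only needs to see its own partition, which is what makes the split-and-merge parallelization work.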
While this example uses a transactional dataset, the approach can be applied to any mid- to large-sized dataset where estimating unique counts efficiently is required.
- Data cleaning: removing duplicates and missing values
- Standardizing numeric columns (`Quantity` and `UnitPrice`)
- Visualizing dataset statistics and aggregated customer data (see the plotting sketch after this list)
- Estimating cardinality with HyperLogLog using multiple processes
- Comparing performance with different numbers of processes
- Comparing estimated cardinality against actual unique counts
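The exploratory plots can be reproduced along these lines. This is a minimal sketch assuming pandas and matplotlib, with the aggregation choices (counts, sums, means) taken from the outputs listed below rather than from the script itself; `customer_stats` is a hypothetical name.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv").drop_duplicates().dropna()

# Frequency of the top countries in the dataset
df["Country"].value_counts().head(10).plot(kind="bar", title="Top countries by transaction count")
plt.tight_layout()
plt.show()

# Aggregated customer data: StockCode counts, Quantity sums, UnitPrice means per customer
customer_stats = df.groupby("CustomerID").agg(
    stockcode_count=("StockCode", "count"),
    quantity_sum=("Quantity", "sum"),
    unitprice_mean=("UnitPrice", "mean"),
)
print(customer_stats.head())
```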
- Python 3.7+
- Required libraries:
- pandas
- numpy
- matplotlib
- scikit-learn
- hyperloglog
📦 Install dependencies via pip:
pip install pandas numpy matplotlib scikit-learn hyperloglog
- Download the dataset file `data.csv` from this link and place it in the same directory as the script: Dataset on Google Drive
- Clone the repository:
git clone https://github.com/20101301-Alina-Hasan/Parallel-HyperLogLog-Cardinality-Estimation.git
cd Parallel-HyperLogLog
- Run the script:
python parallel-hyperloglog.py
- The script will output dataset info, statistics, plots, and cardinality estimation results, and save the processed data as `updated_dataset.csv`
- The dataset is cleaned by dropping duplicates and missing values.
- Numeric columns are standardized using z-score normalization.
- Data is partitioned, and each partition is processed in parallel to build partial HyperLogLog sketches.
- Partial sketches are merged to produce an overall cardinality estimate.
- Execution time is measured for different numbers of processes to demonstrate speedup (see the timing sketch after this list).
- Results are visualized and printed for easy comparison.
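A rough, self-contained sketch of the timing comparison is shown below. The helper mirrors the earlier example, and the process counts and `timed_estimate()` name are illustrative assumptions rather than the script's exact settings.

```python
import time
import multiprocessing as mp

import hyperloglog
import numpy as np
import pandas as pd

def build_sketch(values):
    """Build a partial HLL sketch (~1% target error) for one chunk of a column."""
    hll = hyperloglog.HyperLogLog(0.01)
    for v in values:
        hll.add(str(v))
    return hll

def timed_estimate(column, n_processes):
    """Return (estimate, seconds) for one column using n_processes workers."""
    start = time.perf_counter()
    chunks = np.array_split(column.to_numpy(), n_processes)
    with mp.Pool(n_processes) as pool:
        partials = pool.map(build_sketch, chunks)
    merged = partials[0]
    for sketch in partials[1:]:
        merged.update(sketch)  # merge by taking register-wise maxima
    return len(merged), time.perf_counter() - start

if __name__ == "__main__":
    df = pd.read_csv("data.csv").drop_duplicates().dropna()
    for n in (1, 2, 4, 8):  # illustrative process counts
        est, secs = timed_estimate(df["InvoiceNo"], n)
        print(f"{n} process(es): ~{est} unique InvoiceNo values in {secs:.2f}s")
```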
- Frequency of top countries in the dataset
- Aggregated customer data trends (StockCode counts, Quantity sums, UnitPrice means)
- Time taken to estimate cardinality vs number of processes
- Bar plots comparing actual unique counts to estimated cardinalities
- Scaled absolute difference between actual and estimated cardinalities per column
- The HyperLogLog error rate is set to 1% (`error_rate=0.01`).
- Parallelization improves performance on large datasets, but overhead may limit gains beyond a certain number of processes.
Contributions are welcome! If you have suggestions or want to collaborate, feel free to open an issue or pull request.
For questions or inquiries, reach out via alina.hasan@g.bracu.ac.bd
Feel free to ⭐ the repo if you find it useful.