🔗 SQL Identity Resolution

Production-grade deterministic identity resolution for modern data warehouses. Unify customer identities across CRM, transactions, web events, and loyalty data—no ML required.

⚡ 60-Second Demo

# Clone and run
git clone https://github.com/anilkulkarni87/sql-identity-resolution.git
cd sql-identity-resolution
make demo

That's it! Open demo_results.html to see clustered identities.

🐳 Docker one-liner (no Python required)

docker run -it --rm -v $(pwd)/output:/output ghcr.io/anilkulkarni87/sql-identity-resolution:demo

🎯 Why SQL Identity Resolution?

Challenge	Our Solution
Expensive CDPs	Open source, runs on your warehouse
Black-box ML	Deterministic rules, fully auditable
Vendor lock-in	Same logic across 4 platforms
Scale limits	Tested to 100M+ rows

How We Compare

	vs CDPs	vs ML-based (Zingg, Dedupe)	vs dbt packages
Cost	Free vs $5K-50K/mo	No Spark cluster needed	More complete pipeline
Control	No vendor lock-in	No ML training required	Production-hardened
Transparency	Full auditability	Deterministic output	Multi-platform

Who Is This For?

🏢 SMBs wanting customer 360 without CDP costs
🔧 Data engineers building composable CDPs
📊 Analysts who prefer SQL over Python/Spark
⚖️ Compliance teams needing auditable matching logic

🏗️ Supported Platforms

Platform	Status	Quickstart
DuckDB	✅ Full	`make demo` (local)
Snowflake	✅ Full	`CALL idr_run('FULL', 30, FALSE);`
BigQuery	✅ Full	`python sql/bigquery/idr_run.py --project=...`
Databricks	✅ Full	Run `IDR_QuickStart.py` notebook

✨ Key Features

🎯 Cluster Confidence Scoring - Quality score (0-1) for each cluster based on edge diversity and match density
🔒 Dry Run Mode - Preview changes before committing
📊 Metrics Export - Prometheus, DataDog, webhook support
🛡️ Data Quality Controls - max_group_size, exclusion lists
📈 Incremental Processing - Watermark-based efficiency
🔍 Full Audit Trail - Every decision is traceable
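
As a rough illustration of the data quality controls listed above, the sketch below drops excluded identifier values and skips identifier groups larger than `max_group_size` before any matching happens. Only the `max_group_size` name comes from this README. The function name, record layout, and sample values are assumptions made for the example, not the project's actual implementation.

```python
from collections import defaultdict


def filter_identifier_groups(records, max_group_size, excluded_values):
    """Group entities by identifier, dropping excluded values and oversized groups.

    records: iterable of (entity_id, identifier_type, identifier_value) tuples
             (a hypothetical layout chosen for this sketch).
    Returns (kept, skipped): kept maps identifier -> set of entity ids,
    skipped maps identifier -> size of the group that was too large.
    """
    groups = defaultdict(set)
    for entity_id, id_type, value in records:
        # Exclusion list: placeholder values should never link entities.
        if (id_type, value) in excluded_values:
            continue
        groups[(id_type, value)].add(entity_id)

    kept = {k: v for k, v in groups.items() if len(v) <= max_group_size}
    skipped = {k: len(v) for k, v in groups.items() if len(v) > max_group_size}
    return kept, skipped


records = [
    ("crm:1", "email", "noreply@example.com"),
    ("crm:2", "email", "noreply@example.com"),
    ("crm:1", "phone", "555-0100"),
    ("pos:7", "phone", "555-0100"),
    ("crm:3", "phone", "000-0000"),
    ("crm:4", "phone", "000-0000"),
    ("crm:5", "phone", "000-0000"),
    ("crm:6", "phone", "000-0000"),
]
excluded = {("email", "noreply@example.com")}

kept, skipped = filter_identifier_groups(
    records, max_group_size=3, excluded_values=excluded
)
```

Here the shared `noreply@example.com` address never forms a group, and the four entities sharing `000-0000` are reported in `skipped` rather than merged into one cluster.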

📊 Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │────▶│  Configure  │────▶│  IDR Run    │────▶│   Output    │
│             │     │             │     │             │     │             │
│ • CRM       │     │ • Rules     │     │ • Extract   │     │ • Clusters  │
│ • POS       │     │ • Mappings  │     │ • Match     │     │ • Profiles  │
│ • Web       │     │ • Sources   │     │ • Cluster   │     │ • Metrics   │
│ • Mobile    │     │             │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘

4 Steps:

1. Configure - Register sources and identifier mappings
2. Extract - Pull identifiers (email, phone, loyalty ID)
3. Match - Build edges between entities sharing identifiers
4. Cluster - Label propagation to find connected components
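
The Match and Cluster steps above can be sketched in plain Python. This is a minimal in-memory illustration of building edges from shared identifiers and propagating the smallest label until nothing changes. The project itself runs these steps as SQL in the warehouse, so the function names, record layout, and iteration cap here are assumptions for the example only.

```python
from collections import defaultdict


def build_edges(records):
    """Match: link entities that share an (identifier_type, value) pair.

    records: iterable of (entity_id, identifier_type, identifier_value) tuples
             (a hypothetical layout chosen for this sketch).
    """
    by_identifier = defaultdict(list)
    for entity_id, id_type, value in records:
        by_identifier[(id_type, value)].append(entity_id)

    edges = set()
    for members in by_identifier.values():
        anchor = min(members)  # link every member to the smallest id in the group
        for member in members:
            if member != anchor:
                edges.add((anchor, member))
    return edges


def label_propagation(entities, edges, max_iters=50):
    """Cluster: each entity repeatedly adopts the smallest label among its neighbours."""
    label = {e: e for e in entities}
    for _ in range(max_iters):  # arbitrary safety cap for this sketch
        changed = False
        for a, b in edges:
            low = min(label[a], label[b])
            if label[a] != low:
                label[a] = low
                changed = True
            if label[b] != low:
                label[b] = low
                changed = True
        if not changed:  # converged: labels now identify connected components
            break
    return label


records = [
    ("crm:1", "email", "a@x.com"),
    ("pos:7", "email", "a@x.com"),
    ("pos:7", "phone", "555-0100"),
    ("web:3", "phone", "555-0100"),
    ("crm:2", "email", "b@y.com"),
]
entities = sorted({entity_id for entity_id, _, _ in records})
edges = build_edges(records)
labels = label_propagation(entities, edges)
```

In this sample, `crm:1`, `pos:7`, and `web:3` end up with the same label because they are connected through a shared email and a shared phone, while `crm:2` stays in its own cluster.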

🚀 Getting Started

Option 1: Local Demo (DuckDB)

make demo

Option 2: Platform-Specific

Snowflake

-- 1. Create objects (run from SnowSQL, or paste the file into a worksheet)
!source sql/snowflake/00_ddl_all.sql

-- 2. Configure and run
CALL idr_run('FULL', 30, FALSE);  -- FALSE = live run
CALL idr_run('FULL', 30, TRUE);   -- TRUE = dry run (preview)
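
Conceptually, a dry run computes the proposed cluster assignments and reports what would change without writing anything. The Python sketch below illustrates that idea only. The function names and the dictionary-based membership layout are assumptions for the example and do not describe how `idr_run` is implemented.

```python
def preview_changes(current, proposed):
    """Return {entity_id: (old_cluster, new_cluster)} for entities that would move."""
    changes = {}
    for entity_id, new_cluster in proposed.items():
        old_cluster = current.get(entity_id)  # None means the entity is new
        if old_cluster != new_cluster:
            changes[entity_id] = (old_cluster, new_cluster)
    return changes


def run(current, proposed, dry_run):
    """Report changes; only commit them to `current` when dry_run is False."""
    changes = preview_changes(current, proposed)
    if not dry_run:
        current.update(proposed)
    return changes


current = {"crm:1": "crm:1", "pos:7": "pos:7"}
proposed = {"crm:1": "crm:1", "pos:7": "crm:1", "web:3": "crm:1"}

# Dry run: inspect the changes, leave the current membership untouched.
changes = run(current, proposed, dry_run=True)
```

Calling `run(current, proposed, dry_run=False)` afterwards would apply the same changes to `current`.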

BigQuery

# 1. Setup
bq query --use_legacy_sql=false < sql/bigquery/00_ddl_all.sql

# 2. Run
pip install google-cloud-bigquery
python sql/bigquery/idr_run.py --project=your-project --run-mode=FULL

Databricks

1. Import sql/databricks/notebooks/IDR_QuickStart.py
2. Run all cells
3. Check idr_out.identity_resolved_membership_current

dbt Package

# packages.yml
packages:
  - git: "https://github.com/anilkulkarni87/sql-identity-resolution"
    subdirectory: "dbt_idr"

dbt deps
dbt seed --select dbt_idr
dbt run --select dbt_idr

📖 dbt Package Docs

📖 Documentation

📚 Full Documentation

Guide	Description
Quick Start	Get running in 5 minutes
Configuration	Set up sources and rules
Dry Run Mode	Preview before committing
Production Hardening	Enterprise best practices
Architecture	How it works

🏭 Industry Templates

Pre-built configurations for common use cases:

Template	Use Case	Identifiers
Retail	Retail brands in the style of Nike or Lululemon	email, phone, loyalty_id, address
Healthcare	Patient matching	MRN, SSN, name+DOB
Financial	Account linking	account_id, email, SSN
B2B SaaS	Lead deduplication	email, domain, company_name

📊 Performance

Tested on retail customer data (10M rows):

Platform	Duration	Cost	Clusters
DuckDB	143s	Free	1.84M
Snowflake	168s	~$0.25	1.84M
BigQuery	295s	~$0.50	1.84M
Databricks	317s	TBD	1.84M

See benchmarks/ for full testing suite and results.

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Run tests locally
make test

# Generate docs locally
make docs

📜 License

Apache 2.0 — see LICENSE

⭐ Star this repo if you find it useful!
