Production-grade deterministic identity resolution for modern data warehouses. Unify customer identities across CRM, transactions, web events, and loyalty dataβno ML required.
# Clone and run
git clone https://github.com/anilkulkarni87/sql-identity-resolution.git
cd sql-identity-resolution
make demoThat's it! Open demo_results.html to see clustered identities.
π³ Docker one-liner (no Python required)
docker run -it --rm -v $(pwd)/output:/output ghcr.io/anilkulkarni87/sql-identity-resolution:demo| Challenge | Our Solution |
|---|---|
| Expensive CDPs | Open source, runs on your warehouse |
| Black-box ML | Deterministic rules, fully auditable |
| Vendor lock-in | Same logic across 4 platforms |
| Scale limits | Tested to 100M+ rows |
| vs CDPs | vs ML-based (Zingg, Dedupe) | vs dbt packages | |
|---|---|---|---|
| Cost | Free vs $5K-50K/mo | No Spark cluster needed | More complete pipeline |
| Control | No vendor lock-in | No ML training required | Production-hardened |
| Transparency | Full auditability | Deterministic output | Multi-platform |
- π’ SMBs wanting customer 360 without CDP costs
- π§ Data engineers building composable CDPs
- π Analysts who prefer SQL over Python/Spark
- βοΈ Compliance teams needing auditable matching logic
| Platform | Status | Quickstart |
|---|---|---|
| DuckDB | β Full | make demo (local) |
| Snowflake | β Full | CALL idr_run('FULL', 30, FALSE); |
| BigQuery | β Full | python sql/bigquery/idr_run.py --project=... |
| Databricks | β Full | Run IDR_QuickStart.py notebook |
- π― Cluster Confidence Scoring - Quality score (0-1) for each cluster based on edge diversity and match density
- π Dry Run Mode - Preview changes before committing
- π Metrics Export - Prometheus, DataDog, webhook support
- π‘οΈ Data Quality Controls - max_group_size, exclusion lists
- π Incremental Processing - Watermark-based efficiency
- π Full Audit Trail - Every decision is traceable
βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ
β Sources ββββββΆβ Configure ββββββΆβ IDR Run ββββββΆβ Output β
β β β β β β β β
β β’ CRM β β β’ Rules β β β’ Extract β β β’ Clusters β
β β’ POS β β β’ Mappings β β β’ Match β β β’ Profiles β
β β’ Web β β β’ Sources β β β’ Cluster β β β’ Metrics β
β β’ Mobile β β β β β β β
βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ
4 Steps:
- Configure - Register sources and identifier mappings
- Extract - Pull identifiers (email, phone, loyalty ID)
- Match - Build edges between entities sharing identifiers
- Cluster - Label propagation to find connected components
make demoSnowflake
-- 1. Create objects
\i sql/snowflake/00_ddl_all.sql
-- 2. Configure and run
CALL idr_run('FULL', 30, FALSE); -- FALSE = live run
CALL idr_run('FULL', 30, TRUE); -- TRUE = dry run (preview)BigQuery
# 1. Setup
bq query < sql/bigquery/00_ddl_all.sql
# 2. Run
pip install google-cloud-bigquery
python sql/bigquery/idr_run.py --project=your-project --run-mode=FULLDatabricks
- Import
sql/databricks/notebooks/IDR_QuickStart.py - Run all cells
- Check
idr_out.identity_resolved_membership_current
dbt Package
# packages.yml
packages:
- git: "https://github.com/anilkulkarni87/sql-identity-resolution"
subdirectory: "dbt_idr"dbt deps
dbt seed --select dbt_idr
dbt run --select dbt_idr| Guide | Description |
|---|---|
| Quick Start | Get running in 5 minutes |
| Configuration | Set up sources and rules |
| Dry Run Mode | Preview before committing |
| Production Hardening | Enterprise best practices |
| Architecture | How it works |
Pre-built configurations for common use cases:
| Template | Use Case | Identifiers |
|---|---|---|
| Retail | Nike, Lululemon style | email, phone, loyalty_id, address |
| Healthcare | Patient matching | MRN, SSN, name+DOB |
| Financial | Account linking | account_id, email, SSN |
| B2B SaaS | Lead deduplication | email, domain, company_name |
Tested on retail customer data (10M rows):
| Platform | Duration | Cost | Clusters |
|---|---|---|---|
| DuckDB | 143s | Free | 1.84M |
| Snowflake | 168s | ~$0.25 | 1.84M |
| BigQuery | 295s | ~$0.50 | 1.84M |
| Databricks | 317s | TBD | 1.84M |
See benchmarks/ for full testing suite and results.
We welcome contributions! See CONTRIBUTING.md for guidelines.
# Run tests locally
make test
# Generate docs locally
make docsApache 2.0 β see LICENSE
β Star this repo if you find it useful!