Skip to content

anilkulkarni87/sql-identity-resolution

Repository files navigation

πŸ”— SQL Identity Resolution

Tests Docs License GitHub stars GitHub forks

Snowflake BigQuery Databricks DuckDB GitHub last commit

Production-grade deterministic identity resolution for modern data warehouses. Unify customer identities across CRM, transactions, web events, and loyalty dataβ€”no ML required.

⚑ 60-Second Demo

# Clone and run
git clone https://github.com/anilkulkarni87/sql-identity-resolution.git
cd sql-identity-resolution
make demo

That's it! Open demo_results.html to see clustered identities.

🐳 Docker one-liner (no Python required)
docker run -it --rm -v $(pwd)/output:/output ghcr.io/anilkulkarni87/sql-identity-resolution:demo

🎯 Why SQL Identity Resolution?

Challenge Our Solution
Expensive CDPs Open source, runs on your warehouse
Black-box ML Deterministic rules, fully auditable
Vendor lock-in Same logic across 4 platforms
Scale limits Tested to 100M+ rows

How We Compare

vs CDPs vs ML-based (Zingg, Dedupe) vs dbt packages
Cost Free vs $5K-50K/mo No Spark cluster needed More complete pipeline
Control No vendor lock-in No ML training required Production-hardened
Transparency Full auditability Deterministic output Multi-platform

Who Is This For?

  • 🏒 SMBs wanting customer 360 without CDP costs
  • πŸ”§ Data engineers building composable CDPs
  • πŸ“Š Analysts who prefer SQL over Python/Spark
  • βš–οΈ Compliance teams needing auditable matching logic

πŸ—οΈ Supported Platforms

Platform Status Quickstart
DuckDB βœ… Full make demo (local)
Snowflake βœ… Full CALL idr_run('FULL', 30, FALSE);
BigQuery βœ… Full python sql/bigquery/idr_run.py --project=...
Databricks βœ… Full Run IDR_QuickStart.py notebook

✨ Key Features

  • 🎯 Cluster Confidence Scoring - Quality score (0-1) for each cluster based on edge diversity and match density
  • πŸ”’ Dry Run Mode - Preview changes before committing
  • πŸ“Š Metrics Export - Prometheus, DataDog, webhook support
  • πŸ›‘οΈ Data Quality Controls - max_group_size, exclusion lists
  • πŸ“ˆ Incremental Processing - Watermark-based efficiency
  • πŸ” Full Audit Trail - Every decision is traceable

πŸ“Š Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Sources   │────▢│  Configure  │────▢│  IDR Run    │────▢│   Output    β”‚
β”‚             β”‚     β”‚             β”‚     β”‚             β”‚     β”‚             β”‚
β”‚ β€’ CRM       β”‚     β”‚ β€’ Rules     β”‚     β”‚ β€’ Extract   β”‚     β”‚ β€’ Clusters  β”‚
β”‚ β€’ POS       β”‚     β”‚ β€’ Mappings  β”‚     β”‚ β€’ Match     β”‚     β”‚ β€’ Profiles  β”‚
β”‚ β€’ Web       β”‚     β”‚ β€’ Sources   β”‚     β”‚ β€’ Cluster   β”‚     β”‚ β€’ Metrics   β”‚
β”‚ β€’ Mobile    β”‚     β”‚             β”‚     β”‚             β”‚     β”‚             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4 Steps:

  1. Configure - Register sources and identifier mappings
  2. Extract - Pull identifiers (email, phone, loyalty ID)
  3. Match - Build edges between entities sharing identifiers
  4. Cluster - Label propagation to find connected components

πŸš€ Getting Started

Option 1: Local Demo (DuckDB)

make demo

Option 2: Platform-Specific

Snowflake
-- 1. Create objects
\i sql/snowflake/00_ddl_all.sql

-- 2. Configure and run
CALL idr_run('FULL', 30, FALSE);  -- FALSE = live run
CALL idr_run('FULL', 30, TRUE);   -- TRUE = dry run (preview)
BigQuery
# 1. Setup
bq query < sql/bigquery/00_ddl_all.sql

# 2. Run
pip install google-cloud-bigquery
python sql/bigquery/idr_run.py --project=your-project --run-mode=FULL
Databricks
  1. Import sql/databricks/notebooks/IDR_QuickStart.py
  2. Run all cells
  3. Check idr_out.identity_resolved_membership_current
dbt Package
# packages.yml
packages:
  - git: "https://github.com/anilkulkarni87/sql-identity-resolution"
    subdirectory: "dbt_idr"
dbt deps
dbt seed --select dbt_idr
dbt run --select dbt_idr

πŸ“– dbt Package Docs

πŸ“– Documentation

πŸ“š Full Documentation

Guide Description
Quick Start Get running in 5 minutes
Configuration Set up sources and rules
Dry Run Mode Preview before committing
Production Hardening Enterprise best practices
Architecture How it works

🏭 Industry Templates

Pre-built configurations for common use cases:

Template Use Case Identifiers
Retail Nike, Lululemon style email, phone, loyalty_id, address
Healthcare Patient matching MRN, SSN, name+DOB
Financial Account linking account_id, email, SSN
B2B SaaS Lead deduplication email, domain, company_name

πŸ“Š Performance

Tested on retail customer data (10M rows):

Platform Duration Cost Clusters
DuckDB 143s Free 1.84M
Snowflake 168s ~$0.25 1.84M
BigQuery 295s ~$0.50 1.84M
Databricks 317s TBD 1.84M

See benchmarks/ for full testing suite and results.

🀝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Run tests locally
make test

# Generate docs locally
make docs

πŸ“œ License

Apache 2.0 β€” see LICENSE


⭐ Star this repo if you find it useful!