Demo repository showcasing graph analytics on OLAP data using PuppyGraph and ClickHouse for two real-world use cases: Customer 360 and Fraud Detection.
The Challenge: Traditional OLAP databases excel at analytical queries but struggle with relationship-based queries (friend networks, fraud rings, recommendation paths).
The Solution: PuppyGraph provides a zero-ETL graph query layer on top of ClickHouse, enabling both analytical SQL queries and graph traversals (Cypher/Gremlin) on the same data without duplication.
Performance: Graph queries that would take minutes with traditional JOIN-heavy SQL execute in milliseconds using PuppyGraph's native graph traversal.
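The performance gap comes from how each engine answers multi-hop questions: a JOIN-based plan rescans the edge table once per hop, while a graph engine follows an adjacency index hop by hop. A toy Python sketch of the two access patterns (illustrative only, not PuppyGraph internals):

```python
# Toy illustration of why multi-hop queries favor graph traversal over
# repeated self-joins. Data and names are made up for this sketch.
from collections import defaultdict

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E")]

# Traversal style: one indexed lookup per hop.
adj = defaultdict(set)
for src, dst in edges:
    adj[src].add(dst)

def two_hop(node):
    """Nodes reachable in exactly two hops via the adjacency index."""
    return {n2 for n1 in adj[node] for n2 in adj[n1]}

# JOIN style: self-join the edge list, scanning every edge at every hop.
def two_hop_join(node):
    return {d2 for s1, d1 in edges if s1 == node
               for s2, d2 in edges if s2 == d1}

assert two_hop("A") == two_hop_join("A") == {"C", "D"}
```

Both return the same answer; the join version's cost grows with the full edge count per hop, which is what makes deep traversals slow in plain SQL.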
- Use Cases
- Configuration
- Quick Start
- Dataset Overview
- Example Queries
- Available Commands
- Prerequisites
- Technologies
- Repository Structure
- Troubleshooting
- Test Results
Unified view of customer behavior combining transactional data with relationship graphs:
- Product affinity networks and recommendations
- Customer segmentation and behavioral analysis
- Cross-sell recommendation paths
- Purchase pattern detection
Scale: 35.4M records (1M customers, 7.3M transactions, 27M interactions, 50K products)
Queries: 25 SQL + 30 Cypher queries (55 total)
View Customer 360 Documentation →
Real-time fraud detection using graph pattern matching with embedded fraud scenarios:
- Fraud ring identification (account takeover, money laundering, card fraud)
- Suspicious transaction networks
- Shared attribute analysis (device, identity, address)
- Anomaly detection in relationships
Scale: 1.29M records (100K customers, 1M transactions, 50K devices, 10K merchants)
Fraud Scenarios: 5 embedded patterns (account takeover, money laundering, credit card fraud, synthetic identity, merchant collusion)
Queries: 17 SQL + 25 Cypher queries (42 total)
View Fraud Detection Documentation →
The repository uses two environment template files for different deployment modes:
Purpose: All-in-one configuration for application runtime and data generation
Setup:
```bash
cp .env.example .env
# Edit .env with your credentials and settings
```

Key Settings:
Application Configuration:
- ClickHouse connection (host, port, credentials)
- PuppyGraph configuration (ports, password)
- Streamlit application settings
- Data pipeline configuration
- Performance tuning
Data Generation Configuration:
- Customer scale (100K, 1M, 10M, 100M)
- Random seed for reproducibility
- Batch size and compression
- Use case selection (customer360, fraud-detection, both)
Usage: The consolidated .env file is used by both the application runtime and the generate_data.py script, eliminating the need for separate configuration files.
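Because one `.env` file serves both the app and the generator, each reads the same `KEY=VALUE` pairs. A minimal sketch of such a loader (illustrative only; the project may rely on python-dotenv or similar instead):

```python
# Minimal .env-style parser: KEY=VALUE lines, comments and blanks skipped.
# The key names below mirror this README's examples; treat the parser as
# a sketch, not the repo's actual loading code.
import os

def load_env(text):
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

sample = """
# ClickHouse connection
CLICKHOUSE_HOST=localhost
CLICKHOUSE_PORT=8123

# Data generation
CUSTOMER_SCALE=1000000
RANDOM_SEED=42
"""
env = load_env(sample)
os.environ.update(env)  # visible to both the app and generate_data.py
assert env["RANDOM_SEED"] == "42"
```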
Purpose: ClickHouse Cloud connection for hybrid deployment
Setup:
```bash
cp deployments/hybrid/.env.example deployments/hybrid/.env
# Edit with your ClickHouse Cloud credentials
make hybrid
```

Key Settings:
- ClickHouse Cloud instance details
- Secure connection configuration (port 9440, SSL enabled)
Run ClickHouse and PuppyGraph locally:
```bash
# Start services
make local

# Generate data (runs inside ClickHouse container)
make generate-local

# Check status
make status
```

Access:
- PuppyGraph Web UI: http://localhost:8081
- ClickHouse HTTP: http://localhost:8123
Run PuppyGraph locally, connect to ClickHouse Cloud:
Prerequisites:
- ClickHouse Cloud instance running and accessible
- Python dependencies: `pip install -r requirements.txt`
- Network connectivity to ClickHouse Cloud
```bash
# Install dependencies
pip install -r requirements.txt

# Configure ClickHouse Cloud connection
cp deployments/hybrid/.env.example deployments/hybrid/.env
# Edit .env with your credentials:
# CLICKHOUSE_HOST: your-instance.clickhouse.cloud
# CLICKHOUSE_PORT: 9440
# CLICKHOUSE_USER: default
# CLICKHOUSE_PASSWORD: your-password
# CLICKHOUSE_SECURE: true

# Start PuppyGraph
make hybrid

# Generate data (runs locally, ingests to cloud)
make generate-hybrid

# Check status
make status
```

Access:
- PuppyGraph Web UI: http://localhost:8081
- ClickHouse Cloud: Your cloud console
Total Dataset: 36.7M records across both use cases
- 1M customers (5 segments: VIP, Premium, Regular, Basic, New)
- 7.3M transactions ($10-$5,000 range, 8-12 per customer)
- 27M interactions (25 per customer: view, cart, wishlist, review, share)
- 50K products (10 categories)
SQL Performance: 25 queries including window functions, RFM scoring, and cohort analysis
Cypher Performance: 30 queries including multi-hop recommendations and purchase triangles
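RFM (Recency-Frequency-Monetary) scoring, one of the SQL workloads above, rates each customer on how recently, how often, and how much they buy. A hedged Python sketch of the idea (field names and the fixed thresholds are illustrative; real scoring typically uses per-segment quantiles, and this is not the repo's exact logic):

```python
# Illustrative RFM scoring: each dimension gets a 1/3/5 bucket score.
# Thresholds below are made up for the sketch.
from datetime import date

def rfm_score(last_purchase, n_orders, total_spend, today=date(2024, 1, 1)):
    recency_days = (today - last_purchase).days
    r = 5 if recency_days <= 30 else 3 if recency_days <= 90 else 1
    f = 5 if n_orders >= 10 else 3 if n_orders >= 4 else 1
    m = 5 if total_spend >= 2000 else 3 if total_spend >= 500 else 1
    return r, f, m

score = rfm_score(date(2023, 12, 20), n_orders=12, total_spend=3100)
assert score == (5, 5, 5)  # recent, frequent, high-value: a likely VIP
```

In ClickHouse the same bucketing is usually done with `quantile` window functions over the transactions table rather than fixed cutoffs.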
- 100K customers (3% fraudulent, 97% legitimate)
- 170K accounts (5% involved in fraud)
- 50K devices (10% suspicious)
- 10K merchants (8% fraudulent)
- 1M transactions (100K fraudulent, 900K legitimate)
Embedded Fraud Scenarios:
- Account Takeover (390 accounts) - Star pattern with 1 device accessing many accounts
- Money Laundering (390 accounts) - Circular transfer patterns
- Credit Card Fraud (390 accounts) - Bipartite pattern with stolen cards
- Synthetic Identity (390 accounts) - Clique pattern with shared fake identities
- Merchant Collusion (390 accounts) - Dense network of colluding merchants
SQL Performance: 17 queries including velocity analysis, risk distribution, and statistical measures
Cypher Performance: 25 queries including fraud rings, money laundering paths, and device sharing networks
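Device-sharing detection, one of the graph patterns listed above, reduces to grouping accounts by the device they touched and flagging devices with unusually many accounts (the "star" shape of an account-takeover ring). A toy Python sketch with made-up IDs:

```python
# Group accounts by shared device and flag the star pattern.
# Edge list and threshold are illustrative only.
from collections import defaultdict

used_device = [  # (account_id, device_id)
    ("ACC_1", "DEV_X"), ("ACC_2", "DEV_X"), ("ACC_3", "DEV_X"),
    ("ACC_4", "DEV_X"), ("ACC_5", "DEV_X"), ("ACC_6", "DEV_Y"),
]

accounts_by_device = defaultdict(set)
for account, device in used_device:
    accounts_by_device[device].add(account)

# A device touching 5+ accounts is suspicious (threshold is illustrative).
suspicious = {d: accs for d, accs in accounts_by_device.items() if len(accs) >= 5}
assert set(suspicious) == {"DEV_X"}
```

The fraud-detection Cypher queries express this same grouping declaratively over the `USED_DEVICE` edges.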
```sql
-- Top customers by lifetime value
SELECT
    customer_id,
    name,
    segment,
    ltv AS lifetime_value
FROM customers
WHERE segment = 'VIP'
ORDER BY ltv DESC
LIMIT 10;
```

```cypher
// Find product recommendations via collaborative filtering
MATCH (target:Customer {customer_id: 'CUST_12345'})-[:PURCHASED]->(p1:Product)
MATCH (other:Customer)-[:PURCHASED]->(p1)
MATCH (other)-[:PURCHASED]->(p2:Product)
WHERE target.segment = other.segment
  AND NOT (target)-[:PURCHASED]->(p2)
  AND target <> other
WITH p2, collect(DISTINCT other) AS similar_customers
RETURN DISTINCT p2.name AS recommended_product,
       p2.category,
       size(similar_customers) AS recommendation_strength
ORDER BY recommendation_strength DESC
LIMIT 10;
```

```cypher
// Detect account takeover rings (devices accessing many accounts)
MATCH (a1:Account)-[:USED_DEVICE]->(d:Device)<-[:USED_DEVICE]-(a2:Account)
WHERE a1.account_id <> a2.account_id
WITH d, collect(DISTINCT a1.account_id) + collect(DISTINCT a2.account_id) AS connected_accounts
WHERE size(connected_accounts) >= 5
RETURN d.device_id,
       d.device_fingerprint,
       d.location,
       connected_accounts,
       size(connected_accounts) AS account_count
ORDER BY account_count DESC
LIMIT 10;
```

```bash
# Deployment
make local            # Start local deployment
make hybrid           # Start hybrid deployment

# Data Generation
make generate-local   # Generate data for local deployment
make generate-hybrid  # Generate data for hybrid deployment

# Operations
make status           # Check deployment status
make logs             # View container logs
make clean            # Stop containers and clean up

# Quick start (deploy + generate + status)
make local-quick      # Complete local setup
make hybrid-quick     # Complete hybrid setup
```

```bash
# Using defaults (1M customers, both use cases)
python3 generate_data.py

# Custom configuration
python3 generate_data.py \
  --customers 500000 \
  --seed 42 \
  --use-case both \
  --output-dir data \
  --compression snappy \
  --verbose
```

Available options:
- `--customers`: Number of customers (100K, 1M, 10M, 100M)
- `--seed`: Random seed for reproducibility
- `--use-case`: customer360, fraud-detection, or both
- `--output-dir`: Output directory for data files
- `--compression`: Parquet compression (snappy, gzip, lz4)
- `--verbose`: Enable debug logging
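An argument parser matching the options above might look like the following (hedged sketch; the actual `generate_data.py` may define its flags differently):

```python
# Illustrative CLI definition mirroring the documented flags.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Generate demo datasets")
    p.add_argument("--customers", type=int, default=1_000_000,
                   help="Number of customers (100K, 1M, 10M, 100M)")
    p.add_argument("--seed", type=int, default=42,
                   help="Random seed for reproducibility")
    p.add_argument("--use-case",
                   choices=["customer360", "fraud-detection", "both"],
                   default="both")
    p.add_argument("--output-dir", default="data",
                   help="Output directory for data files")
    p.add_argument("--compression", choices=["snappy", "gzip", "lz4"],
                   default="snappy", help="Parquet compression codec")
    p.add_argument("--verbose", action="store_true",
                   help="Enable debug logging")
    return p

args = build_parser().parse_args(["--customers", "500000", "--use-case", "both"])
assert args.customers == 500000 and args.compression == "snappy"
```

Note that argparse maps `--use-case` to `args.use_case` (dashes become underscores).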
Data Scales:
| Scale | Customers | Total Records | Generation Time | RAM Required | Disk Space |
|---|---|---|---|---|---|
| Small | 100K | ~1.3M | 5-10 min | 4GB | 2GB |
| Medium | 1M | ~35M | 30-45 min | 8GB | 10GB |
| Large | 10M | ~350M | 4-6 hours | 16GB | 50GB |
| Enterprise | 100M+ | ~3.5B+ | 1-2 days | 32GB+ | 500GB+ |
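The table's record counts can be sanity-checked with a back-of-envelope estimate: Customer 360 volume grows roughly linearly with customer count, using the ratios from the dataset overview (about 7.3 transactions and 27 interactions per customer, plus a fixed 50K-product catalog). Treat this as an approximation, not the generator's exact output:

```python
# Rough record-count estimator for the Customer 360 dataset.
# Per-customer ratios are taken from this README's dataset overview.
def estimate_customer360_records(customers):
    transactions = customers * 7.3   # stated range 8-12, observed avg ~7.3
    interactions = customers * 27    # views, carts, wishlists, reviews, shares
    products = 50_000                # fixed catalog size
    return int(customers + transactions + interactions + products)

est = estimate_customer360_records(1_000_000)
assert 35_000_000 < est < 36_000_000  # consistent with the ~35M "Medium" row
```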
- Docker and Docker Compose
- 8GB+ RAM recommended (16GB for large datasets)
- 10GB+ free disk space
- Docker and Docker Compose (for PuppyGraph)
- Python 3.8+ with pip: `pip install -r requirements.txt`
- ClickHouse Cloud account with instance running
- 4GB+ RAM recommended (8GB for large datasets)
- 10GB+ free disk space
- ClickHouse: OLAP database for analytical queries
- PuppyGraph: Zero-ETL graph query engine
- Docker: Containerization
- Python: Data generation and pipeline
- Cypher/Gremlin: Graph query languages
- Parquet: Columnar data format with Snappy compression
```
.
├── README.md                        # This file
├── use-cases/
│   ├── customer-360/
│   │   ├── README.md                # Customer 360 documentation
│   │   ├── generator.py             # Customer 360 data generator
│   │   ├── queries.sql              # 25 SQL queries
│   │   └── queries.cypher           # 30 Cypher queries
│   └── fraud-detection/
│       ├── README.md                # Fraud Detection documentation
│       ├── generator.py             # Fraud Detection data generator
│       ├── queries.sql              # 17 SQL queries
│       └── queries.cypher           # 25 Cypher queries
│
├── app/
│   ├── database/                    # ClickHouse client
│   ├── graph/queries.py             # Cypher query wrapper
│   └── pipeline/                    # Data ingestion pipeline
│
├── config/puppygraph/               # PuppyGraph schema definitions
├── deployments/
│   ├── local/                       # Local Docker deployment configs
│   └── hybrid/                      # Hybrid deployment configs
│
├── generate_data.py                 # Data generation CLI
├── Makefile                         # Deployment commands
├── CONFIG.yaml                      # Project configuration
└── requirements.txt                 # Python dependencies
```
```bash
# Check container status
make status
docker ps

# View logs
make logs
docker logs clickhouse-local
docker logs puppygraph-local
```

Symptom: Authentication failed or SSL record layer failure
Solution:
```bash
# Reset containers with clean state
make clean
docker-compose -f deployments/local/docker-compose.yml down -v
make local

# Verify authentication
docker exec clickhouse-local clickhouse-client --password=clickhouse123 --query "SELECT 1"
```

Symptom: Connection refused when running make generate-hybrid
Checklist:
- ClickHouse Cloud instance is running
- `.env` file has correct credentials (no quotes around values)
- `CLICKHOUSE_PORT=9440` (secure port for cloud)
- `CLICKHOUSE_SECURE=true` is set
- Firewall allows outbound connections to port 9440
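The credential checks in this checklist can be automated with a small validator (hedged sketch; the key names follow this README's `.env` examples, and the checks are illustrative rather than exhaustive):

```python
# Validate ClickHouse Cloud settings parsed from .env.
# Returns a list of human-readable problems; empty list means OK.
def validate_cloud_env(env):
    problems = []
    for key in ("CLICKHOUSE_HOST", "CLICKHOUSE_PASSWORD"):
        value = env.get(key, "")
        if not value:
            problems.append(f"{key} is missing")
        elif value[0] in "'\"":
            problems.append(f"{key} should not be quoted")
    if env.get("CLICKHOUSE_PORT") != "9440":
        problems.append("CLICKHOUSE_PORT should be 9440 for ClickHouse Cloud")
    if env.get("CLICKHOUSE_SECURE", "").lower() != "true":
        problems.append("CLICKHOUSE_SECURE must be true")
    return problems

good = {"CLICKHOUSE_HOST": "x.clickhouse.cloud", "CLICKHOUSE_PASSWORD": "pw",
        "CLICKHOUSE_PORT": "9440", "CLICKHOUSE_SECURE": "true"}
assert validate_cloud_env(good) == []
assert validate_cloud_env({**good, "CLICKHOUSE_PORT": "8123"}) != []
```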
Solution:
```bash
# Test connection manually
clickhouse-client \
  --host=your-instance.clickhouse.cloud \
  --port=9440 \
  --user=default \
  --password=your-password \
  --secure \
  --query="SELECT 1"
```

Symptom: Error starting container: port is already allocated
Solution:
```bash
# Check what's using the ports
lsof -i :8081  # PuppyGraph Web UI
lsof -i :8123  # ClickHouse HTTP
lsof -i :9000  # ClickHouse Native

# Stop conflicting processes
docker stop $(docker ps -q)
```

Symptom: Process killed by OS or taking too long
Solution:
```bash
# Start with smallest scale
python generate_data.py --customers 10000 --use-case customer360

# Monitor memory
docker stats

# Reduce batch size if needed
python generate_data.py --customers 100000 --batch-size 50000
```

Symptom: ModuleNotFoundError
Solution:
```bash
# Install all dependencies
pip install -r requirements.txt
```

Symptom: Table customer360.customers doesn't exist
Solution:
```bash
# Check databases and tables
docker exec clickhouse-local clickhouse-client --query "SHOW DATABASES"
docker exec clickhouse-local clickhouse-client --query "SHOW TABLES FROM customer360"

# Regenerate data if missing
make generate-local  # or make generate-hybrid
```

If nothing else works, start completely fresh:
```bash
# Stop everything
make clean

# Remove all Docker resources
docker-compose -f deployments/local/docker-compose.yml down -v
docker-compose -f deployments/hybrid/docker-compose.yml down -v

# Remove generated data
rm -rf data/

# Start fresh
make local           # or make hybrid
make generate-local  # or make generate-hybrid
```

- Zero-ETL: No data duplication between OLAP and graph stores
- Dual Query Support: Run both SQL and graph queries on same data
- Real-time Analytics: Graph queries execute in milliseconds
- Scalability: Handles billions of edges efficiently
- Flexibility: Choose local or cloud deployment
All 97 queries validated with 100% success rate.
| Use Case | SQL Queries | Cypher Queries | Total |
|---|---|---|---|
| Customer 360 | 25 | 30 | 55 |
| Fraud Detection | 17 | 25 | 42 |
| Total | 42 | 55 | 97 |
SQL (ClickHouse) - Heavy Analytics:
- Window functions (running totals, rolling averages)
- Statistical measures (percentiles, standard deviation, variance)
- RFM scoring (Recency-Frequency-Monetary)
- Cohort retention analysis
- Month-over-month growth calculations
- Multi-dimensional aggregations (GROUPING SETS)
Cypher (PuppyGraph) - Graph Patterns:
- Multi-hop product recommendations (3+ degrees)
- Customer purchase triangles (shared products)
- Fraud ring detection (device sharing, circular transfers)
- Brand ecosystem analysis
- Cross-category affinity paths
- View-to-purchase conversion funnels
| Metric | Value |
|---|---|
| Success Rate | 100% |
| Total Rows Returned | 2,981 |
| Median Latency | 28ms |
| P95 Latency | 148ms |
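The median and P95 figures above can be derived from raw per-query timings. A sketch with synthetic numbers (the actual test runner's method is not shown here; `statistics.quantiles` interpolates, so percentiles on small samples are approximate):

```python
# Compute median and P95 latency from a list of per-query timings.
# The latency values below are synthetic, not the repo's measurements.
import statistics

latencies_ms = [12, 18, 22, 28, 28, 28, 45, 60, 110, 150]
median = statistics.median(latencies_ms)
# 95th percentile: index 94 of the 99 cut points for n=100.
p95 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]
assert median == 28
assert 110 <= p95 <= 150  # interpolated between the two slowest queries
```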
```bash
# Ensure services are running
make local

# Run comprehensive test suite
CLICKHOUSE_DATABASE=default python3 test/comprehensive_test_runner.py \
  --deployment local \
  --use-case all \
  --skip-data-loading
```