A production-style data quality lab for streaming/video analytics: ETL → data contracts → validation (SQL+Python) → anomaly detection → dashboards → CI/CD, with root-cause playbooks.
A comprehensive data validation framework for a video product pipeline (simulated). This lab demonstrates enterprise-grade data quality practices applied to video streaming analytics, from raw event ingestion through fact table construction and quality monitoring.
- Multi-layer test strategy across raw → staging → facts
- Data contracts (Pydantic) + column checks (Pandera) + SQL assertions
- KPI tracking: freshness, completeness, consistency, uniqueness, accuracy
- Anomaly detection: rule-based (z-score) + ML-based (IsolationForest demo)
- CI/CD pipeline (GitHub Actions) with artifacts (quality report markdown)
- Optional OpenSearch indexing for logs + dashboard query examples
- Tableau-ready extracts (CSV/Parquet)
- API layer for exposing quality signals to product/ops
- Governance: data dictionary + RCA templates for incident management
Raw Events (JSONL)
↓
ETL Pipeline (DuckDB + Python)
↓
Staging Tables → Data Contracts (Pydantic)
↓
Fact Tables → Column Checks (Pandera) + SQL Assertions
↓
KPI Computation (Freshness, Completeness, etc.)
↓
Anomaly Detection (Z-Score + IsolationForest)
↓
API + Dashboard Exports (OpenSearch/Tableau)
Get up and running in 3 commands:
# Recommended: Use a virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies and run pipeline
make env # Install dependencies
make etl # Run ETL pipeline
make validate    # Run all validations + generate reports

Or use the automated setup script:

./setup.sh       # Runs all steps automatically

ETL (etl/):
- transform_events.py: Raw → Staging → Facts transformation using DuckDB
- models.sql: SQL for dimensional models and fact tables
Data Quality (dq/):
- Contracts (contracts/): Pydantic schemas for ingestion-time validation
- SQL Checks (checks_sql/): SQL assertions for constraints and data integrity
- Python Checks (checks_py/): Pandera schemas + business rule validation
- KPIs (kpis/): Definitions and computation for quality metrics
Anomaly Detection (anomaly/):
- rules_zscore.py: Statistical anomaly detection using z-scores
- iforest_demo.py: ML-based anomaly detection with IsolationForest
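A toy version of the iforest_demo.py idea: fit an IsolationForest on daily metric vectors and flag days that look unlike the rest. The feature pair (sessions, buffer ratio) and all values here are made up; `predict()` returns -1 for anomalies and 1 for inliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 30 "normal" days: ~5000 sessions, ~2% buffer ratio
normal_days = rng.normal(loc=[5000, 0.02], scale=[200, 0.005], size=(30, 2))
# One injected incident: normal traffic, but a huge buffer ratio
bad_day = np.array([[5100, 0.30]])
X = np.vstack([normal_days, bad_day])

model = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = model.predict(X)
print(labels[-1])  # label for the injected bad day
```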
API (api/):
- app.py: FastAPI service exposing KPIs and health endpoints
- tests/test_api.py: Contract tests for API endpoints
Governance (governance/):
- data_dictionary.md: Complete data catalog with lineage
- rca_templates/: Root cause analysis templates for incidents
CI/CD (ci/):
- github-actions.yml: Automated quality pipeline
- Generated artifacts: quality reports, KPI CSVs, anomaly flags
Ops (ops/):
- docker-compose.yml: Local DuckDB + optional OpenSearch stack
- index_to_opensearch.py: KPI log indexing for dashboards
Transform raw video events (play, pause, buffer, seek) from connected devices (iOS, Android, Roku, Apple TV, Samsung TV) into fact tables with full validation at each layer.
Pydantic models catch schema violations at ingestion time, preventing bad data from entering the pipeline.
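Assuming Pydantic v2, such a contract might look like the sketch below; the field names and enumerated values are placeholders, with the real schemas living in dq/contracts/.

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, ValidationError

class PlaybackEvent(BaseModel):
    event_id: str
    session_id: str
    event_ts: datetime
    event_type: Literal["play", "pause", "buffer", "seek"]
    device_platform: Literal["ios", "android", "roku", "apple_tv", "samsung_tv"]

# A well-formed event parses cleanly
ok = PlaybackEvent(
    event_id="e1", session_id="s1", event_ts="2024-06-01T10:00:00",
    event_type="play", device_platform="roku",
)

# An unknown event_type is rejected before it can reach staging
try:
    PlaybackEvent(
        event_id="e2", session_id="s1", event_ts="2024-06-01T10:00:05",
        event_type="rewind", device_platform="roku",
    )
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```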
Daily computation of:
- Freshness: Data recency checks
- Completeness: Missing value detection
- Consistency: Cross-field validation
- Uniqueness: Duplicate detection
- Accuracy: Business rule compliance
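Three of those KPIs can be computed over a fact-style frame with plain pandas, as in this sketch (the column names are illustrative, not the lab's actual schema):

```python
import pandas as pd

df = pd.DataFrame({
    "event_date": pd.to_datetime(["2024-06-01", "2024-06-01", "2024-06-02"]),
    "session_id": ["s1", "s1", "s2"],           # duplicate s1 on purpose
    "avg_bitrate_kbps": [3500.0, None, 8000.0], # one missing value
})

now = pd.Timestamp("2024-06-02")
kpis = {
    # Freshness: hours since the newest row landed
    "freshness_hours": (now - df["event_date"].max()).total_seconds() / 3600,
    # Completeness: share of non-null values in a required column
    "completeness": df["avg_bitrate_kbps"].notna().mean(),
    # Uniqueness: share of rows with a distinct (event_date, session_id)
    "uniqueness": 1 - df.duplicated(["event_date", "session_id"]).mean(),
}
print(kpis)
```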
Automatic flagging of:
- Statistical outliers (sessions, buffering, bitrate)
- Pattern breaks in time series
- Device/platform-specific issues
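The statistical-outlier rule can be as simple as the sketch below, in the spirit of rules_zscore.py: flag any day whose metric sits more than three standard deviations from the trailing window. The threshold and numbers are made up.

```python
import numpy as np

# Six normal days of session counts, then a collapsed seventh day
daily_sessions = np.array([5000, 5100, 4950, 5050, 4990, 5020, 1200.0])

# z-score of the latest day against the trailing window
mean = daily_sessions[:-1].mean()
std = daily_sessions[:-1].std()
z = (daily_sessions[-1] - mean) / std

flagged = abs(z) > 3.0  # rule: |z| > 3 means "investigate"
print(round(z, 1), flagged)
```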
Structured templates for investigating and documenting quality incidents with 5-Whys methodology.
- OpenSearch for real-time quality monitoring
- Tableau-ready exports for executive reporting
make test    # Run full test suite
pytest -v    # Verbose test output

Tests cover:
- Contract validation (Pydantic)
- SQL constraint checks
- Business rule enforcement
- API endpoint contracts
Start the API server:
make api    # or: uvicorn api.app:app --reload

Endpoints:
- GET /health: Health check
- GET /kpis/daily: Last 14 days of quality KPIs
GitHub Actions workflow runs on every push:
- ETL pipeline execution
- Pandera schema validation
- SQL assertion checks
- KPI computation
- Anomaly detection scan
- API contract tests
- Artifact upload (quality reports)
This lab demonstrates production-ready data quality engineering for video streaming platforms:
- Connected Devices: Multi-platform support (iOS, Android, Roku, Apple TV, Samsung TV)
- Automated Testing: Repeatable validation across the entire pipeline
- Observability: KPIs, anomaly detection, and root cause frameworks
- Integration-Ready: API for product teams, dashboards for ops, CI/CD for dev teams
- Governance: Data dictionaries and incident response templates
- ETL: Python, DuckDB, Pandas
- Validation: Pydantic, Pandera, SQL
- ML/Anomaly: scikit-learn (IsolationForest)
- API: FastAPI, Uvicorn
- Testing: Pytest, httpx
- CI/CD: GitHub Actions
- Visualization: OpenSearch (optional), Tableau-ready exports
video-data-quality-lab/
├── README.md
├── requirements.txt
├── Makefile
├── data/
│ ├── raw/ # Raw event JSONL files
│ └── reference/ # Device catalogs, reference data
├── etl/
│ ├── transform_events.py # ETL pipeline
│ └── models.sql # SQL models
├── dq/
│ ├── contracts/ # Pydantic schemas
│ ├── checks_sql/ # SQL assertions
│ ├── checks_py/ # Pandera + business rules
│ └── kpis/ # KPI definitions + computation
├── anomaly/
│ ├── rules_zscore.py # Statistical anomaly detection
│ └── iforest_demo.py # ML anomaly detection
├── api/
│ ├── app.py # FastAPI service
│ └── tests/test_api.py # API contract tests
├── governance/
│ ├── data_dictionary.md # Data catalog
│ └── rca_templates/ # Root cause analysis
├── ci/
│ ├── github-actions.yml # CI/CD pipeline
│ └── sample_quality_report.md
├── ops/
│ ├── docker-compose.yml # Local stack
│ └── index_to_opensearch.py # OpenSearch integration
└── tests/
├── test_contracts.py
├── test_business_rules.py
└── test_sql_checks.py
MIT