video-data-quality-lab

A production-style data quality lab for streaming/video analytics: ETL → data contracts → validation (SQL+Python) → anomaly detection → dashboards → CI/CD, with root-cause playbooks.

What This Is

A comprehensive data validation framework for a video product pipeline (simulated). This lab demonstrates enterprise-grade data quality practices applied to video streaming analytics, from raw event ingestion through fact table construction and quality monitoring.

What This Proves

  • Multi-layer test strategy across raw → staging → facts
  • Data contracts (Pydantic) + column checks (Pandera) + SQL assertions
  • KPI tracking: freshness, completeness, consistency, uniqueness, accuracy
  • Anomaly detection: rule-based (z-score) + ML-based (IsolationForest demo)
  • CI/CD pipeline (GitHub Actions) with artifacts (quality report markdown)
  • Optional OpenSearch indexing for logs + dashboard query examples
  • Tableau-ready extracts (CSV/Parquet)
  • API layer for exposing quality signals to product/ops
  • Governance: data dictionary + RCA templates for incident management

Architecture

Raw Events (JSONL)
    ↓
ETL Pipeline (DuckDB + Python)
    ↓
Staging Tables → Data Contracts (Pydantic)
    ↓
Fact Tables → Column Checks (Pandera) + SQL Assertions
    ↓
KPI Computation (Freshness, Completeness, etc.)
    ↓
Anomaly Detection (Z-Score + IsolationForest)
    ↓
API + Dashboard Exports (OpenSearch/Tableau)

Quick Start

Get up and running in three commands (after creating a virtual environment):

# Recommended: Use a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies and run pipeline
make env      # Install dependencies
make etl      # Run ETL pipeline
make validate # Run all validations + generate reports

Or use the automated setup script:

./setup.sh    # Runs all steps automatically

What's Inside

Data Pipeline (etl/)

  • transform_events.py: Raw → Staging → Facts transformation using DuckDB
  • models.sql: SQL for dimensional models and fact tables
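The repo's transform runs through DuckDB; as a dependency-free illustration of the raw → staging step, here is a minimal sketch that parses JSONL events and quarantines malformed records rather than dropping them silently. The field names (`event_id`, `device`, `event_type`, `ts`) are illustrative, not the repo's actual schema.

```python
import json

# Hypothetical raw events; field names are illustrative only.
RAW_LINES = [
    '{"event_id": "e1", "device": "roku", "event_type": "play", "ts": "2024-01-01T00:00:00"}',
    '{"event_id": "e2", "device": "ios", "event_type": "buffer", "ts": "2024-01-01T00:00:05"}',
    'not valid json',  # bad rows are quarantined, not silently dropped
]

def to_staging(lines):
    """Parse JSONL raw events into staging rows, quarantining bad records."""
    staging, quarantine = [], []
    for line in lines:
        try:
            staging.append(json.loads(line))
        except json.JSONDecodeError:
            quarantine.append(line)
    return staging, quarantine

staging, quarantine = to_staging(RAW_LINES)
print(len(staging), len(quarantine))  # 2 1
```

Keeping a quarantine list (instead of raising on the first bad line) is what lets downstream KPIs report how much of a batch failed ingestion.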

Data Quality (dq/)

  • Contracts (contracts/): Pydantic schemas for ingestion-time validation
  • SQL Checks (checks_sql/): SQL assertions for constraints and data integrity
  • Python Checks (checks_py/): Pandera schemas + business rule validation
  • KPIs (kpis/): Definitions and computation for quality metrics

Anomaly Detection (anomaly/)

  • rules_zscore.py: Statistical anomaly detection using z-scores
  • iforest_demo.py: ML-based anomaly detection with IsolationForest
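As a sketch of the z-score rule in rules_zscore.py (the exact thresholds and metric names here are assumptions, not the repo's), a point is flagged when its distance from the mean exceeds a multiple of the standard deviation:

```python
from statistics import mean, stdev

def zscore_flags(series, threshold=3.0):
    """Return indices of points whose |z-score| exceeds the threshold."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]

# Daily buffering-event counts with one obvious spike (illustrative data).
buffering = [102, 98, 105, 101, 99, 103, 100, 480, 97, 104]
print(zscore_flags(buffering, threshold=2.0))  # [7]
```

The IsolationForest demo covers the complementary case: multivariate anomalies (e.g. bitrate and buffering drifting together) that a per-metric z-score misses.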

API Layer (api/)

  • app.py: FastAPI service exposing KPIs and health endpoints
  • tests/test_api.py: Contract tests for API endpoints

Governance (governance/)

  • data_dictionary.md: Complete data catalog with lineage
  • rca_templates/: Root cause analysis templates for incidents

CI/CD (ci/)

  • github-actions.yml: Automated quality pipeline
  • Generated artifacts: quality reports, KPI CSVs, anomaly flags

Operations (ops/)

  • docker-compose.yml: Local DuckDB + optional OpenSearch stack
  • index_to_opensearch.py: KPI log indexing for dashboards

Use Cases Demonstrated

1. ETL Validation

Transform raw video events (play, pause, buffer, seek) from connected devices (iOS, Android, Roku, Apple TV, Samsung TV) into fact tables with full validation at each layer.

2. Data Contract Enforcement

Pydantic models catch schema violations at ingestion time, preventing bad data from entering the pipeline.
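The actual contracts live in dq/contracts/ as Pydantic models; the stdlib sketch below shows the same idea with a dataclass that rejects out-of-enum values at construction time. Field names and allowed values are illustrative.

```python
from dataclasses import dataclass

ALLOWED_EVENTS = {"play", "pause", "buffer", "seek"}
ALLOWED_DEVICES = {"ios", "android", "roku", "apple_tv", "samsung_tv"}

@dataclass
class VideoEvent:
    """Stand-in for the repo's Pydantic contract; fields are illustrative."""
    event_id: str
    device: str
    event_type: str

    def __post_init__(self):
        # Reject events that violate the contract before they reach staging.
        if self.event_type not in ALLOWED_EVENTS:
            raise ValueError(f"unknown event_type: {self.event_type!r}")
        if self.device not in ALLOWED_DEVICES:
            raise ValueError(f"unknown device: {self.device!r}")

VideoEvent("e1", "roku", "play")        # passes the contract
try:
    VideoEvent("e2", "roku", "rewind")  # violates the event_type enum
except ValueError as err:
    print(err)
```

Failing loudly at the contract boundary is what keeps the downstream SQL and Pandera checks focused on business rules rather than basic shape errors.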

3. Quality KPI Tracking

Daily computation of:

  • Freshness: Data recency checks
  • Completeness: Missing value detection
  • Consistency: Cross-field validation
  • Uniqueness: Duplicate detection
  • Accuracy: Business rule compliance
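Two of these KPIs reduce to simple ratios; a minimal sketch (field names are hypothetical, not the repo's kpis/ definitions):

```python
def completeness(rows, field):
    """Share of rows where `field` is present and non-null."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def uniqueness(rows, key):
    """Share of distinct values of `key` across all rows."""
    if not rows:
        return 0.0
    values = [r[key] for r in rows]
    return len(set(values)) / len(values)

rows = [
    {"event_id": "e1", "bitrate": 4200},
    {"event_id": "e2", "bitrate": None},
    {"event_id": "e2", "bitrate": 3100},  # duplicate event_id
]
print(completeness(rows, "bitrate"))  # 0.666…
print(uniqueness(rows, "event_id"))   # 0.666…
```

Computed daily, these ratios become the time series that the anomaly detectors watch.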

4. Anomaly Detection

Automatic flagging of:

  • Statistical outliers (sessions, buffering, bitrate)
  • Pattern breaks in time series
  • Device/platform-specific issues

5. Root Cause Analysis

Structured templates for investigating and documenting quality incidents using the 5-Whys methodology.

6. Dashboard Integration

  • OpenSearch for real-time quality monitoring
  • Tableau-ready exports for executive reporting

Testing

make test  # Run full test suite
pytest -v  # Verbose test output

Tests cover:

  • Contract validation (Pydantic)
  • SQL constraint checks
  • Business rule enforcement
  • API endpoint contracts
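A business-rule test in this style might look like the following sketch; the rule (watch time must fit within the session) and the function names are hypothetical, not the repo's actual test suite:

```python
# Illustrative business rule: watch_seconds must be non-negative and
# cannot exceed the session duration.
def watch_time_is_valid(watch_seconds, session_seconds):
    return 0 <= watch_seconds <= session_seconds

def test_watch_time_within_session():
    assert watch_time_is_valid(120, 300)

def test_negative_watch_time_rejected():
    assert not watch_time_is_valid(-5, 300)

def test_watch_time_cannot_exceed_session():
    assert not watch_time_is_valid(400, 300)
```

pytest discovers `test_*` functions automatically, so each rule gets a small, named, independently failing check.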

API Usage

Start the API server:

make api  # or: uvicorn api.app:app --reload

Endpoints:

  • GET /health - Health check
  • GET /kpis/daily - Last 14 days of quality KPIs

CI/CD Pipeline

GitHub Actions workflow runs on every push:

  1. ETL pipeline execution
  2. Pandera schema validation
  3. SQL assertion checks
  4. KPI computation
  5. Anomaly detection scan
  6. API contract tests
  7. Artifact upload (quality reports)

Why This Matters

This lab demonstrates production-ready data quality engineering for video streaming platforms:

  • Connected Devices: Multi-platform support (iOS, Android, Roku, Apple TV, Samsung TV)
  • Automated Testing: Repeatable validation across the entire pipeline
  • Observability: KPIs, anomaly detection, and root cause frameworks
  • Integration-Ready: API for product teams, dashboards for ops, CI/CD for dev teams
  • Governance: Data dictionaries and incident response templates

Tech Stack

  • ETL: Python, DuckDB, Pandas
  • Validation: Pydantic, Pandera, SQL
  • ML/Anomaly: scikit-learn (IsolationForest)
  • API: FastAPI, Uvicorn
  • Testing: Pytest, httpx
  • CI/CD: GitHub Actions
  • Visualization: OpenSearch (optional), Tableau-ready exports

Project Structure

video-data-quality-lab/
├── README.md
├── requirements.txt
├── Makefile
├── data/
│   ├── raw/                    # Raw event JSONL files
│   └── reference/              # Device catalogs, reference data
├── etl/
│   ├── transform_events.py     # ETL pipeline
│   └── models.sql              # SQL models
├── dq/
│   ├── contracts/              # Pydantic schemas
│   ├── checks_sql/             # SQL assertions
│   ├── checks_py/              # Pandera + business rules
│   └── kpis/                   # KPI definitions + computation
├── anomaly/
│   ├── rules_zscore.py         # Statistical anomaly detection
│   └── iforest_demo.py         # ML anomaly detection
├── api/
│   ├── app.py                  # FastAPI service
│   └── tests/test_api.py       # API contract tests
├── governance/
│   ├── data_dictionary.md      # Data catalog
│   └── rca_templates/          # Root cause analysis
├── ci/
│   ├── github-actions.yml      # CI/CD pipeline
│   └── sample_quality_report.md
├── ops/
│   ├── docker-compose.yml      # Local stack
│   └── index_to_opensearch.py  # OpenSearch integration
└── tests/
    ├── test_contracts.py
    ├── test_business_rules.py
    └── test_sql_checks.py

License

MIT
