A production-style GitHub analytics data warehouse featuring incremental ingestion, hybrid snapshot modeling, automated daily pipelines, and comprehensive test coverage.
This project demonstrates modern data engineering system design, including:
- Incremental API ingestion with state tracking
- Hybrid event-driven + time-driven snapshot modeling
- Batch bulk loading with idempotent writes
- Two-phase ingestion pipeline (read → process → write)
- Transactional integrity and atomic commits
- Automated orchestration via GitHub Actions
- Analytical warehouse design with SQL views
- Data quality guarantees and schema constraints
- Comprehensive unit and integration tests
+----------------------+
| GitHub API |
+----------+-----------+
|
| Incremental Fetch (since)
|
+------v------+
| Ingestion |
| Pipeline |
| (Python) |
+------+------+
|
| Bulk Inserts
|
+---------v----------+
| Neon PostgreSQL |
| Data Warehouse |
+---------+----------+
|
|
+-------v--------+
| Analytical |
| SQL Views |
+-------+--------+
|
+-------v--------+
| BI Dashboard |
+----------------+
^
|
+--------+--------+
| GitHub Actions |
| Daily Scheduler |
+-----------------+
The system follows a layered architecture:
GitHub API
↓
Incremental Ingestion (Event-Driven)
↓
Dimension Tables
↓
Daily Snapshot Layer (Time-Driven)
↓
Analytical Views
↓
Dashboards / API Consumers
This design cleanly separates:
- Data acquisition
- Warehouse modeling
- Analytical consumption
The pipeline is split into two distinct phases to handle long-running API operations efficiently:
Executed synchronously at pipeline start:
- Fetch last run timestamp per company from
ingestion_state - Retrieve latest repository metrics from
repo_snapshots - Retrieve latest language metrics from
language_snapshots - Fetch list of companies to process
- Close connection immediately after
This phase completes in seconds. The connection is closed before the long-running API phase to avoid timeout issues with Neon's connection suspension on idle databases.
Long-running phase with no database activity:
- Iterate through all companies
- For each company, fetch repos updated since last run using
sinceparameter - For each updated repo, fetch language distribution from GitHub API
- Build in-memory data structures (lists, dicts) for all entities
- After API phase completes, establish new database connection
- Execute all writes in a single transaction
- Cleanup old snapshots (30+ days)
- Commit transaction atomically
This two-phase design:
- Solves connection suspension: Neon suspends idle connections after ~5 minutes. By closing Phase I immediately, we avoid the timeout during the long API phase.
- Maintains transactional guarantees: All writes happen in Phase III as a single atomic transaction.
- Scales with data volume: The 13k+ repos × long-running API calls scenario is handled gracefully.
The ingestion pipeline:
- Fetches only updated repository data using GitHub's
sinceparameter - Tracks ingestion state per company in
ingestion_statetable - Uses idempotent inserts with
ON CONFLICT DO NOTHING - Uses batch loading via
execute_values()for efficiency - Runs all writes inside a single database transaction
- Avoids redundant API calls for unchanged repositories
Additional operational behavior:
- API retry with exponential backoff (2^attempt seconds)
- GitHub rate limit handling and sleep on throttle
- Structured logging of pipeline execution
- Automatic cleanup of snapshots older than 30 days
companies — stores company metadata and tracking info
- company_id (PK)
- company_name
repos — stores repository metadata
- id (PK)
- company_id (FK)
- name, full_name, language
- created_at, updated_at
languages — stores unique programming languages
- language_id (PK)
- language_name (UNIQUE)
Stores daily time-series metrics for each repository:
- repo_id (FK)
- snapshot_date
- stars, forks, open_issues
Primary key: (repo_id, snapshot_date) — one snapshot per repository per day
Stores historical language composition per repository:
- repo_id (FK)
- snapshot_date
- language_id (FK)
- bytes (lines of code in that language)
Primary key: (repo_id, snapshot_date, language_id)
ingestion_state — tracks incremental ingestion progress
- source (company username) — PK
- last_run (timestamp of last successful ingestion)
Enables reliable incremental API consumption and safe recovery after failures.
This warehouse uses a hybrid modeling strategy optimizing for both API efficiency and analytical completeness.
Only repositories updated since the last run are fetched from GitHub. Uses the since parameter to minimize API calls — daily delta is typically a few hundred repos for stable organizations like Google, Meta, Apple.
Every ingestion run generates a daily snapshot for all tracked repositories, not just changed ones:
| Repo Status | Snapshot Source |
|---|---|
| Updated repo | Fresh API data |
| Unchanged repo | Latest stored metrics (reused) |
Benefits:
- API efficiency — avoid redundant calls for stable repos
- Deterministic time-series — guaranteed one snapshot per repo per day
- Predictable compute load — snapshot count is fixed daily
- Complete historical analytics — no gaps in time-series data
Analytical views are defined in sql/views.sql and support real-time BI dashboards and programmatic queries.
repo_daily_growth — calculates daily star growth using window functions
LAG(stars) OVER (PARTITION BY repo_id ORDER BY snapshot_date)company_total_stars — aggregates repository metrics across all repos per company
language_popularity_by_company — shows programming language distribution and trends per company
company_top_forked_repos — ranks repositories by forks within each company
repo_activity_score — composite metric combining stars, forks, and open issues
These views support:
- Time-series analytics and trend analysis
- BI dashboards and ad-hoc queries
- Programmatic API access to aggregated data
The pipeline is automated using GitHub Actions with daily scheduling.
schedule:
- cron: "0 2 * * *" # Runs every day at 02:00 UTC
workflow_dispatch: # Manual trigger available- Checkout repository
- Set up Python 3.11
- Install dependencies from
requirements.txt - Execute ingestion pipeline (
src/ingest.py) - Write results to Neon PostgreSQL data warehouse
All logs are captured by GitHub Actions and retained for 90 days, visible in the Actions tab of the repository.
Comprehensive test suite covering unit tests for database operations and integration tests for the ingestion pipeline.
tests/test_db.py — unit tests for database layer
- Read operations:
get_last_run,get_all_companies,get_latest_repo_metrics,get_latest_language_metrics,get_all_languages - Write operations:
bulk_insert_*functions - Cleanup operations:
clean_up_db - Tests for both empty and non-empty inputs
tests/test_ingest.py — integration tests for the ingestion pipeline
- Empty company list handling
- Non-empty ingestion with mock GitHub API
- Error handling: missing repos, missing fields, missing languages
- Cleanup function invocation
tests/conftest.py — pytest fixtures shared across tests
mock_conn— mocked database connectionmock_github_client— mocked GitHub API client
pytest tests/
pytest tests/test_db.py # Only database tests
pytest tests/test_ingest.py # Only pipeline tests
pytest tests/ -v # Verbose outputAll tests use mocking to avoid external dependencies:
- Database operations are mocked with
unittest.mock.MagicMock - GitHub API calls are mocked
execute_valuesis patched to prevent actual DB writes
The system enforces strong integrity and consistency guarantees.
- NOT NULL constraints on critical fields
- Foreign key relationships between all fact and dimension tables
- Composite primary keys ensuring uniqueness of daily snapshots
- UNIQUE constraints on language names
All inserts use ON CONFLICT DO NOTHING:
INSERT INTO repo_snapshots(...) VALUES %s
ON CONFLICT (repo_id, snapshot_date) DO NOTHING;This ensures safe pipeline reruns — re-executing the same day's ingestion is harmless.
Each repository receives exactly one snapshot per day, ensuring:
- Deterministic time-series data
- No duplicate snapshots
- Predictable row counts
The clean_up_db() function runs daily to delete snapshots older than 30 days, keeping storage capped at ~150 MB on Neon's free tier while maintaining sufficient history for trend analysis.
Structured logs record:
- Companies processed and count
- Repositories updated
- Snapshots written
- Language distributions captured
- Pipeline run duration
- Phase completion status
root/
│
├── sql/
│ ├── schema.sql # Table definitions and constraints
│ └── views.sql # Analytical views
│
├── src/
│ ├── config.py # Environment and configuration
│ ├── db.py # Database operations (reads + writes)
│ ├── github_client.py # GitHub API client with retry logic
│ ├── ingest.py # Two-phase ingestion orchestration
│ └── api.py # GitHub API reference
│
├── tests/
│ ├── conftest.py # Pytest fixtures
│ ├── test_db.py # Database operation tests
│ └── test_ingest.py # Pipeline integration tests
│
├── .github/
│ └── workflows/
│ └── ingest.yml # GitHub Actions daily scheduler
│
├── requirements.txt # Python dependencies
├── Makefile # Local development commands
├── .gitignore # Git ignore rules
└── README.md # This file
Potential enhancements for this project:
- Orchestration — migrate from GitHub Actions to Apache Airflow or Prefect for more sophisticated scheduling, error handling, and task dependencies
- Materialized Views — pre-compute expensive aggregations for dashboard performance
- Pipeline Monitoring — integrate with observability platforms (Datadog, New Relic) for alerting on failures and performance degradation
- dbt Integration — move SQL transformations to dbt for version control, testing, and documentation of the analytics layer
This project demonstrates:
- Incremental data pipeline design and optimization
- Warehouse modeling principles (facts, dimensions, slowly changing data)
- Hybrid snapshot strategies for balancing efficiency and completeness
- Transactional ingestion patterns and ACID guarantees
- Automated data pipelines via CI/CD (GitHub Actions)
- Analytical warehouse design with performant queries
- End-to-end data system thinking from source to analytics
- Comprehensive testing for data pipelines
The goal is to showcase data engineering architecture.