KlineForge is a production-grade crypto analytics warehouse that transforms raw Binance klines into trading intelligence: trusted, pre-aggregated analytics ready for BI, risk analysis, and exploration.
- Ingestion (Python) → Bronze (raw stage table) → Silver (fact_1h_klines) → Gold (dim_coins, dim_timestamp) → Marts (returns, volume, top movers, correlations) → BI / Analytics (Power BI, upcoming).
- Orchestrated by Dagster (daily at 07:00 UTC), with GitHub Actions CI/CD running tests on every push.
- Live dashboard: KlineForge Crypto Dashboard Live
- Dashboard PDF: KlineForge Crypto Dashboard PDF
KlineForge is an end-to-end crypto analytics data platform designed to ingest, transform, and model high-frequency cryptocurrency market data into analytics-ready datasets. The project focuses on robust incremental ingestion, cost-efficient transformations, and pre-aggregated analytical marts suitable for risk metrics, quantitative analysis, and BI consumption.
The pipeline is built on production-grade data engineering principles: incremental processing, idempotency, cost-aware modeling, Dagster-orchestrated workflows, testing, and CI/CD.
Aim: KlineForge transforms raw Binance spot kline data into a production-ready analytics platform. Our mission is to empower traders—from beginners to degens—with actionable insights that go beyond basic charts, providing a trusted foundation for risk management, opportunity detection, and portfolio diversification through modern ELT practices.
Technical Objectives:
To achieve this vision, the platform is engineered to:
- Robust Ingestion: Safely ingest large volumes of historical and incremental kline data with crash-safe, resumable checkpoints.
- Data Integrity: Standardize, cleanse, and deduplicate time-series data to ensure "Gold-standard" accuracy.
- Advanced Modeling: Architect analytics-ready Fact, Dimension, and Mart tables using a Medallion architecture.
- Optimization: Utilize dbt and Snowflake/BigQuery to optimize transformations for both performance and cost-efficiency.
- Reliable Orchestration: Schedule and monitor end-to-end pipelines using Dagster to ensure data freshness.
- Enable Intelligence: Provide the clean data layer necessary for downstream BI (Power BI/Streamlit) and quantitative strategy development.
| Layer | Tool | Purpose |
| --- | --- | --- |
| Data Source | Binance API | Raw market (kline) data |
| Ingestion | Python | Incremental, checkpoint-based ingestion |
| Data Warehouse | Snowflake | Storage & analytics engine |
| Transformation | dbt (SQL + Jinja) | Modeling, testing, documentation |
| Orchestration | Dagster | Asset-based orchestration & scheduling |
| CI/CD | GitHub Actions | Automated dbt testing & builds |
| Visualization | Power BI | Analytics & dashboards |
| Architecture Design | Draw.io | System architecture diagrams |
| Development | VS Code | Local development environment |
| Assistance | ChatGPT | Design, debugging, documentation |
Bronze layer purpose: Store raw Binance kline data with minimal transformation.
Key Characteristics:
- Append-only ingestion
- Historical backfill for new coins
- Raw epoch timestamps
- Ingestion metadata
- No business logic applied
Key Tables:
- binance_klines_daily
- binance_ingestion_checkpoints
- Incremental ingestion per coin
- Checkpoint table tracks last successfully ingested timestamp per coin
- Idempotent design allows safe re-runs (see the sketch below)
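A minimal sketch of the checkpoint pattern, assuming a `last_ingested_ts` column (the exact column names and the merge wording are illustrative, not the project's exact DDL):

```sql
-- Resume point: last successfully ingested hour per coin.
select coin, max(last_ingested_ts) as resume_from
from binance_ingestion_checkpoints
group by coin;

-- After a successful batch, advance the per-coin watermark.
-- Re-running the same batch rewrites the same value, so re-runs are safe.
merge into binance_ingestion_checkpoints t
using (select 'BTCUSDT' as coin,
              '2025-12-01 13:00:00'::timestamp_ntz as batch_end_ts) s
  on t.coin = s.coin
when matched then
    update set t.last_ingested_ts = greatest(t.last_ingested_ts, s.batch_end_ts)
when not matched then
    insert (coin, last_ingested_ts) values (s.coin, s.batch_end_ts);
```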
Generated Columns:
coin, interval, ingestion_ts
Challenges And Solutions:
- Slow sequential downloads (10+ hours) — Solution: Batched 60 days + parallelized with ThreadPoolExecutor.
- Partial ingestion due to day/hour grain mismatch — Solution: Hour-level checkpoints + recent re-ingestion.
- Discovery: Binance epochs are in milliseconds; the parser also handles microseconds for future-proofing.
Staging layer purpose: Clean, normalize, and standardize raw data before analytics.
Key Transformations:
- Epoch → standard timestamp conversion
- Timezone consistency (UTC)
- Deduplication using ingestion timestamp
- Column renaming & typing
Key Table:
stage_binance_klines
Deduplication Logic (keep only the freshest ingested row per key):
```sql
qualify row_number() over (
    partition by coin, open_timestamp
    order by ingestion_ts desc
) = 1
```
Grain: `coin, open_timestamp`
Why 1-Hour Fact:
- Canonical time grain for downstream analytics
- Single source of truth for price & volume metrics
Incremental Strategy:
- Coin-aware incremental logic
- Per-coin watermarks
- Historical backfill for new coins
- Append-only for existing coins (see the sketch below)
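A minimal dbt sketch of this coin-aware pattern, using the model and column names above (the config line and file path are illustrative, not the project's exact source):

```sql
-- models/fact_1h_klines.sql (illustrative)
{{ config(materialized='incremental', unique_key=['coin', 'open_timestamp']) }}

select s.*
from {{ ref('stage_binance_klines') }} s
{% if is_incremental() %}
left join (
    -- Per-coin high-water mark, precomputed once instead of a correlated subquery.
    select coin, max(open_timestamp) as max_ts
    from {{ this }}
    group by coin
) w
  on s.coin = w.coin
-- New coins (no watermark yet) get a full historical backfill;
-- existing coins append only rows past their own watermark.
where w.coin is null
   or s.open_timestamp > w.max_ts
{% endif %}
```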
dim_coins:
- Coin metadata
- Symbol normalization
Why dim_timestamp exists: Binance provides raw UTC epochs; analytics require rich temporal context.
Attributes:
- date, hour, day, month, year
- day_of_week, day_name
- week_of_year
- is_weekend
- Asia / EU / US trading sessions
Design Note: Trading sessions are analytical constructs derived from UTC hours and belong in a dimension, not facts.
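A sketch of how these attributes can be derived from UTC hours; the session boundary hours below are assumptions for illustration, not the project's published cutoffs, and `timestamp_spine` is a hypothetical source:

```sql
select
    t.ts_utc,
    hour(t.ts_utc)                   as hour,
    dayname(t.ts_utc)                as day_name,
    dayofweekiso(t.ts_utc) in (6, 7) as is_weekend,
    case                             -- session boundaries are illustrative
        when hour(t.ts_utc) < 8  then 'ASIA'
        when hour(t.ts_utc) < 16 then 'EU'
        else 'US'
    end as trading_session
from timestamp_spine t
```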
Marts are built on top of the fact table, not raw data, to:
- Reduce repeated heavy computations
- Lower warehouse cost
- Improve BI performance
Returns mart purpose: Price-based performance and risk analytics.
Key Metrics:
- Close price
- Previous close
- Return %: `(close - prev_close) / prev_close * 100` (profit/loss)
- Log return: `ln(close / prev_close)` (compounding-friendly)
- Rolling volatility (24h): `stddev(return_pct)` over 24 hours (risk measure)
- Volatility regime: High/Medium/Low based on average thresholds (alerts on regime shifts)
Why: Provides the core metrics for trading strategies and portfolio analysis: spotting trends and flagging high-volatility periods traders may want to avoid.
Challenges: Rolling windows break in incremental mode. Solution: re-process a 23-hour buffer so every 24h window stays complete (see the sketch below).
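A condensed sketch of the core return logic at the fact grain (the CTE structure is illustrative, not the project's exact model):

```sql
with base as (
    select
        coin,
        open_timestamp,
        close,
        lag(close) over (partition by coin order by open_timestamp) as prev_close
    from fact_1h_klines
),
returns as (
    select
        *,
        (close - prev_close) / prev_close * 100 as return_pct,
        ln(close / prev_close)                  as log_return
    from base
)
select
    *,
    -- 24h window = this hourly row plus the 23 preceding ones,
    -- which is why incremental runs re-read a 23-hour buffer.
    stddev(return_pct) over (
        partition by coin
        order by open_timestamp
        rows between 23 preceding and current row
    ) as rolling_volatility_24h
from returns
```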
Volume mart purpose: Liquidity and activity analysis.
Key Metrics:
- Volume
- Rolling volume (24h): `sum(volume)` over 24 hours (trend detection)
- Volume spike ratio: `rolling_volume / avg_volume` (flags surges)
- VWAP (24h): `sum(quote_volume) / sum(volume)` (true average price paid)
Why: Separates noise from real moves; high spikes signal interest.
Challenges: Duplicate rows inflated volume. Solution: deduplication in the fact layer (metrics sketch below).
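A sketch of the volume metrics at the same grain; the spike-ratio baseline here (a coin's all-history average of its 24h rolling volume) is an assumption:

```sql
with rolling as (
    select
        coin,
        open_timestamp,
        sum(volume) over (
            partition by coin order by open_timestamp
            rows between 23 preceding and current row
        ) as rolling_volume_24h,
        sum(quote_volume) over (
            partition by coin order by open_timestamp
            rows between 23 preceding and current row
        ) as rolling_quote_24h
    from fact_1h_klines
)
select
    coin,
    open_timestamp,
    rolling_volume_24h,
    -- Spike ratio: how unusual the current 24h volume is for this coin.
    rolling_volume_24h
        / nullif(avg(rolling_volume_24h) over (partition by coin), 0) as volume_spike_ratio,
    -- VWAP: quote volume is price * base volume, so the ratio recovers
    -- the volume-weighted average price actually paid over the window.
    rolling_quote_24h / nullif(rolling_volume_24h, 0) as vwap_24h
from rolling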
Top movers mart purpose: Opportunity scanner.
Why built on the returns mart: Uses pre-aggregated metrics to reduce compute and simplify joins.
Key Metrics:
- Return %
- Log return
- Rolling volatility
- Volume spike ratio
- Mover signal: `abs(return_pct) * volume_spike_ratio` (ranks coins by impact; see the sketch below)
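A ranking sketch joining the two upstream marts; the model names `mart_returns` and `mart_volume` and the top-10 cutoff are placeholders:

```sql
select
    r.coin,
    r.open_timestamp,
    r.return_pct,
    v.volume_spike_ratio,
    -- A big move on unusual volume outranks a big move on thin volume.
    abs(r.return_pct) * v.volume_spike_ratio as mover_signal
from mart_returns r
join mart_volume v
  on r.coin = v.coin
 and r.open_timestamp = v.open_timestamp
qualify row_number() over (
    partition by r.open_timestamp
    order by abs(r.return_pct) * v.volume_spike_ratio desc
) <= 10
```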
Correlations mart purpose: Cross-asset relationships, risk, and diversification analysis.
Key Metrics:
- 24h correlation: `(avg_xy - avg_x * avg_y) / sqrt((avg_x2 - avg_x^2) * (avg_y2 - avg_y^2))` (rolling Pearson)
- Correlation regime: High/Medium/Low based on `abs(corr)` thresholds (diversification check)
- Correlation breakdown alerts: `abs(current_corr - prev_corr) > 0.5` (flags sudden shifts)
Key Challenges Solved (illustrated in the sketch below):
- Manual rolling correlation (Snowflake-compatible)
- Canonical coin-pair ordering
- Strict timestamp alignment
- Incremental scalability
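A sketch of the manual rolling Pearson over self-joined hourly returns; `mart_returns` is a placeholder name:

```sql
with pairs as (
    select
        a.open_timestamp,
        a.coin       as coin_x,
        b.coin       as coin_y,
        a.return_pct as x,
        b.return_pct as y
    from mart_returns a
    join mart_returns b
      on a.open_timestamp = b.open_timestamp   -- strict timestamp alignment
     and a.coin < b.coin                       -- canonical ordering: one row per pair
),
moments as (
    select
        open_timestamp, coin_x, coin_y,
        avg(x)     over (partition by coin_x, coin_y order by open_timestamp
                         rows between 23 preceding and current row) as avg_x,
        avg(y)     over (partition by coin_x, coin_y order by open_timestamp
                         rows between 23 preceding and current row) as avg_y,
        avg(x * y) over (partition by coin_x, coin_y order by open_timestamp
                         rows between 23 preceding and current row) as avg_xy,
        avg(x * x) over (partition by coin_x, coin_y order by open_timestamp
                         rows between 23 preceding and current row) as avg_x2,
        avg(y * y) over (partition by coin_x, coin_y order by open_timestamp
                         rows between 23 preceding and current row) as avg_y2
    from pairs
)
select
    open_timestamp, coin_x, coin_y,
    -- Pearson from rolling moments: cov(x, y) / (std(x) * std(y)).
    (avg_xy - avg_x * avg_y)
        / nullif(sqrt((avg_x2 - avg_x * avg_x) * (avg_y2 - avg_y * avg_y)), 0) as corr_24h
from moments
```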
KlineForge includes a fully documented analytics suite built on top of curated warehouse models.
The dashboards translate raw crypto market data into:
- Market regime awareness
- Asset-level behavioral insights
- Cross-asset risk visibility
- Investment outcome simulation
All metrics are pre-computed in dbt marts to ensure:
- Consistency across views
- Fast query performance
- Transparent business logic
📊 Dashboard documentation and exports can be found in the /dashboards directory.
Why Dagster:
- Asset-based orchestration
- Native dbt integration
- Clear lineage & observability
Responsibilities:
- Schedule ingestion (daily)
- Trigger dbt assets
- Monitor failures & retries
CI/CD tool: GitHub Actions
Pipeline Capabilities:
- Python installation and setup
- dbt deps
- dbt build
- Isolated CI schemas
- Secrets-based authentication
Ensures transformations remain correct before deployment.
Grain conventions:
- Fact tables: 1 row per (coin, open_timestamp)
- Dimensions: One row per entity (coin, timestamp)
- Marts: Derived at the same hourly grain
Data quality tests are implemented using dbt schema.yml:
- Not null tests
- Unique constraints
- Accepted values
- Relationship tests
Ensures trust in analytical outputs.
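The generic tests above live in schema.yml; for illustration, a singular SQL test asserting the fact grain might look like this (the file name is hypothetical):

```sql
-- tests/assert_fact_grain_unique.sql (hypothetical singular test)
-- dbt treats any rows returned by this query as test failures.
select
    coin,
    open_timestamp,
    count(*) as duplicate_rows
from {{ ref('fact_1h_klines') }}
group by coin, open_timestamp
having count(*) > 1
```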
Performance optimizations:
- Replaced correlated subqueries with joins (anti-pattern shown after this list)
- Precomputed watermarks for incremental models
- Built marts on facts instead of raw data
- Reduced repeated window function usage
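For contrast, a sketch of the correlated-subquery anti-pattern that was removed; the join-based replacement is the precomputed-watermark pattern shown in the incremental sketch earlier:

```sql
-- Before: the inner max() is re-evaluated per row of the source scan.
select s.*
from stage_binance_klines s
where s.open_timestamp > (
    select max(f.open_timestamp)
    from fact_1h_klines f
    where f.coin = s.coin
);
```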
Lessons learned:
- Incremental pipelines must be entity-aware
- Epoch timestamps require careful handling (ms vs µs)
- Deduplication is mandatory for idempotency
- Grain alignment (hour vs day) is critical
- Pre-aggregation saves cost and compute
Getting started:
- Clone the repository
- Set up a Python virtual environment
- Configure Snowflake credentials in `profiles.yml`
- Run ingestion: `python src/ingestion/klines_ingest.py`
- Run dbt: `dbt deps && dbt build`
- Start Dagster: `dagster dev`
Roadmap:
- Airbyte for ingestion: multi-source (Binance plus Bybit, OKX, and others) with a dim_exchange for unified marts
- Migration to BigQuery Sandbox: free, permanent public access
- Streamlit analytics app: live/near-real-time views
- WebSocket ingestion: per-hour/minute updates for fresher data without overloading the sandbox
- Advanced alerting
This project is an open ecosystem designed for collaboration. I am actively looking for partners to help bridge the gap between Data Engineering and Market Intelligence.
Feel free to reach out for:
- Data Engineering: Discussions on pipeline optimization, Airbyte connectors, and medallion architecture.
- Analytics Engineering: Feedback on dbt modeling, testing frameworks, and data quality standards.
- Data Analysis: I am looking for analysts to build compelling visualizations and exploratory reports. If you want to practice building crypto dashboards in Power BI, Tableau, or Streamlit using high-quality, pre-modeled marts, let's connect.
- Crypto Use-Cases: Sharing alpha on new exchanges, coins, or on-chain data sources.
- Machine Learning (ML): I am providing a feature-rich, clean data foundation (Gold Marts). If you are looking for a high-integrity dataset for time-series forecasting, volatility prediction, or sentiment analysis, let’s collaborate to build predictive models on top of this warehouse.
"Rome wasn't built in a day but we are laying bricks every hour"
Built by Muhammad Bashir (Delex), Dec 2025
Data Engineering / Analytics Engineering. From API chaos to trader edge — let's make crypto smarter
- GitHub: https://github.com/Greatdev666
- LinkedIn: https://www.linkedin.com/in/delexcode29/
- X: https://x.com/delexcode29
- Gmail: codedelex@gmail.com



