This repository contains a production-grade data engineering pipeline that ingests real-time cryptocurrency data and processes it using a hybrid cloud architecture. The project demonstrates advanced orchestration techniques in Apache Airflow, leveraging Google Cloud Platform (GCP) for scalable processing.
Following Clean Code principles, all core processing logic is decoupled into a dedicated src/ directory, while the dags/ folder contains only lightweight orchestration scripts.
The system follows a multi-layer data architecture (Medallion Architecture) and is modeled using Data Vault 2.0 standards.
Figure 1: High-level ETL process designed in Camunda Modeler.
- Orchestration: Apache Airflow (Dockerized)
- Compute: PySpark, Google Cloud Dataproc
- Storage: Local SSD (Bronze/Silver), Google Cloud Storage, BigQuery (Gold)
- Table Format: Apache Iceberg
- Modeling: Data Vault 2.0 (Hubs & Satellites)
The project was managed using an Agile methodology and broken down into 6 key Epics.
Figure 2: Final state of CDP Sprint 1, showing all User Stories completed.
- Deployed a containerized Airflow environment on Ubuntu.
- Mapped local volumes for persistent storage and the `src/` module library.
- Automated hourly extraction from CoinGecko API.
- Persisted raw JSON data to local Bronze storage.
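
A minimal sketch of what this ingestion step can look like; the API parameters, Bronze path, and function name below are illustrative assumptions rather than the exact contents of `coingecko_ingestion.py`:

```python
# Illustrative Bronze-layer extraction; paths and parameters are assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

BRONZE_DIR = Path("data/bronze/coingecko")  # hypothetical local Bronze path
API_URL = "https://api.coingecko.com/api/v3/coins/markets"


def extract_market_data() -> Path:
    """Fetch current market data and persist the raw JSON payload as-is."""
    response = requests.get(
        API_URL,
        params={"vs_currency": "usd", "per_page": 250, "page": 1},
        timeout=30,
    )
    response.raise_for_status()

    BRONZE_DIR.mkdir(parents=True, exist_ok=True)
    run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_file = BRONZE_DIR / f"markets_{run_ts}.json"
    out_file.write_text(json.dumps(response.json()))
    return out_file
```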
- Implemented schema validation and deduplication using PySpark.
- Utilized Apache Iceberg for efficient data management and time travel capabilities.
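
A condensed sketch of the Silver-layer cleaning logic, assuming a Hadoop-backed Iceberg catalog named `local` and a handful of illustrative columns; the real schema in `transform_crypto_data.py` may differ:

```python
# Illustrative Silver-layer transform; catalog, table, and column names are assumptions.
# Requires the iceberg-spark-runtime package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("crypto-silver")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "data/silver/warehouse")
    .getOrCreate()
)

# Raw Bronze files are assumed to be JSON arrays, not newline-delimited JSON.
raw = spark.read.option("multiLine", True).json("data/bronze/coingecko/*.json")

clean = (
    raw.select("id", "symbol", "current_price", "total_volume", "last_updated")
    .filter(F.col("current_price").isNotNull())              # basic quality check
    .withColumn("last_updated", F.to_timestamp("last_updated"))
    .dropDuplicates(["id", "last_updated"])                  # drop re-ingested records
)

# The Iceberg table keeps snapshots, which is what enables time travel over Silver data.
clean.writeTo("local.silver.crypto_prices").createOrReplace()
```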
- Architected a scalable schema with Hubs and Satellites to ensure auditability.
- Developed modular Spark loaders to populate the Vault structure.
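
For illustration, a Hub and a Satellite loader might look roughly like this, assuming `id` is the business key; table, column, and hashing choices are placeholders rather than the exact logic in `vault_loader.py`:

```python
# Illustrative Data Vault 2.0 loaders; column names and hashing choices are assumptions.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

RECORD_SOURCE = "coingecko"


def load_hub_coin(clean: DataFrame) -> DataFrame:
    """Hub: one row per business key, with a deterministic hash key."""
    return (
        clean.select("id")
        .dropDuplicates(["id"])
        .withColumn("hub_coin_hk", F.sha2(F.col("id"), 256))
        .withColumn("load_ts", F.current_timestamp())
        .withColumn("record_source", F.lit(RECORD_SOURCE))
    )


def load_sat_coin_price(clean: DataFrame) -> DataFrame:
    """Satellite: descriptive attributes keyed by the hub hash key and load timestamp."""
    hashdiff_cols = F.concat_ws(
        "||",
        F.col("current_price").cast("string"),
        F.col("total_volume").cast("string"),
    )
    return (
        clean.withColumn("hub_coin_hk", F.sha2(F.col("id"), 256))
        .withColumn("hashdiff", F.sha2(hashdiff_cols, 256))  # change detection
        .withColumn("load_ts", F.current_timestamp())
        .withColumn("record_source", F.lit(RECORD_SOURCE))
        .select("hub_coin_hk", "hashdiff", "current_price", "total_volume",
                "last_updated", "load_ts", "record_source")
    )
```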
Figure 3: Airflow DAG orchestrating ephemeral Dataproc clusters.
- **Ephemeral Compute:** The pipeline dynamically creates a Dataproc cluster, submits the Spark job, and deletes the cluster upon completion to optimize costs (see the sketch after this list).
- **Reliability & Fault Tolerance:** The system is designed to handle transient cloud failures; as shown in the job history, the pipeline automatically retries failed tasks to keep the data consistent.
- **Data Warehouse:** Final optimized datasets are loaded into BigQuery for downstream analytics, following the Data Vault 2.0 structure.
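
A condensed sketch of how this stage can be wired together with the Airflow Google provider operators; the project, region, bucket URI, and retry settings below are placeholders:

```python
# Illustrative ephemeral-Dataproc DAG; project, region, and job URIs are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID, REGION, CLUSTER = "my-gcp-project", "europe-west1", "crypto-ephemeral"

with DAG(
    dag_id="crypto_gold_dataproc",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},  # fault tolerance
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        cluster_config={"worker_config": {"num_instances": 2}},
    )

    submit_job = DataprocSubmitJobOperator(
        task_id="submit_spark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/src/crypto_spark_pipeline.py"},
        },
    )

    # Tear the cluster down even if the Spark job fails, so no compute is left running.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER,
        trigger_rule="all_done",
    )

    create_cluster >> submit_job >> delete_cluster
```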
| Dataproc Job History | BigQuery Gold Layer |
|---|---|
| ![]() | ![]() |
| Successful PySpark executions | Final processed crypto data |
The final layer of the platform is an interactive Power BI dashboard that provides real-time insights into the processed cryptocurrency market data.
Figure 4: Interactive dashboard displaying market trends, price volatility, and historical analysis.
Key Features:
- Real-time Price Tracking: Connected directly to the BigQuery Gold layer for up-to-date market prices.
- Trend Analysis: Visualizing price movements and volume changes over time using Spark-processed historical data.
- Data Vault Integrity: The dashboard reflects the auditability and consistency of the underlying Data Vault 2.0 architecture.
A key feature of this project is the Separation of Concerns.
```text
.
├── airflow/                     # Airflow Environment
│   ├── dags/                    # DAG definitions (Orchestration only)
│   └── logs/                    # Execution logs
├── src/                         # Core Business Logic (PySpark & Python)
│   ├── coingecko_ingestion.py   # Main: Raw data extraction from API
│   ├── transform_crypto_data.py # Main: Spark-based cleaning & Silver layer logic
│   ├── vault_loader.py          # Main: Data Vault 2.0 Gold layer loading
│   ├── crypto_spark_pipeline.py # Main: Spark session & pipeline orchestration
│   │
│   ├── silver_layer_setup.py    # Setup: Pre-creating BigQuery tables & schemas
│   ├── test_gcp_connection.py   # Test: Validating GCP service account & connectivity
│   └── crypto_dv_dag.py         # Test: Local Airflow DAG for development
├── data/                        # Local Bronze/Silver storage
├── docs/images/                 # Architecture diagrams and screenshots
├── docker-compose.yaml          # Infrastructure as Code
├── Dockerfile                   # Custom Airflow image with Spark/GCP dependencies
├── gcp-key.json                 # Google Cloud Service Account key (Local only)
└── README.md
```
🔒 **Security Note:** The `gcp-key.json` file shown in the structure is used for local authentication to Google Cloud services. It is excluded from version control via `.gitignore` for security reasons; users should provide their own Service Account key to run the pipeline.
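
To illustrate the Separation of Concerns in practice, a DAG in `dags/` stays limited to scheduling and task wiring and imports the heavy lifting from `src/`. The callable name below is hypothetical:

```python
# dags/crypto_ingestion_dag.py -- orchestration only, no business logic.
# Illustrative sketch: the imported function name is a placeholder.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# The heavy lifting lives in src/, which is mounted into the Airflow containers.
from coingecko_ingestion import extract_market_data  # hypothetical entry point

with DAG(
    dag_id="crypto_bronze_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = PythonOperator(
        task_id="extract_coingecko",
        python_callable=extract_market_data,
    )
```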
---
### 🛠️ Development & Connectivity Tools
The stability and ease of development are supported by the following specialized scripts:
* **Connection Validation:** `test_gcp_connection.py` verifies Google Cloud authentication and permissions before the full pipeline runs (see the sketch after this list).
* **Schema Initialization:** `silver_layer_setup.py` automates the preparation of the BigQuery environment, following "Infrastructure as Code" principles.
* **Local Testing:** `crypto_dv_dag.py` enables local validation of Airflow DAG logic, minimizing cloud-side execution errors.
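
As a rough illustration, a connectivity check along these lines can be run before the first DAG execution; the key path, project ID, and specific checks below are assumptions, not the actual script:

```python
# Illustrative GCP connectivity check; key path, project ID, and checks are placeholders.
import os

from google.cloud import bigquery, storage

os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "gcp-key.json")


def check_gcp_access(project_id: str = "my-gcp-project") -> None:
    """Fail fast if the service account cannot reach GCS or BigQuery."""
    # GCS: listing buckets confirms the key is valid and has storage access.
    buckets = list(storage.Client(project=project_id).list_buckets(max_results=5))
    print(f"GCS OK - visible buckets: {[b.name for b in buckets]}")

    # BigQuery: a trivial query confirms job-creation permissions.
    bq = bigquery.Client(project=project_id)
    result = bq.query("SELECT 1 AS ok").result()
    print(f"BigQuery OK - test query returned {list(result)[0].ok}")


if __name__ == "__main__":
    check_gcp_access()
```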
