This project demonstrates a complete end-to-end Azure Data Engineering pipeline built on top of the following technologies:
- GitHub – Source CSV data files
- Azure Data Factory (ADF) – Ingest CSV data from GitHub via HTTP and store it in Azure Data Lake Storage (ADLS) in Parquet format (Staging Layer)
- Azure Data Lake Storage Gen2 (ADLS) – Cloud storage for both the staging and transformed data layers
- Azure Databricks – Read, transform, and process Parquet data from staging; write the transformed data as CSV files (Transform Layer)
- Azure Key Vault – Secure storage for the secrets and credentials used by the pipeline
- Azure Synapse Analytics – Read transformed CSV files from ADLS, create Lake Databases and tables, and perform analytical queries using T-SQL
GitHub (CSV files) → Azure Data Factory (HTTP Connector) → Azure Data Lake Storage Gen2 (Staging Layer, Parquet) → Azure Databricks (Transformations in PySpark) → Azure Data Lake Storage Gen2 (Transform Layer, CSV) → Azure Synapse Analytics (Lake Database & Analysis)
| Component | Technology |
|---|---|
| Data Source | GitHub (CSV Files) |
| Ingestion | Azure Data Factory (HTTP) |
| Storage | Azure Data Lake Storage Gen2 |
| Processing & Transformation | Azure Databricks (PySpark) |
| Security | Azure Key Vault (see the sketch below) |
| Serving & Analysis | Azure Synapse Analytics (Serverless SQL) |
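Secrets such as the ADLS access key are kept in Azure Key Vault rather than hard-coded in notebooks. Below is a minimal sketch of how a Databricks notebook might read such a secret through a Key Vault-backed secret scope; the scope, secret, and storage account names are hypothetical placeholders, not values from this project.

```python
# Minimal sketch: fetch an ADLS Gen2 access key from a Key Vault-backed
# Databricks secret scope. All names below are hypothetical placeholders.
storage_account = "olympicsdatalake"      # hypothetical storage account name

# `dbutils` and `spark` are predefined inside Databricks notebooks.
access_key = dbutils.secrets.get(
    scope="keyvault-scope",               # secret scope backed by Azure Key Vault
    key="adls-access-key",                # secret holding the storage account key
)

# Configure Spark to authenticate against ADLS Gen2 with the retrieved key.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    access_key,
)
```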
**Step 1: Ingestion (Azure Data Factory)**
- Fetch raw CSV files from GitHub using an HTTP linked service.
- Convert the data to Parquet format and store it in the ADLS Gen2 Staging Layer (a conceptual sketch of this conversion follows).
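The copy activity itself is configured in the ADF UI rather than written as code, but conceptually it performs the equivalent of the following Python sketch; the GitHub URL and output path are hypothetical placeholders.

```python
# Conceptual stand-in for the ADF copy activity: fetch a CSV over HTTP
# and persist it as Parquet. URL and output path are hypothetical.
import io

import pandas as pd
import requests

SOURCE_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/athletes.csv"

response = requests.get(SOURCE_URL, timeout=30)
response.raise_for_status()

df = pd.read_csv(io.StringIO(response.text))

# In the real pipeline the sink is an ADLS Gen2 container;
# a local path stands in for it here.
df.to_parquet("staging/athletes.parquet", index=False)
```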
**Step 2: Transformation (Azure Databricks)**
- Read Parquet files from the Staging Layer.
- Apply transformations using PySpark (e.g., schema correction, filtering, type casting).
- Write the transformed data as CSV files into the ADLS Gen2 Transform Layer (see the PySpark sketch after this list).
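A minimal PySpark sketch of this stage is shown below; the container paths and column names are hypothetical placeholders, not the project's actual schema.

```python
# Minimal PySpark sketch of the transform stage. Paths and column names
# are hypothetical placeholders.
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

staging_path = "abfss://staging@<account>.dfs.core.windows.net/athletes/"
transform_path = "abfss://transform@<account>.dfs.core.windows.net/athletes/"

# Read the Parquet files that ADF landed in the Staging Layer.
df = spark.read.parquet(staging_path)  # `spark` is predefined in Databricks

clean = (
    df.withColumnRenamed("COUNTRY", "Country")                # schema correction
      .withColumn("Gold", F.col("Gold").cast(IntegerType()))  # type casting
      .filter(F.col("Country").isNotNull())                   # filtering
)

# Write the result as headered CSV into the Transform Layer for Synapse.
clean.write.mode("overwrite").option("header", "true").csv(transform_path)
```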
**Step 3: Serving & Analysis (Azure Synapse Analytics)**
- Read the transformed CSV files from ADLS Gen2 into Lake Databases in Synapse.
- Create external tables over these files.
- Execute analytical SQL scripts (example: medal analysis, country-wise performance, event statistics); a T-SQL sketch follows.
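As a minimal T-SQL sketch, a serverless query over the Transform Layer CSVs might look like the following; the storage path and the `Country`/`Gold` columns are hypothetical placeholders standing in for the medal analysis described above.

```sql
-- Minimal sketch: medal analysis over the Transform Layer CSVs with
-- Synapse serverless SQL. Path and column names are hypothetical.
SELECT TOP 10
    [Country],
    SUM([Gold]) AS TotalGold
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/transform/athletes/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',   -- parser 2.0 is required for HEADER_ROW
    HEADER_ROW = TRUE
) AS medals
GROUP BY [Country]
ORDER BY TotalGold DESC;
```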
**Key Highlights**
- End-to-end Azure-native pipeline without any on-prem components.
- Demonstrates multi-layer storage design (Staging & Transform layers).
- Uses ADF for ingestion, Databricks for ETL, and Synapse for analytics.
- Shows best practices like Parquet for intermediate storage and CSV for transformed outputs.