2024 Paris Olympic Analysis | Azure End-To-End Data Engineering Project

This project provides a data engineering and analytical journey on the Paris Olympic dataset. Starting with a CSV on GitHub, the data is ingested into the Azure ecosystem via Azure Data Factory. It's initially stored in Azure Data Lake Storage Gen2, then transformed in Azure Databricks. The enriched data, once again housed in ADLS Gen2, undergoes advanced analytics in Azure Synapse. The insights are finally visualized in Azure Synapse or Power BI, offering a comprehensive view of the dataset.

Datasets Utilized

This dataset contains the details of the participating Athletes, Coaches, and team, Entries by gender. It contains their names, countries represented, discipline, gender of competitors, name of the coaches.

This dataset contains the details of over 10,500 athletes, with 48 disciplines, along with 206 Teams taking part in the 2024 Paris Olympics.

Structure

The process begins with the ingestion of raw data from a CSV file stored on GitHub, using Azure Data Factory to bring it into the Azure ecosystem. The data is initially stored in Azure Data Lake Storage Gen2 before being transformed and enriched in Azure Databricks. After processing, the enhanced data is stored back in ADLS Gen2, where advanced analytics are performed in Azure Synapse Analytics. Finally, the insights derived from these analyses are visualized in Azure Synapse or Power BI, providing a comprehensive and actionable view of the dataset.

Azure Services Used

Azure Data Factory: For data ingestion from GitHub.
Azure Data Lake Storage Gen2: As the primary data storage solution.
Azure Databricks: For data transformation tasks.
Azure Synapse Analytics: To perform in-depth data analytics.

Initial Setup

created Azure Free Subscription account
created a Resource Group 'Paris-Olympic' to house and manage all the Azure resources associated with this project.
Within the created resource group, set up a storage account. This is specifically configured to leverage Azure Data Lake Storage(ADLS) Gen2 capabilities.
Created a Container 'paris-olympic-data-2024' inside this storage account to hold the project's data. Two directories 'raw-data' and 'transformed-data' are created to store raw data and transformed data.

Data Ingestion

Begin by creating an Azure Data Factory workspace within the previously established resource group.
After setting up the workspace, launch the Azure Data Factory Studio.
Upload the Paris Olympics dataset from Kaggle to GitHub. Within the studio, initialize a new data integration pipeline.
Now use the task Copy Data to move data efficiently between various supported sources and destinations.
Configuring the Data Source with an HTTP template as we are using HTTP requests to get the data from the GitHub repo.
Establishing the Linked Service for source.
Configuring the File Format for and setting up the Linked Service Sink.
Repeat the above steps to load all the datasets.
You can connect all the copy data activity and run them all at once.
After the pipeline completes its execution, navigate to your Azure Data Lake Storage Gen2. Dive into the "raw_data" folder and validate that the files, like "athletes.csv", "medals.csv", etc., are present and populated with the expected data.

Please refer to the notebook below for transformations and code mounting ADLS Gen2 to Databricks. Paris-Olympic-Data-Transformation

Data Transformation using Azure Databricks

Navigate to Azure Databricks within the Azure portal and create a workspace within the previously established resource group and launch it.
Configuring Compute in Databricks.
Create a new notebook within Databricks and rename it appropriately, reflecting its purpose or the dataset it pertains to.
Establishing a Connection to Azure Data Lake Storage (ADLS)
Using the credentials (Client ID, Tenant ID, Secret), write the appropriate code in the Databricks notebook to mount ADLS.
Writing Data Transformations mount ADLS Gen2 to Databricks.
Writing Transformed Data to ADLS Gen2.

Refer to the information below for the transformations and code used to mount ADLS Gen2 to Databricks. Paris-Olympic-Data-Transformation

Setting Up and Using Azure Synapse Analytics

Creating a Synapse Analytics Workspace.
Within Workspace navigate to the "Data" section, choose "Lake Database" and created a Database called "ParisOlympicDB"
Create a Table from the Data Lake from the Transformed Data folder within your ADLS Gen2 storage.

Performing Data Analysis on the Data

Refer to the SQL scripts used for data analysis Paris OlympicSQL script

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
Images		Images
data		data
Paris OlympicSQL script.sql		Paris OlympicSQL script.sql
Paris-Olympic-Data-Transformation.ipynb		Paris-Olympic-Data-Transformation.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

2024 Paris Olympic Analysis | Azure End-To-End Data Engineering Project

Datasets Utilized

Structure

Azure Services Used

Initial Setup

Data Ingestion

Data Transformation using Azure Databricks

Setting Up and Using Azure Synapse Analytics

Performing Data Analysis on the Data

About

Releases

Packages

Languages

SonuRepo/paris_olympic_azure_project

Folders and files

Latest commit

History

Repository files navigation

2024 Paris Olympic Analysis | Azure End-To-End Data Engineering Project

Datasets Utilized

Structure

Azure Services Used

Initial Setup

Data Ingestion

Data Transformation using Azure Databricks

Setting Up and Using Azure Synapse Analytics

Performing Data Analysis on the Data

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages