End-to-end Formula 1 data pipeline using the OpenF1 API and AWS (S3, Glue, Athena) to fetch, transform, and analyze driver performance.

monikakrajnc/openf1-aws-pipeline

OpenF1 Data Engineering Project with AWS

This repository contains a data engineering project built using the OpenF1 API and AWS services: S3, Glue, and Athena. The goal is to extract Formula 1 data, transform and store it efficiently, and analyze it using SQL queries.


🚀 Project Goals

  • Extract Formula 1 session data from the OpenF1 API (2024 season)
  • Build an automated data pipeline using AWS services
  • Analyze driver performance across the season

🗂️ Datasets Used

The following CSV datasets were downloaded from OpenF1:

  • drivers.csv – Driver metadata per session
  • sessions.csv – Information for each F1 session
  • position.csv – Driver position changes during sessions

All data is from the 2024 season.
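The extraction step can be sketched as below. The endpoint names (`/sessions`, `/drivers`, `/position`) follow the public OpenF1 API, but the exact query parameters and response fields used here are illustrative rather than taken from this repo's code:

```python
# Sketch of the extraction step: fetch OpenF1 JSON and save it as CSV.
# Endpoint names follow the public OpenF1 API; fields are illustrative.
import requests
import pandas as pd
from urllib.parse import urlencode

BASE_URL = "https://api.openf1.org/v1"

def openf1_url(endpoint: str, **params) -> str:
    """Build a query URL for an OpenF1 endpoint."""
    query = urlencode(params)
    return f"{BASE_URL}/{endpoint}?{query}" if query else f"{BASE_URL}/{endpoint}"

def fetch_csv(endpoint: str, out_path: str, **params) -> pd.DataFrame:
    """Fetch one endpoint as JSON and save it locally as CSV."""
    resp = requests.get(openf1_url(endpoint, **params), timeout=30)
    resp.raise_for_status()
    df = pd.DataFrame(resp.json())
    df.to_csv(out_path, index=False)
    return df

if __name__ == "__main__":
    # Pull all 2024 sessions, then driver and position data per session key.
    sessions = fetch_csv("sessions", "sessions.csv", year=2024)
    for key in sessions["session_key"].unique():
        fetch_csv("drivers", f"drivers_{key}.csv", session_key=key)
        fetch_csv("position", f"position_{key}.csv", session_key=key)
```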


🏗️ Architecture Overview

  1. OpenF1 API

    • Data fetched using Python (requests, pandas)
    • Saved locally as CSV files
  2. Amazon S3

    • sources/: Raw CSVs fetched from the OpenF1 API
    • datawarehouse/: Output of transformed data (Parquet)
  3. AWS Glue

    • Visual ETL job to join, clean, and convert data
    • IAM role (glue_access_s3) used to grant S3 access
  4. Glue Data Catalog

    • Database openf1 created
    • Crawler auto-generates schema from datawarehouse/
  5. AWS Athena

    • Connected to the Glue catalog
    • Queries executed using SQL
    • Output saved to athena-output-openf1 bucket
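The upload into the sources/ prefix can be sketched with boto3; the bucket name "openf1-pipeline" is a hypothetical stand-in, since the actual bucket name is not stated in this README:

```python
# Minimal sketch: push the raw CSVs to S3 under the sources/ prefix.
# The bucket name "openf1-pipeline" is hypothetical.
BUCKET = "openf1-pipeline"

def s3_key(filename: str, prefix: str = "sources/") -> str:
    """Object key for a file under a given S3 prefix."""
    return f"{prefix}{filename}"

def upload_raw_csvs(filenames, bucket=BUCKET):
    import boto3  # AWS SDK for Python; requires configured credentials
    s3 = boto3.client("s3")
    for name in filenames:
        s3.upload_file(name, bucket, s3_key(name))

if __name__ == "__main__":
    upload_raw_csvs(["drivers.csv", "sessions.csv", "position.csv"])
```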

⚙️ Setup Highlights

  • IAM user created with appropriate policies
  • S3 bucket structured with sources/ and datawarehouse/
  • AWS Glue job created using the visual editor (joins, cleaning, conversion)
  • Crawler registered schema in Glue Data Catalog
  • Athena queries run against openf1.datawarehouse
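For the glue_access_s3 role, the S3 portion of the policy could look roughly like the fragment below. The bucket name is a placeholder, and the real policy in this project may be broader or narrower:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::openf1-pipeline",
        "arn:aws:s3:::openf1-pipeline/*"
      ]
    }
  ]
}
```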

🧩 Workflow Summary

This project involved:

  • Fetching and saving session-level data from the OpenF1 API in CSV format
  • Uploading raw files to an Amazon S3 bucket under the sources/ path
  • Creating an AWS Glue ETL job to join and clean the datasets
  • Storing transformed Parquet files in the datawarehouse/ path in S3
  • Using a Glue Crawler to register the schema in the Glue Data Catalog
  • Querying driver performance trends using Amazon Athena
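The Athena step can be sketched with boto3. `start_query_execution` and `get_query_execution` are the standard Athena client calls, but the table and column names below (`session_type`, `driver_number`, `position`) are assumptions about the crawled schema, not taken from the repo:

```python
# Sketch: run one of the driver-performance queries through Athena.
import time

ATHENA_OUTPUT = "s3://athena-output-openf1/"  # result bucket from the setup above

QUERY = """
SELECT driver_number,
       AVG(position) AS avg_qualifying_position
FROM openf1.datawarehouse
WHERE session_type = 'Qualifying'
GROUP BY driver_number
ORDER BY avg_qualifying_position
"""  # column names are assumptions about the crawled schema

def run_query(sql=QUERY, database="openf1", output=ATHENA_OUTPUT):
    import boto3  # requires configured AWS credentials
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(2)
```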

📦 Output

All query results are saved as CSV files and stored locally in the results/ folder:

  • avg_position_qualifying.csv: Average qualifying position per driver
  • avg_position_race.csv: Average race finishing position per driver
  • positions_lost_gained.csv: Per-driver detail of positions gained or lost from qualifying to race
  • detailed_position_change.csv: Average positions gained or lost between qualifying and race
    (A positive value means the driver gained positions on average. A negative value means they lost positions.)
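To make that sign convention concrete, here is a toy pandas calculation on made-up rows (not real 2024 data): a driver who qualifies P2 and finishes P1 gains one position, while one who qualifies P5 and finishes P7 loses two.

```python
import pandas as pd

# Toy rows standing in for the joined dataset (driver, session type, position).
rows = pd.DataFrame({
    "driver":       ["VER", "VER", "HAM", "HAM"],
    "session_type": ["Qualifying", "Race", "Qualifying", "Race"],
    "position":     [2, 1, 5, 7],
})

# One row per driver, one column per session type.
wide = rows.pivot(index="driver", columns="session_type", values="position")

# Positive = positions gained from qualifying to race, negative = lost.
wide["positions_gained"] = wide["Qualifying"] - wide["Race"]
print(wide["positions_gained"].to_dict())
```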

📸 Screenshots

  • AWS Glue job diagram
  • IAM user permission policies
  • Athena query editor

Feel free to fork or clone this repo for your own Formula 1 data analysis!

📧 Questions? Reach out via GitHub Issues or Discussions.
