This repository contains a data engineering project built using the OpenF1 API and AWS services: S3, Glue, and Athena. The goal is to extract Formula 1 data, transform and store it efficiently, and analyze it using SQL queries.
- Extract Formula 1 session data from the OpenF1 API (2024 season)
- Build an automated data pipeline using AWS services
- Analyze driver performance throughout the season
The following CSV datasets were downloaded from OpenF1:
- `drivers.csv`: Driver metadata per session
- `sessions.csv`: Information for each F1 session
- `position.csv`: Driver position changes during sessions
All data is from the 2024 season.
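The download step could be sketched roughly as below. This is a minimal illustration, not the project's actual script: it uses only the Python standard library in place of `requests` and `pandas` so that it has no dependencies, and the base URL, the `year` query parameter, and the helper names `build_url`, `fetch_rows`, and `write_csv` are assumptions made for the example.

```python
import csv
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public OpenF1 REST API.
BASE_URL = "https://api.openf1.org/v1"


def build_url(endpoint, **params):
    """Build a query URL such as .../sessions?year=2024."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}/{endpoint}?{query}" if query else f"{BASE_URL}/{endpoint}"


def fetch_rows(endpoint, **params):
    """Download one endpoint and return its JSON payload (a list of dicts)."""
    with urllib.request.urlopen(build_url(endpoint, **params), timeout=30) as resp:
        return json.load(resp)


def write_csv(rows, out_file):
    """Write a list of dicts to an open text file as CSV with a header row."""
    # Collect every key that appears, preserving first-seen order.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Example (requires network access):
# with open("sessions.csv", "w", newline="") as f:
#     write_csv(fetch_rows("sessions", year=2024), f)
```

Keeping URL construction and CSV writing separate from the network call makes both easy to check offline.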
- OpenF1 API
  - Data fetched using Python (`requests`, `pandas`)
  - Saved locally as CSV files
- Amazon S3
  - `sources/`: Raw CSVs uploaded from the OpenF1 API
  - `datawarehouse/`: Output of transformed data (Parquet)
- AWS Glue
  - Visual ETL job to join, clean, and convert data
  - IAM role (`glue_access_s3`) used to grant S3 access
- Glue Data Catalog
  - Database `openf1` created
  - Crawler auto-generates schema from `datawarehouse/`
- Amazon Athena
  - Connected to the Glue catalog
  - Queries executed using SQL
  - Output saved to the `athena-output-openf1` bucket
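As one possible way to place the raw CSVs under `sources/`, the sketch below maps local files to S3 keys and uploads them with boto3's `upload_file`. The helper names and the bucket name in the usage comment are hypothetical; the name of the project's data bucket is not given here.

```python
from pathlib import Path

RAW_PREFIX = "sources/"  # prefix for raw CSVs, as listed above


def s3_key_for(local_path, prefix=RAW_PREFIX):
    """Map a local file such as data/drivers.csv to the key sources/drivers.csv."""
    return prefix + Path(local_path).name


def upload_raw_csvs(paths, bucket):
    """Upload each local CSV under the sources/ prefix of the given bucket."""
    import boto3  # imported lazily so s3_key_for works without AWS libraries

    s3 = boto3.client("s3")
    for path in paths:
        s3.upload_file(str(path), bucket, s3_key_for(path))


# Example (needs AWS credentials; the bucket name is a placeholder):
# upload_raw_csvs(["drivers.csv", "sessions.csv", "position.csv"], "my-openf1-bucket")
```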
- IAM user created with appropriate policies
- S3 bucket structured with `sources/` and `datawarehouse/`
- AWS Glue job created using the visual editor (joins, cleaning, conversion)
- Crawler registered schema in Glue Data Catalog
- Athena queries run against `openf1.datawarehouse`
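A query against `openf1.datawarehouse` could also be submitted from Python instead of the Athena console. The sketch below is an assumption-heavy illustration: the column names `full_name`, `position`, and `session_name` are guesses (the real schema is whatever the crawler registered), and it assumes one row per driver per session holding the final position. Only the table name and the output bucket come from the setup described above.

```python
def avg_race_position_sql(table="openf1.datawarehouse"):
    """Return SQL for the average race position per driver.

    Column names are assumed, not taken from the project's actual schema.
    """
    return (
        'SELECT full_name, AVG("position") AS avg_position\n'
        f"FROM {table}\n"
        "WHERE session_name = 'Race'\n"
        "GROUP BY full_name\n"
        "ORDER BY avg_position"
    )


def run_query(sql, output_bucket="athena-output-openf1"):
    """Submit the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the SQL helper works without AWS libraries

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": f"s3://{output_bucket}/"},
    )
    return response["QueryExecutionId"]


# Example (needs AWS credentials):
# execution_id = run_query(avg_race_position_sql())
```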
This project involved:
- Fetching and saving session-level data from the OpenF1 API in CSV format
- Uploading raw files to an Amazon S3 bucket under the `sources/` path
- Creating an AWS Glue ETL job to join and clean the datasets
- Storing transformed Parquet files in the `datawarehouse/` path in S3
- Using a Glue Crawler to register schema to the Glue Data Catalog
- Querying driver performance trends using Amazon Athena
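The Glue job was built in the visual editor, so there is no job script to show. As a rough plain-Python equivalent of the join-and-clean step, the sketch below attaches driver and session details to each position row. The column names (`session_key`, `driver_number`, `full_name`, `session_name`, `position`) are assumed from OpenF1's field naming, and the function name is hypothetical. Writing Parquet is left out because it needs an extra library.

```python
def join_datasets(positions, drivers, sessions):
    """Attach driver and session details to each position row.

    Rows are plain dicts (for example from csv.DictReader). Position rows
    with no matching driver or session are dropped, like an inner join.
    """
    driver_by_key = {(d["session_key"], d["driver_number"]): d for d in drivers}
    session_by_key = {s["session_key"]: s for s in sessions}

    joined = []
    for row in positions:
        driver = driver_by_key.get((row["session_key"], row["driver_number"]))
        session = session_by_key.get(row["session_key"])
        if driver is None or session is None:
            continue  # cleaning step: skip rows that cannot be matched
        joined.append({
            "session_key": row["session_key"],
            "session_name": session["session_name"],
            "driver_number": row["driver_number"],
            "full_name": driver["full_name"],
            "position": int(row["position"]),  # CSV values arrive as strings
        })
    return joined
```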
All query results are saved as CSV files and stored locally in the `results/` folder:
- `avg_position_qualifying.csv`: Average qualifying position per driver
- `avg_position_race.csv`: Average race finishing position per driver
- `positions_lost_gained.csv`: Detailed gain/loss positions from qualifying to race per driver
- `detailed_position_change.csv`: Average positions gained or lost between qualifying and race
(A positive value means the driver gained positions on average. A negative value means they lost positions.)
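The sign convention can be made concrete with a small sketch. The exact formula behind the result files is not stated here; the code below assumes the change is computed as qualifying position minus race position, which is consistent with positive meaning positions gained. The function name and sample values are made up for illustration.

```python
from collections import defaultdict


def average_positions_gained(results):
    """Average (qualifying position - race position) per driver.

    `results` is an iterable of (driver, qualifying_position, race_position).
    Starting 5th and finishing 2nd gives 5 - 2 = +3, i.e. three places gained.
    """
    deltas = defaultdict(list)
    for driver, qualifying, race in results:
        deltas[driver].append(qualifying - race)
    return {driver: sum(values) / len(values) for driver, values in deltas.items()}


# Made-up sample values, for illustration only.
sample = [
    ("Driver A", 5, 2),   # gained 3
    ("Driver A", 1, 4),   # lost 3
    ("Driver B", 10, 8),  # gained 2
]
print(average_positions_gained(sample))
# {'Driver A': 0.0, 'Driver B': 2.0}
```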
- AWS Glue job diagram
- IAM user permission policies
- Athena query editor
Feel free to fork or clone this repo for your own Formula 1 data analysis!
📧 Questions? Reach out via GitHub Issues or Discussions.