End-to-end Formula 1 data pipeline using the OpenF1 API and AWS (S3, Glue, Athena) to fetch, transform, and analyze driver performance.

monikakrajnc/openf1-aws-pipeline

OpenF1 Data Engineering Project with AWS

This repository contains a data engineering project built using the OpenF1 API and AWS services: S3, Glue, and Athena. The goal is to extract Formula 1 data, transform and store it efficiently, and analyze it using SQL queries.


🚀 Project Goals

  • Extract Formula 1 session data from the OpenF1 API (2024 season)
  • Build an automated data pipeline using AWS services
  • Analyze driver performance across the season

🗂️ Datasets Used

The following CSV datasets were downloaded from OpenF1:

  • drivers.csv – Driver metadata per session
  • sessions.csv – Information for each F1 session
  • position.csv – Driver position changes during sessions

All data is from the 2024 season.
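The extraction step can be sketched as below. The endpoint names (`/sessions`, `/drivers`, `/position`) follow the public OpenF1 API, but the exact query parameters and response fields used here are illustrative rather than taken from this repo's code:

```python
# Sketch of the extraction step: fetch OpenF1 JSON and save it as CSV.
# Endpoint names follow the public OpenF1 API; fields are illustrative.
import requests
import pandas as pd
from urllib.parse import urlencode

BASE_URL = "https://api.openf1.org/v1"

def openf1_url(endpoint: str, **params) -> str:
    """Build a query URL for an OpenF1 endpoint."""
    query = urlencode(params)
    return f"{BASE_URL}/{endpoint}?{query}" if query else f"{BASE_URL}/{endpoint}"

def fetch_csv(endpoint: str, out_path: str, **params) -> pd.DataFrame:
    """Fetch one endpoint as JSON and save it locally as CSV."""
    resp = requests.get(openf1_url(endpoint, **params), timeout=30)
    resp.raise_for_status()
    df = pd.DataFrame(resp.json())
    df.to_csv(out_path, index=False)
    return df

if __name__ == "__main__":
    # Pull all 2024 sessions, then driver and position data per session key.
    sessions = fetch_csv("sessions", "sessions.csv", year=2024)
    for key in sessions["session_key"].unique():
        fetch_csv("drivers", f"drivers_{key}.csv", session_key=key)
        fetch_csv("position", f"position_{key}.csv", session_key=key)
```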


🏗️ Architecture Overview

  1. OpenF1 API

    • Data fetched using Python (requests, pandas)
    • Saved locally as CSV files
  2. Amazon S3

    • sources/: Raw CSVs fetched from the OpenF1 API
    • datawarehouse/: Output of transformed data (Parquet)
  3. AWS Glue

    • Visual ETL job to join, clean, and convert data
    • IAM role (glue_access_s3) used to grant S3 access
  4. Glue Data Catalog

    • Database openf1 created
    • Crawler auto-generates schema from datawarehouse/
  5. AWS Athena

    • Connected to the Glue catalog
    • Queries executed using SQL
    • Output saved to athena-output-openf1 bucket
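The upload into the sources/ prefix can be sketched with boto3; the bucket name "openf1-pipeline" is a hypothetical stand-in, since the actual bucket name is not stated in this README:

```python
# Minimal sketch: push the raw CSVs to S3 under the sources/ prefix.
# The bucket name "openf1-pipeline" is hypothetical.
BUCKET = "openf1-pipeline"

def s3_key(filename: str, prefix: str = "sources/") -> str:
    """Object key for a file under a given S3 prefix."""
    return f"{prefix}{filename}"

def upload_raw_csvs(filenames, bucket=BUCKET):
    import boto3  # AWS SDK for Python; requires configured credentials
    s3 = boto3.client("s3")
    for name in filenames:
        s3.upload_file(name, bucket, s3_key(name))

if __name__ == "__main__":
    upload_raw_csvs(["drivers.csv", "sessions.csv", "position.csv"])
```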

⚙️ Setup Highlights

  • IAM user created with appropriate policies
  • S3 bucket structured with sources/ and datawarehouse/
  • AWS Glue job created using the visual editor (joins, cleaning, conversion)
  • Crawler registered schema in Glue Data Catalog
  • Athena queries run against openf1.datawarehouse
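For the glue_access_s3 role, the S3 portion of the policy could look roughly like the fragment below. The bucket name is a placeholder, and the real policy in this project may be broader or narrower:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::openf1-pipeline",
        "arn:aws:s3:::openf1-pipeline/*"
      ]
    }
  ]
}
```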

🧩 Workflow Summary

This project involved:

  • Fetching and saving session-level data from the OpenF1 API in CSV format
  • Uploading raw files to an Amazon S3 bucket under the sources/ path
  • Creating an AWS Glue ETL job to join and clean the datasets
  • Storing transformed Parquet files in the datawarehouse/ path in S3
  • Using a Glue Crawler to register the schema in the Glue Data Catalog
  • Querying driver performance trends using Amazon Athena
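The Athena step can be sketched with boto3. `start_query_execution` and `get_query_execution` are the standard Athena client calls, but the table and column names below (`session_type`, `driver_number`, `position`) are assumptions about the crawled schema, not taken from the repo:

```python
# Sketch: run one of the driver-performance queries through Athena.
import time

ATHENA_OUTPUT = "s3://athena-output-openf1/"  # result bucket from the setup above

QUERY = """
SELECT driver_number,
       AVG(position) AS avg_qualifying_position
FROM openf1.datawarehouse
WHERE session_type = 'Qualifying'
GROUP BY driver_number
ORDER BY avg_qualifying_position
"""  # column names are assumptions about the crawled schema

def run_query(sql=QUERY, database="openf1", output=ATHENA_OUTPUT):
    import boto3  # requires configured AWS credentials
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(2)
```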

📦 Output

All query results are saved as CSV files and stored locally in the results/ folder:

  • avg_position_qualifying.csv: Average qualifying position per driver
  • avg_position_race.csv: Average race finishing position per driver
  • positions_lost_gained.csv: Per-driver detail of positions gained or lost from qualifying to race
  • detailed_position_change.csv: Average positions gained or lost between qualifying and race
    (A positive value means the driver gained positions on average. A negative value means they lost positions.)
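To make that sign convention concrete, here is a toy pandas calculation on made-up rows (not real 2024 data): a driver who qualifies P2 and finishes P1 gains one position, while one who qualifies P5 and finishes P7 loses two.

```python
import pandas as pd

# Toy rows standing in for the joined dataset (driver, session type, position).
rows = pd.DataFrame({
    "driver":       ["VER", "VER", "HAM", "HAM"],
    "session_type": ["Qualifying", "Race", "Qualifying", "Race"],
    "position":     [2, 1, 5, 7],
})

# One row per driver, one column per session type.
wide = rows.pivot(index="driver", columns="session_type", values="position")

# Positive = positions gained from qualifying to race, negative = lost.
wide["positions_gained"] = wide["Qualifying"] - wide["Race"]
print(wide["positions_gained"].to_dict())
```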

📸 Screenshots

  • AWS Glue job diagram
  • IAM user permission policies
  • Athena query editor

Feel free to fork or clone this repo for your own Formula 1 data analysis!

📧 Questions? Reach out via GitHub Issues or Discussions.
