This repository contains a pipeline to extract activity data from the Strava API, process it, and load it into a PostgreSQL database. The pipeline is scheduled to run once a week using Apache Airflow, which is managed and executed within Docker containers. This setup ensures that any new activity data is efficiently captured, processed, and stored for further analysis.
**Data Extraction:** The data extraction step incrementally ingests Strava activity data using the Strava API. The process begins by querying the `last_extracted` table in the PostgreSQL database to retrieve the timestamp of the last successful extraction. Using the `requests` library, the pipeline then makes repeated calls to the Strava REST API to fetch all new activities between that timestamp and the current date and time. To comply with Strava's rate limit of 100 requests per 15 minutes, the pipeline pauses between requests with `time.sleep()`.
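
The exact implementation lives in the repository's source; a minimal sketch of this step might look like the following. The function names, `per_page` value, and sleep interval are illustrative assumptions, while the token and athlete-activities endpoints are Strava's standard ones.

```python
import time
import requests

AUTH_URL = "https://www.strava.com/oauth/token"
ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"

def refresh_access_token(client_id, client_secret, refresh_token):
    """Exchange the long-lived refresh token for a short-lived access token."""
    response = requests.post(AUTH_URL, data={
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    })
    response.raise_for_status()
    return response.json()["access_token"]

def fetch_new_activities(access_token, last_extracted_ts, now_ts):
    """Page through all activities recorded since the last extraction."""
    headers = {"Authorization": f"Bearer {access_token}"}
    activities, page = [], 1
    while True:
        response = requests.get(ACTIVITIES_URL, headers=headers, params={
            "after": last_extracted_ts,   # epoch seconds from the last_extracted table
            "before": now_ts,
            "page": page,
            "per_page": 200,
        })
        response.raise_for_status()
        batch = response.json()
        if not batch:                     # empty page -> nothing left to fetch
            break
        activities.extend(batch)
        page += 1
        time.sleep(10)                    # simple throttle to respect the rate limit
    return activities
```
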
**Data Transformation:** In the data transformation step, the raw activity data is prepared for loading into the PostgreSQL `activities` table. The raw data is first validated to ensure accuracy, including checking that all IDs are unique and correctly matched. After validation, the data undergoes several transformations: geographic coordinates are converted to JSON format for easier handling, dates are formatted, and column names are updated to be more descriptive and consistent. These transformations make the data cleaner and better suited for storage and analysis.
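
For illustration, the transformation could be expressed with pandas roughly as below; the validation rule and column names are assumptions based on the description above, not the repository's exact schema.

```python
import json
import pandas as pd

def transform_activities(raw_activities: list[dict]) -> pd.DataFrame:
    """Validate and reshape raw Strava activities for the activities table."""
    df = pd.DataFrame(raw_activities)

    # Validation: every activity ID must be present and unique.
    if df["id"].isna().any() or df["id"].duplicated().any():
        raise ValueError("Activity IDs must be unique and non-null")

    # Geographic coordinates -> JSON strings for easier storage.
    df["start_coordinates"] = df["start_latlng"].apply(json.dumps)
    df["end_coordinates"] = df["end_latlng"].apply(json.dumps)

    # Dates -> proper timestamps.
    df["start_date"] = pd.to_datetime(df["start_date"])

    # More descriptive, consistent column names (illustrative renames).
    df = df.rename(columns={"id": "activity_id",
                            "distance": "distance_m",
                            "moving_time": "moving_time_s"})
    return df
```
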
**Data Load:** In the data load step, the transformed activity data is inserted into the PostgreSQL `activities` table. This process starts by connecting to the PostgreSQL database and executing an SQL `INSERT` statement with the cleaned and formatted data. Additionally, the current date is recorded in the `last_extracted` table to mark the completion of the latest data extraction. This update tracks the most recent successful run and ensures that future extractions only retrieve new data since this timestamp.
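
A hedged sketch of the load step using psycopg2; the column lists for both tables are assumptions for illustration only.

```python
from datetime import datetime, timezone
import psycopg2
from psycopg2.extras import execute_values

def load_activities(conn_params: dict, rows: list[tuple]) -> None:
    """Insert transformed activities and record the extraction timestamp."""
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            # Bulk-insert the cleaned activity rows (column names are assumed).
            execute_values(
                cur,
                """INSERT INTO activities
                       (activity_id, name, sport_type, distance_m, start_date)
                   VALUES %s
                   ON CONFLICT (activity_id) DO NOTHING""",
                rows,
            )
            # Mark this run as the latest successful extraction.
            cur.execute(
                "INSERT INTO last_extracted (extracted_at) VALUES (%s)",
                (datetime.now(timezone.utc),),
            )
    # The connection context manager commits on success and rolls back on error.
```
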
**Clone the repository:**

```bash
git clone https://github.com/joanabaiao/strava-data-pipeline.git
cd strava-data-pipeline
```
**Create configuration files:**

- Obtain your Strava API access tokens by creating an application on the Strava Developers website.
- Create a configuration file `config/config.yaml` with your Strava client ID, client secret, and refresh token, and include your PostgreSQL setup in the same file:

  ```yaml
  strava:
    client_id: "xxxxxxx"
    client_secret: "xxxxxxx"
    refresh_token: "xxxxxxx"
    base_url: "https://www.strava.com/api/v3/"
    auth_url: "https://www.strava.com/oauth/token"
    activities_url: "https://www.strava.com/api/v3/activities"

  postgresql:
    host: "localhost"
    user: "xxxxxxx"
    password: "xxxxxxx"
    port: "5432"
    database: "xxxxxxx"
  ```
**Set up Docker and Docker Compose:**

- Ensure Docker and Docker Compose are installed on your machine. Follow the official Docker installation guide for installation instructions.
**Docker Configuration:**

- Update the `docker-compose.yml` file to configure environment variables and volume mounts for your Airflow setup.
**Run Airflow in Docker:**

- Follow the official Running Airflow in Docker guide:

  ```bash
  mkdir -p ./dags ./logs ./plugins ./config
  echo -e "AIRFLOW_UID=$(id -u)" > .env
  docker compose up airflow-init
  docker compose up
  ```
**Access the Airflow Web Interface:**

- Open your web browser and navigate to `http://localhost:8080` to access the Airflow web interface.
**Trigger the DAG:**

- From the Airflow web interface, you can manually trigger the `strava_pipeline_DAG` DAG or let it run according to the schedule defined in the DAG configuration.
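
For reference, a weekly DAG wiring the three steps together might look roughly like this; the task IDs and placeholder callables are illustrative, not the repository's actual definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholders standing in for the pipeline's real extract/transform/load functions.
def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="strava_pipeline_DAG",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",  # matches the weekly cadence described above
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract_strava_data", python_callable=extract)
    transform_task = PythonOperator(task_id="transform_strava_data", python_callable=transform)
    load_task = PythonOperator(task_id="load_strava_data", python_callable=load)

    extract_task >> transform_task >> load_task
```
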
To run the pipeline manually without using Airflow, execute the following command:

```bash
python main.py
```
In summary, I've created a data pipeline that automatically extracts, transforms and loads my Strava data every week.
As the next step, I plan to build an interactive dashboard to further analyze my activity data.
I hope you found this project interesting! Your feedback is highly appreciated and will help guide future improvements :)