This repository contains a pipeline to extract activity data from the Strava API, process it, and load it into a PostgreSQL database. The pipeline is scheduled to run once a week using Apache Airflow, which is managed and executed within Docker containers. This setup ensures that any new activity data is efficiently captured, processed, and stored for further analysis.
**Data Extraction:** The data extraction step incrementally ingests Strava activity data using the Strava API. The process begins by querying the `last_extracted` table in the PostgreSQL database to retrieve the timestamp of the last successful extraction. Using the `requests` library, the pipeline then makes repeated calls to the Strava REST API to fetch all new activities between that timestamp and the current date and time. To comply with Strava's rate limit of 100 requests per 15 minutes, the pipeline pauses between requests with `time.sleep()`.
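
The exact implementation lives in the repository's source; a minimal sketch of this step might look like the following. The function names, `per_page` value, and sleep interval are illustrative assumptions, while the token and athlete-activities endpoints are Strava's standard ones.

```python
import time
import requests

AUTH_URL = "https://www.strava.com/oauth/token"
ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"

def refresh_access_token(client_id, client_secret, refresh_token):
    """Exchange the long-lived refresh token for a short-lived access token."""
    response = requests.post(AUTH_URL, data={
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    })
    response.raise_for_status()
    return response.json()["access_token"]

def fetch_new_activities(access_token, last_extracted_ts, now_ts):
    """Page through all activities recorded since the last extraction."""
    headers = {"Authorization": f"Bearer {access_token}"}
    activities, page = [], 1
    while True:
        response = requests.get(ACTIVITIES_URL, headers=headers, params={
            "after": last_extracted_ts,   # epoch seconds from the last_extracted table
            "before": now_ts,
            "page": page,
            "per_page": 200,
        })
        response.raise_for_status()
        batch = response.json()
        if not batch:                     # empty page -> nothing left to fetch
            break
        activities.extend(batch)
        page += 1
        time.sleep(10)                    # simple throttle to respect the rate limit
    return activities
```
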
**Data Transformation:** In the data transformation step, the raw activity data is prepared for loading into the PostgreSQL `activities` table. The raw data is first validated to ensure accuracy, including checking that all IDs are unique and correctly matched. After validation, the data undergoes several transformations: geographic coordinates are converted to JSON format for easier handling, dates are formatted, and column names are updated to be more descriptive and consistent. These transformations make the data cleaner and better suited for storage and analysis.
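
For illustration, the transformation could be expressed with pandas roughly as below; the validation rule and column names are assumptions based on the description above, not the repository's exact schema.

```python
import json
import pandas as pd

def transform_activities(raw_activities: list[dict]) -> pd.DataFrame:
    """Validate and reshape raw Strava activities for the activities table."""
    df = pd.DataFrame(raw_activities)

    # Validation: every activity ID must be present and unique.
    if df["id"].isna().any() or df["id"].duplicated().any():
        raise ValueError("Activity IDs must be unique and non-null")

    # Geographic coordinates -> JSON strings for easier storage.
    df["start_coordinates"] = df["start_latlng"].apply(json.dumps)
    df["end_coordinates"] = df["end_latlng"].apply(json.dumps)

    # Dates -> proper timestamps.
    df["start_date"] = pd.to_datetime(df["start_date"])

    # More descriptive, consistent column names (illustrative renames).
    df = df.rename(columns={"id": "activity_id",
                            "distance": "distance_m",
                            "moving_time": "moving_time_s"})
    return df
```
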
**Data Load:** In the data load step, the transformed activity data is inserted into the PostgreSQL `activities` table. This process starts by connecting to the PostgreSQL database and executing an SQL `INSERT` statement with the cleaned and formatted data. Additionally, the current date is recorded in the `last_extracted` table to mark the completion of the latest data extraction. This update tracks the most recent successful run and ensures that future extractions only retrieve new data since this timestamp.
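
A hedged sketch of the load step using psycopg2; the column lists for both tables are assumptions for illustration only.

```python
from datetime import datetime, timezone
import psycopg2
from psycopg2.extras import execute_values

def load_activities(conn_params: dict, rows: list[tuple]) -> None:
    """Insert transformed activities and record the extraction timestamp."""
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            # Bulk-insert the cleaned activity rows (column names are assumed).
            execute_values(
                cur,
                """INSERT INTO activities
                       (activity_id, name, sport_type, distance_m, start_date)
                   VALUES %s
                   ON CONFLICT (activity_id) DO NOTHING""",
                rows,
            )
            # Mark this run as the latest successful extraction.
            cur.execute(
                "INSERT INTO last_extracted (extracted_at) VALUES (%s)",
                (datetime.now(timezone.utc),),
            )
    # The connection context manager commits on success and rolls back on error.
```
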
**Clone the repository:**

```bash
git clone https://github.com/joanabaiao/strava-data-pipeline.git
cd strava-data-pipeline
```
**Create configuration files:**

- Obtain your Strava API access tokens by creating an application on the Strava Developers website.
- Create a configuration file `config/config.yaml` with your Strava client ID, client secret, and refresh token, and include your PostgreSQL setup in the same file:

  ```yaml
  strava:
    client_id: "xxxxxxx"
    client_secret: "xxxxxxx"
    refresh_token: "xxxxxxx"
    base_url: "https://www.strava.com/api/v3/"
    auth_url: "https://www.strava.com/oauth/token"
    activities_url: "https://www.strava.com/api/v3/activities"

  postgresql:
    host: "localhost"
    user: "xxxxxxx"
    password: "xxxxxxx"
    port: "5432"
    database: "xxxxxxx"
  ```
**Set up Docker and Docker Compose:**

- Ensure Docker and Docker Compose are installed on your machine. Follow the official Docker installation guide for installation instructions.
**Docker Configuration:**

- Update the `docker-compose.yml` file to configure environment variables and volume mounts for your Airflow setup.
**Run Airflow in Docker:**

- Follow the official Running Airflow in Docker guide:

  ```bash
  mkdir -p ./dags ./logs ./plugins ./config
  echo -e "AIRFLOW_UID=$(id -u)" > .env
  docker compose up airflow-init
  docker compose up
  ```
**Access the Airflow Web Interface:**

- Open your web browser and navigate to `http://localhost:8080` to access the Airflow web interface.
**Trigger the DAG:**

- From the Airflow web interface, you can manually trigger the `strava_pipeline_DAG` DAG or let it run according to the schedule defined in the DAG configuration.
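
For reference, a weekly DAG wiring the three steps together might look roughly like this; the task IDs and placeholder callables are illustrative, not the repository's actual definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholders standing in for the pipeline's real extract/transform/load functions.
def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="strava_pipeline_DAG",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",  # matches the weekly cadence described above
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract_strava_data", python_callable=extract)
    transform_task = PythonOperator(task_id="transform_strava_data", python_callable=transform)
    load_task = PythonOperator(task_id="load_strava_data", python_callable=load)

    extract_task >> transform_task >> load_task
```
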
To run the pipeline manually without using Airflow, execute the following command:

```bash
python main.py
```
In summary, I've created a data pipeline that automatically extracts, transforms and loads my Strava data every week.
As the next step, I plan to build an interactive dashboard to further analyze my activity data.
I hope you found this project interesting! Your feedback is highly appreciated and will help guide future improvements :)