The original data for this project is stored in two main files:
-
data/workouts/workout_YYYY-MM-DD.csv: Theworkoutsdirectory contains workout data files. Each file contains the workout data for a specific week. The data includes the date, workout type, exercise, muscle group, set number, number of repetitions, and weight lifted. (Source: my own workout data) -
data/kaggle/megaGymDataset.csv: This file contains a comprehensive list of exercises with their descriptions, types, targeted body parts, equipment used, difficulty level, and ratings. (Source: Kaggle)
The data pipeline for this project consists of several stages:
-
Data Collection: The
data_collection.pyscript collects workout data from the original data files. It includes functions to collect workout data, fetch exercise data, filter exercise data, enrich workout data, and aggregate workout data. The collected data is saved in thedata/currentdirectory. -
Data Preprocessing: The
models_training.pyscript preprocesses the collected data to prepare it for model training. -
Model Training & Versioning: The
models_training.pyscript also trains models for each exercise and saves them in themodelsdirectory. The models are versioned by saving them in acurrentsubdirectory and archiving old models in aversionssubdirectory (unversioned for size purposes). It also plots the loss of the models and saves the plots in thelosssubdirectory. -
Data Analysis: The
data_analytics.pyscript performs data analysis on the workout data and model predictions. It includes functions to plot predicted volume, plot distribution of muscle groups, plot distribution of workout types, and plot weight and repetitions over time. The plots are saved as HTML files in thestatic/plotsdirectory.
All these steps are defined in the jobs.py script, which sets up the data pipeline stage.
The model used in this project is a Long Short-Term Memory (LSTM) model, which is a type of Recurrent Neural Network (RNN). LSTM models are particularly good at processing sequences of data, making them well-suited for time-series data like our workout data.
The model takes as input a sequence of workout data and outputs a prediction for the volume of the next workouts. The volume is calculated as the product of the number of sets, the number of repetitions per set, and the weight lifted.
The model's performance is evaluated using the Mean Squared Error (MSE) loss function.
The project includes a web app that allows users to view workout data, exercise data, and analytics. The app is defined in the app.py script and uses Flask as the web framework.
/: The home page./workouts: The workouts page contains a table with the workout data./my-exercises: The exercises page contains a table with the filtered exercise data./analytics: The analytics page contains plots of the workout data and model predictions.
The data loading process for the app is defined in the data_loading.py script. It includes functions to load workout data and load filtered exercise data.
Logging is configured in the logger_config.py script. Each script in the project has its own logger, which is configured to log messages to a file in the logs directory. The log messages include the timestamp, log level, and the name of the script that generated the log message. The log messages are also printed to the console.
To run the project, you can use the run.py script. This script runs the app and the data pipeline in sequence. The data pipeline is scheduled to run every monday at 12:00 AM.
python src/run.pyThe project includes several types of tests:
-
Unit Tests: The
test_data_collection.pyandtest_data_loading.pyscripts include unit tests for some data collection and data loading functionalities respectively. -
Integration Tests: The
test_app.pyscript includes integration tests for the app and for the data ingestion and model training jobs of the data pipeline. -
End-to-End Test: The
test_end_to_end.pyscript includes an end-to-end test that tests the entire app and data pipeline.
The project uses Docker for deployment. The Dockerfile defines the Docker image for the project, which uses Python 3.10 as the base image and runs the src/run.py script when the container is launched.
The project also uses GitHub Actions for automation and CI/CD. The ci-cd.yml workflow defines the CI/CD pipeline, which runs the tests, builds the Docker image, versions it and pushes it to Docker Hub. Finally, it creates a pull request to merge the changes to the main branch if the pipeline passes. The automerge.yml workflow, which defines the automerge action, automatically merges pull requests that pass the CI/CD pipeline.