This project implements an end-to-end Machine Learning pipeline for predicting car prices, orchestrated using Apache Airflow. The pipeline consists of two main stages: model training and prediction generation. This setup demonstrates a basic MLOps workflow where data processing, model training, and inference are automated and scheduled.
Developed as part of a "Data Science Introduction" course assignment.
```
airflow_hw/
├── dags/
│   └── hw_dag.py             # Airflow DAG for orchestrating the ML pipeline.
├── modules/
│   ├── __init__.py
│   ├── pipeline.py           # Python module for data preprocessing and model training.
│   └── predict.py            # Python module for loading the model and generating predictions.
└── data/
    ├── train/
    │   └── homework.csv      # Training dataset for the model.
    ├── test/
    │   └── *.json            # Test data for making predictions (each entry is a separate JSON file).
    ├── model/
    │   └── model.pkl         # Trained machine learning model (generated by pipeline.py).
    └── predictions/
        └── predictions.csv   # Output file containing car price predictions (generated by predict.py).
```
The `modules/pipeline.py` script handles data preparation and model training.
- Loads the training data from `data/train/homework.csv`.
- Performs the necessary data preprocessing steps, such as:
  - Handling missing values.
  - One-Hot Encoding for categorical features.
  - Feature scaling for numerical features.
- Trains a machine learning model (e.g., `RandomForestRegressor`).
- Saves the trained model to `data/model/model.pkl` for later use (a minimal sketch of this module follows the list).
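For orientation only, here is a minimal, hypothetical sketch of what `modules/pipeline.py` could look like. The target column name (`price`), the use of `pickle` for serialization, and the specific scikit-learn preprocessing choices are assumptions, not the project's confirmed implementation:

```python
# Hypothetical sketch of modules/pipeline.py; the 'price' target column and
# pickle-based serialization are assumptions.
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

path = '/opt/airflow/airflow_hw'  # must match the container path mounted via docker-compose


def pipeline() -> None:
    df = pd.read_csv(f'{path}/data/train/homework.csv')
    X, y = df.drop(columns=['price']), df['price']  # 'price' as target is an assumption

    numerical = X.select_dtypes(include=['int64', 'float64']).columns
    categorical = X.select_dtypes(include=['object']).columns

    # Impute + scale numeric columns; impute + one-hot encode categorical columns.
    preprocessor = ColumnTransformer(transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())]), numerical),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                          ('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical),
    ])

    model = Pipeline([('preprocessor', preprocessor),
                      ('regressor', RandomForestRegressor())])
    model.fit(X, y)

    with open(f'{path}/data/model/model.pkl', 'wb') as f:
        pickle.dump(model, f)


if __name__ == '__main__':
    pipeline()
```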
The `modules/predict.py` script is responsible for making predictions with the trained model.
- Loads the pre-trained model from `data/model/model.pkl`.
- Reads new car data from individual JSON files in the `data/test/` directory.
- Generates a price prediction for each car in the test dataset.
- Saves all predictions into a single CSV file at `data/predictions/predictions.csv` (a sketch of this module follows the list).
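Again as an illustration only, a minimal sketch of `modules/predict.py` under the same assumptions; the `id` field used to label output rows and the output column names are hypothetical:

```python
# Hypothetical sketch of modules/predict.py; the 'id' field and the output column
# names are assumptions, not the project's confirmed format.
import json
import pickle
from pathlib import Path

import pandas as pd

path = '/opt/airflow/airflow_hw'


def predict() -> None:
    with open(f'{path}/data/model/model.pkl', 'rb') as f:
        model = pickle.load(f)

    rows = []
    for json_file in sorted(Path(f'{path}/data/test').glob('*.json')):
        with open(json_file) as f:
            car = json.load(f)  # one car record per JSON file
        price = model.predict(pd.DataFrame([car]))[0]
        rows.append({'car_id': car.get('id', json_file.stem), 'predicted_price': price})

    pd.DataFrame(rows).to_csv(f'{path}/data/predictions/predictions.csv', index=False)


if __name__ == '__main__':
    predict()
```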
The `dags/hw_dag.py` Airflow DAG orchestrates the entire ML pipeline.
- Defines two main tasks:
  - `pipeline_task`: executes the `pipeline.py` script.
  - `predict_task`: executes the `predict.py` script.
- Establishes the dependency `pipeline_task >> predict_task`, ensuring the model is trained before predictions are attempted.
- Configured for daily execution at 15:00 UTC, but can also be triggered manually from the Airflow UI (a sketch of the DAG follows the list).
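A minimal sketch of what `dags/hw_dag.py` could look like, assuming the two tasks are implemented as `PythonOperator`s calling the module functions directly; the start date, default args, and operator choice are assumptions, while the `dag_id` and the 15:00 UTC schedule follow the description above:

```python
# Hypothetical sketch of dags/hw_dag.py; dag_id and the daily 15:00 UTC schedule
# follow the README, everything else is an assumption.
import datetime as dt
import sys

from airflow.models import DAG
from airflow.operators.python import PythonOperator

path = '/opt/airflow/airflow_hw'  # must match the volume mount inside the containers
sys.path.insert(0, path)          # make modules/ importable by the scheduler and workers

from modules.pipeline import pipeline
from modules.predict import predict

args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2024, 1, 1),
    'retries': 1,
}

with DAG(
    dag_id='car_price_prediction',
    schedule_interval='0 15 * * *',  # daily at 15:00 UTC
    default_args=args,
    catchup=False,
) as dag:
    pipeline_task = PythonOperator(task_id='pipeline', python_callable=pipeline)
    predict_task = PythonOperator(task_id='predict', python_callable=predict)

    pipeline_task >> predict_task
```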
To set up and run this project, you will need Docker and Docker Compose installed on your system.
- Docker: Install Docker
- Docker Compose: Install Docker Compose
- Clone the repository:

  ```bash
  git clone https://github.com/maxdavinci2022/airflow_project.git
  cd airflow_project/airflow_hw
  ```

  Note: ensure you are in the `airflow_hw` directory, which is the root of this project's code.
- Configure Docker Compose for Airflow: your Airflow environment (usually defined by a `docker-compose.yaml` file) needs a volume mount that makes this `airflow_hw` project directory accessible inside the Airflow containers. Locate your `docker-compose.yaml` file (typically in the root of your Airflow environment setup, one level above `airflow_hw`) and add the following volume mount under the `volumes:` section for services such as `airflow-webserver`, `airflow-scheduler`, and `airflow-worker`:

  ```yaml
  services:
    airflow-webserver:
      volumes:
        # LOCAL_PATH_TO_AIRFLOW_HW should be the absolute path on your host machine.
        # Example for Windows:     - D:/Projects/_Skillbox/ds-intro/33_airflow/airflow_hw:/opt/airflow/airflow_hw
        # Example for Linux/macOS:  - ./airflow_hw:/opt/airflow/airflow_hw
        - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
        # ... existing DAGs/logs/plugins mounts ...
    airflow-scheduler:
      volumes:
        - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
        # ...
    airflow-worker:
      volumes:
        - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
        # ...
  ```

  Ensure that the `path` variable in `dags/hw_dag.py` matches the internal container path: `path = '/opt/airflow/airflow_hw'`.
- Place the DAG file: copy `hw_dag.py` into your Airflow `dags/` folder, i.e. the folder that your `docker-compose.yaml` typically mounts to `/opt/airflow/dags` inside the container. Example: `[Your Airflow Root]/dags/hw_dag.py`
- Start Apache Airflow: navigate to the directory where your `docker-compose.yaml` is located (the root of your Airflow environment setup) and run:

  ```bash
  docker-compose up -d
  ```
- Access the Airflow UI: open your web browser and navigate to the Airflow UI (usually `http://localhost:8080`).
- Enable the DAG: locate the DAG with `dag_id='car_price_prediction'` in the DAGs list and toggle it "On".
- Trigger the DAG: click the "Play" icon (Trigger DAG) to initiate a manual run.
- Monitor execution: observe the task execution in the Airflow UI and check the task logs for any errors.
Upon successful completion of the DAG, a `predictions.csv` file will be generated in the `airflow_hw/data/predictions/` directory on your local machine. This file contains the predicted car prices for the test data.
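As a quick sanity check, you can inspect the output with pandas; the column names depend on what `predict.py` actually writes, so those in the sketch above are assumptions:

```python
# Peek at the generated predictions (run from the directory containing airflow_hw/).
import pandas as pd

preds = pd.read_csv('airflow_hw/data/predictions/predictions.csv')
print(preds.head())
```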
Feel free to open issues or submit pull requests if you have suggestions for improvements or find any bugs.
- [maxdavinci](https://github.com/maxdavinci2022)
License: not yet specified. If a license is chosen (e.g., MIT), details will be provided in a `LICENSE.md` file.