This project implements an end-to-end Machine Learning pipeline for predicting car prices, orchestrated using Apache Airflow. The pipeline consists of two main stages: model training and prediction generation. This setup demonstrates a basic MLOps workflow where data processing, model training, and inference are automated and scheduled.
Developed as part of a "Data Science Introduction" course assignment.
```
airflow_hw/
├── dags/
│   └── hw_dag.py             # Airflow DAG for orchestrating the ML pipeline.
├── modules/
│   ├── __init__.py
│   ├── pipeline.py           # Python module for data preprocessing and model training.
│   └── predict.py            # Python module for loading the model and generating predictions.
└── data/
    ├── train/
    │   └── homework.csv      # Training dataset for the model.
    ├── test/
    │   └── *.json            # Test data for making predictions (each entry is a separate JSON file).
    ├── model/
    │   └── model.pkl         # Trained machine learning model (generated by pipeline.py).
    └── predictions/
        └── predictions.csv   # Output file containing car price predictions (generated by predict.py).
```
The `modules/pipeline.py` script handles data preparation and model training.
- Loads the training data from `data/train/homework.csv`.
- Performs the necessary data preprocessing steps, such as:
  - Handling missing values.
  - One-Hot Encoding for categorical features.
  - Feature scaling for numerical features.
- Trains a machine learning model (e.g., `RandomForestRegressor`).
- Saves the trained model to `data/model/model.pkl` for later use (a minimal sketch of this module follows the list).
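For orientation only, here is a minimal, hypothetical sketch of what `modules/pipeline.py` could look like. The target column name (`price`), the use of `pickle` for serialization, and the specific scikit-learn preprocessing choices are assumptions, not the project's confirmed implementation:

```python
# Hypothetical sketch of modules/pipeline.py; the 'price' target column and
# pickle-based serialization are assumptions.
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

path = '/opt/airflow/airflow_hw'  # must match the container path mounted via docker-compose


def pipeline() -> None:
    df = pd.read_csv(f'{path}/data/train/homework.csv')
    X, y = df.drop(columns=['price']), df['price']  # 'price' as target is an assumption

    numerical = X.select_dtypes(include=['int64', 'float64']).columns
    categorical = X.select_dtypes(include=['object']).columns

    # Impute + scale numeric columns; impute + one-hot encode categorical columns.
    preprocessor = ColumnTransformer(transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())]), numerical),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                          ('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical),
    ])

    model = Pipeline([('preprocessor', preprocessor),
                      ('regressor', RandomForestRegressor())])
    model.fit(X, y)

    with open(f'{path}/data/model/model.pkl', 'wb') as f:
        pickle.dump(model, f)


if __name__ == '__main__':
    pipeline()
```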
The `modules/predict.py` script is responsible for making predictions with the trained model.
- Loads the pre-trained model from `data/model/model.pkl`.
- Reads new car data from individual JSON files in the `data/test/` directory.
- Generates a price prediction for each car in the test dataset.
- Saves all predictions into a single CSV file at `data/predictions/predictions.csv` (a sketch of this module follows the list).
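Again as an illustration only, a minimal sketch of `modules/predict.py` under the same assumptions; the `id` field used to label output rows and the output column names are hypothetical:

```python
# Hypothetical sketch of modules/predict.py; the 'id' field and the output column
# names are assumptions, not the project's confirmed format.
import json
import pickle
from pathlib import Path

import pandas as pd

path = '/opt/airflow/airflow_hw'


def predict() -> None:
    with open(f'{path}/data/model/model.pkl', 'rb') as f:
        model = pickle.load(f)

    rows = []
    for json_file in sorted(Path(f'{path}/data/test').glob('*.json')):
        with open(json_file) as f:
            car = json.load(f)  # one car record per JSON file
        price = model.predict(pd.DataFrame([car]))[0]
        rows.append({'car_id': car.get('id', json_file.stem), 'predicted_price': price})

    pd.DataFrame(rows).to_csv(f'{path}/data/predictions/predictions.csv', index=False)


if __name__ == '__main__':
    predict()
```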
The `dags/hw_dag.py` Airflow DAG orchestrates the entire ML pipeline.
- Defines two main tasks:
  - `pipeline_task`: executes the `pipeline.py` script.
  - `predict_task`: executes the `predict.py` script.
- Establishes the dependency `pipeline_task >> predict_task`, ensuring the model is trained before predictions are attempted.
- Configured for daily execution at 15:00 UTC, but can also be triggered manually from the Airflow UI (a sketch of the DAG follows the list).
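A minimal sketch of what `dags/hw_dag.py` could look like, assuming the two tasks are implemented as `PythonOperator`s calling the module functions directly; the start date, default args, and operator choice are assumptions, while the `dag_id` and the 15:00 UTC schedule follow the description above:

```python
# Hypothetical sketch of dags/hw_dag.py; dag_id and the daily 15:00 UTC schedule
# follow the README, everything else is an assumption.
import datetime as dt
import sys

from airflow.models import DAG
from airflow.operators.python import PythonOperator

path = '/opt/airflow/airflow_hw'  # must match the volume mount inside the containers
sys.path.insert(0, path)          # make modules/ importable by the scheduler and workers

from modules.pipeline import pipeline
from modules.predict import predict

args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2024, 1, 1),
    'retries': 1,
}

with DAG(
    dag_id='car_price_prediction',
    schedule_interval='0 15 * * *',  # daily at 15:00 UTC
    default_args=args,
    catchup=False,
) as dag:
    pipeline_task = PythonOperator(task_id='pipeline', python_callable=pipeline)
    predict_task = PythonOperator(task_id='predict', python_callable=predict)

    pipeline_task >> predict_task
```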
To set up and run this project, you will need Docker and Docker Compose installed on your system.
- Docker: Install Docker
- Docker Compose: Install Docker Compose
- Clone the repository:

  ```bash
  git clone https://github.com/maxdavinci2022/airflow_project.git
  cd airflow_project/airflow_hw
  ```

  Note: ensure you are in the `airflow_hw` directory, which is the root of this project's code.
- Configure Docker Compose for Airflow: your Airflow environment (usually defined by a `docker-compose.yaml` file) needs a volume mount that makes this `airflow_hw` project directory accessible inside the Airflow containers. Locate your `docker-compose.yaml` file (typically in the root of your Airflow environment setup, one level above `airflow_hw`) and add the following volume mount under the `volumes:` section for services such as `airflow-webserver`, `airflow-scheduler`, and `airflow-worker`:

  ```yaml
  services:
    airflow-webserver:
      volumes:
        # LOCAL_PATH_TO_AIRFLOW_HW should be the absolute path on your host machine.
        # Example for Windows:     - D:/Projects/_Skillbox/ds-intro/33_airflow/airflow_hw:/opt/airflow/airflow_hw
        # Example for Linux/macOS:  - ./airflow_hw:/opt/airflow/airflow_hw
        - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
        # ... existing DAGs/logs/plugins mounts ...
    airflow-scheduler:
      volumes:
        - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
        # ...
    airflow-worker:
      volumes:
        - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
        # ...
  ```

  Ensure that the `path` variable in `dags/hw_dag.py` matches the internal container path: `path = '/opt/airflow/airflow_hw'`.
- Place the DAG file: copy `hw_dag.py` into your Airflow `dags/` folder, i.e. the folder that your `docker-compose.yaml` typically mounts to `/opt/airflow/dags` inside the container. Example: `[Your Airflow Root]/dags/hw_dag.py`
- Start Apache Airflow: navigate to the directory where your `docker-compose.yaml` is located (the root of your Airflow environment setup) and run:

  ```bash
  docker-compose up -d
  ```
- Access the Airflow UI: open your web browser and navigate to the Airflow UI (usually `http://localhost:8080`).
- Enable the DAG: locate the DAG with `dag_id='car_price_prediction'` in the DAGs list and toggle it "On".
- Trigger the DAG: click the "Play" icon (Trigger DAG) to initiate a manual run.
- Monitor execution: observe the task execution in the Airflow UI and check the task logs for any errors.
Upon successful completion of the DAG, a `predictions.csv` file will be generated in the `airflow_hw/data/predictions/` directory on your local machine. This file contains the predicted car prices for the test data.
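As a quick sanity check, you can inspect the output with pandas; the column names depend on what `predict.py` actually writes, so those in the sketch above are assumptions:

```python
# Peek at the generated predictions (run from the directory containing airflow_hw/).
import pandas as pd

preds = pd.read_csv('airflow_hw/data/predictions/predictions.csv')
print(preds.head())
```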
Feel free to open issues or submit pull requests if you have suggestions for improvements or find any bugs.
- [maxdavinci](https://github.com/maxdavinci2022)
License: not yet specified. If a license is chosen (e.g., MIT), details will be provided in a `LICENSE.md` file.