airflow_project

Car Price Prediction Project with Apache Airflow

Project Overview

This project implements an end-to-end Machine Learning pipeline for predicting car prices, orchestrated using Apache Airflow. The pipeline consists of two main stages: model training and prediction generation. This setup demonstrates a basic MLOps workflow where data processing, model training, and inference are automated and scheduled.

Developed as part of a "Data Science Introduction" course assignment.

Project Structure

    airflow_hw/
    ├── dags/
    │   └── hw_dag.py            # Airflow DAG for orchestrating the ML pipeline.
    ├── modules/
    │   ├── __init__.py
    │   ├── pipeline.py          # Data preprocessing and model training.
    │   └── predict.py           # Loading the model and generating predictions.
    └── data/
        ├── train/
        │   └── homework.csv     # Training dataset for the model.
        ├── test/
        │   └── *.json           # Test data for predictions (each entry is a separate JSON file).
        ├── model/
        │   └── model.pkl        # Trained machine learning model (generated by pipeline.py).
        └── predictions/
            └── predictions.csv  # Car price predictions (generated by predict.py).

Functionality

modules/pipeline.py

This script handles the data preparation and model training; a minimal sketch follows the list below.

  • Loads the training data from data/train/homework.csv.
  • Performs necessary data preprocessing steps, such as:
    • Handling missing values.
    • One-Hot Encoding for categorical features.
    • Feature scaling for numerical features.
  • Trains a Machine Learning model (e.g., RandomForestRegressor).
  • Saves the trained model to data/model/model.pkl for later use.
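A minimal sketch of what pipeline.py might look like, assuming a scikit-learn preprocessing/estimator pipeline. The target column name ('price'), imputation strategies, and model settings here are illustrative assumptions, not the project's actual implementation, which depends on the homework.csv schema:

    # Illustrative sketch of modules/pipeline.py -- target column name,
    # imputation strategies, and hyperparameters are assumptions.
    import pickle

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    path = '/opt/airflow/airflow_hw'  # must match the container mount


    def pipeline():
        df = pd.read_csv(f'{path}/data/train/homework.csv')
        X = df.drop(columns=['price'])  # 'price' as target is an assumption
        y = df['price']

        numerical = X.select_dtypes(include=['number']).columns
        categorical = X.select_dtypes(include=['object']).columns

        # Impute and scale numerical features; impute and one-hot encode
        # categorical features.
        preprocessor = ColumnTransformer(transformers=[
            ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                              ('scaler', StandardScaler())]), numerical),
            ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                              ('encoder', OneHotEncoder(handle_unknown='ignore'))]),
             categorical),
        ])

        model = Pipeline([('preprocessor', preprocessor),
                          ('regressor', RandomForestRegressor())])
        model.fit(X, y)

        with open(f'{path}/data/model/model.pkl', 'wb') as f:
            pickle.dump(model, f)


    if __name__ == '__main__':
        pipeline()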

modules/predict.py

This script is responsible for making predictions using the trained model; see the sketch after this list.

  • Loads the pre-trained model from data/model/model.pkl.
  • Reads new car data from individual JSON files in the data/test/ directory.
  • Generates price predictions for each car in the test dataset.
  • Saves all predictions into a single CSV file at data/predictions/predictions.csv.
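A minimal sketch of predict.py under the same assumptions: the model is a pickled scikit-learn pipeline, and each JSON file holds one car record whose keys match the training feature columns. The car_id column derived from the filename is illustrative, not necessarily the project's actual output format:

    # Illustrative sketch of modules/predict.py -- output columns and the
    # use of the filename as an identifier are assumptions.
    import json
    import pickle
    from pathlib import Path

    import pandas as pd

    path = '/opt/airflow/airflow_hw'  # must match the container mount


    def predict():
        # Load the model trained by pipeline.py.
        with open(f'{path}/data/model/model.pkl', 'rb') as f:
            model = pickle.load(f)

        # Score every JSON file in data/test/ (one car record per file).
        rows = []
        for json_file in sorted(Path(f'{path}/data/test').glob('*.json')):
            with open(json_file) as f:
                record = json.load(f)
            features = pd.DataFrame([record])
            rows.append({'car_id': json_file.stem,
                         'predicted_price': model.predict(features)[0]})

        # Collect all predictions into a single CSV.
        pd.DataFrame(rows).to_csv(f'{path}/data/predictions/predictions.csv',
                                  index=False)


    if __name__ == '__main__':
        predict()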

dags/hw_dag.py

This Airflow DAG orchestrates the entire ML pipeline; a sketch follows the list below.

  • Defines two main tasks:
    • pipeline_task: Executes the pipeline.py script.
    • predict_task: Executes the predict.py script.
  • Establishes a dependency: pipeline_task >> predict_task, ensuring the model is trained before predictions are attempted.
  • Configured for daily execution at 15:00 UTC, but can also be triggered manually from the Airflow UI.
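A sketch of how hw_dag.py could wire this together. The dag_id, schedule, and task dependency come from the description above; the choice of PythonOperator (rather than, say, BashOperator) and the default_args values are assumptions:

    # Illustrative sketch of dags/hw_dag.py -- operator choice and
    # default_args are assumptions; dag_id, schedule, and dependency
    # match the description above.
    import datetime as dt
    import sys

    from airflow.models import DAG
    from airflow.operators.python import PythonOperator

    path = '/opt/airflow/airflow_hw'  # must match the docker-compose volume mount
    sys.path.insert(0, path)          # make the modules/ package importable

    from modules.pipeline import pipeline
    from modules.predict import predict

    args = {
        'owner': 'airflow',
        'start_date': dt.datetime(2024, 1, 1),  # assumed start date
        'retries': 1,
    }

    with DAG(
            dag_id='car_price_prediction',
            schedule_interval='0 15 * * *',  # daily at 15:00 UTC
            default_args=args,
            catchup=False,
    ) as dag:
        pipeline_task = PythonOperator(task_id='pipeline',
                                       python_callable=pipeline)
        predict_task = PythonOperator(task_id='predict',
                                      python_callable=predict)

        pipeline_task >> predict_task  # train before predicting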

Getting Started

To set up and run this project, follow the steps below.

Prerequisites

  • Docker and Docker Compose installed on your system.

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/maxdavinci2022/airflow_project.git
    cd airflow_project/airflow_hw

    Note: Ensure you are in the airflow_hw directory, which is the root of this project's code.

  2. Configure Docker Compose for Airflow: Your Airflow environment (usually defined by a docker-compose.yaml file) needs to have a volume mount that makes this airflow_hw project directory accessible inside the Airflow containers.

    Locate your docker-compose.yaml file (it's typically in the root of your Airflow environment setup, one level above airflow_hw). Add the following volume mount under the volumes: section for services like airflow-webserver, airflow-scheduler, and airflow-worker.

    services:
      airflow-webserver:
        volumes:
          # LOCAL_PATH_TO_AIRFLOW_HW should be the absolute path on your host machine.
          # Example for Windows: - D:/Projects/_Skillbox/ds-intro/33_airflow/airflow_hw:/opt/airflow/airflow_hw
          # Example for Linux/macOS: - ./airflow_hw:/opt/airflow/airflow_hw
          - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
          # ... existing DAGs/logs/plugins mounts ...
    
      airflow-scheduler:
        volumes:
          - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
          # ...
    
      airflow-worker:
        volumes:
          - /path/to/your/local/airflow_hw:/opt/airflow/airflow_hw
          # ...

    Ensure that the path variable in dags/hw_dag.py matches the internal container path: path = '/opt/airflow/airflow_hw'.

  3. Place the DAG file: Copy the hw_dag.py file into your Airflow's dags/ folder. This is the folder that your docker-compose.yaml typically mounts to /opt/airflow/dags inside the container. Example: [Your Airflow Root]/dags/hw_dag.py

  4. Start Apache Airflow: Navigate to the directory where your docker-compose.yaml is located (the root of your Airflow environment setup) and run:

    docker-compose up -d

Running the DAG

  1. Access Airflow UI: Open your web browser and navigate to the Airflow UI (usually http://localhost:8080).
  2. Enable the DAG: Locate the DAG with dag_id='car_price_prediction' in the DAGs list and toggle it "On".
  3. Trigger the DAG: Click on the "Play" icon (Trigger DAG) to initiate a manual run; a command-line alternative is shown after this list.
  4. Monitor Execution: Observe the task execution in the Airflow UI. Check task logs for any errors.
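
Alternatively, a run can be started from the command line with the Airflow CLI; this assumes the default compose service name airflow-webserver:

    docker-compose exec airflow-webserver airflow dags trigger car_price_prediction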

Expected Output

Upon successful completion of the DAG, a predictions.csv file will be generated in the airflow_hw/data/predictions/ directory on your local machine. This file will contain the predicted car prices for the test data.

Contributing

Feel free to open issues or submit pull requests if you have suggestions for improvements or find any bugs.

Author

maxdavinci2022

License

No license has been specified for this project yet. If one is added (for example, MIT), document it here and include a LICENSE.md file with the details.
