Airline Performance Dashboard

An interactive Streamlit dashboard for analyzing US domestic flight delays with machine learning predictions.

Features

Delay Analysis: Comprehensive breakdown of delay types and causes
Time Analysis: Temporal patterns in flight volumes and delays
Airline Analysis: Performance metrics by carrier
Airport Analysis: On-time performance by airport
Deep Dive (EDA): Exploratory data analysis with detailed insights
ML Prediction: Interactive flight delay probability prediction
About: Complete data science lifecycle documentation

Installation

Clone the repository or download the files
Install dependencies:

pip install -r requirements.txt

Data Setup

Data Source

The dataset contains 5.8 million US domestic flight records from 2015, sourced from the US Department of Transportation via Kaggle.

Data Format

The dashboard uses optimized Parquet format for fast loading:

Original: ~500MB CSV file
Optimized: 74MB Parquet file (85% reduction)
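
The 85% figure follows directly from the two sizes quoted above. A minimal sketch of the arithmetic (the function name is illustrative and not part of the project):

```python
def size_reduction_percent(original_mb: float, optimized_mb: float) -> float:
    """Return how much smaller the optimized file is, as a percentage."""
    if original_mb <= 0:
        raise ValueError("original_mb must be positive")
    return (1 - optimized_mb / original_mb) * 100


# Approximate sizes quoted above: ~500 MB CSV, 74 MB Parquet.
print(f"{size_reduction_percent(500, 74):.1f}% smaller")
```

Because the CSV size is approximate, the result is best read as "roughly 85%".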

If you have the original flights.csv, it will be loaded automatically. For better performance, convert it to Parquet format using the included script.
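
The included script is not reproduced here. As a rough sketch of what such a conversion can look like with pandas, assuming a Parquet engine such as pyarrow is installed (the function name csv_to_parquet is hypothetical):

```python
from pathlib import Path

import pandas as pd


def csv_to_parquet(csv_path: str, parquet_path: str) -> Path:
    """Read a CSV file and write it back out as gzip-compressed Parquet."""
    # low_memory=False lets pandas infer each column's dtype from the whole
    # file rather than chunk by chunk, which avoids mixed-type columns.
    df = pd.read_csv(csv_path, low_memory=False)
    # Needs a Parquet engine (for example pyarrow) to be installed.
    df.to_parquet(parquet_path, compression="gzip", index=False)
    return Path(parquet_path)
```

Calling csv_to_parquet("flights.csv", "flights.parquet") once should be enough; later runs can read the Parquet file directly.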

Machine Learning Model

Option 1: Pre-train the Model (Recommended)

For faster dashboard loading, pre-train the model once:

python train_model.py

This will:

Train a Random Forest classifier on the flight data
Evaluate the model and display metrics
Save the trained model to flight_delay_model.pkl

The dashboard will automatically load this pre-trained model for instant predictions.

Option 2: Train On-the-Fly

If you don't pre-train the model, the dashboard trains it automatically the first time the ML Prediction tab loads. Training takes about 1-2 minutes, and the trained model is then cached for the rest of the session.
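
Taken together, the two options amount to a load-or-train fallback. The sketch below illustrates that pattern with the standard library only. The names get_model, train_fn, and _session_cache are hypothetical, and the in-memory dictionary merely stands in for whatever session cache the dashboard actually uses (in a Streamlit app this role is typically played by a caching decorator such as st.cache_resource).

```python
import pickle
from pathlib import Path
from typing import Any, Callable, Dict

# In-memory stand-in for a per-session cache (an assumption of this sketch).
_session_cache: Dict[str, Any] = {}


def get_model(model_path: str, train_fn: Callable[[], Any]) -> Any:
    """Return a cached, pre-trained, or freshly trained model, in that order."""
    if model_path in _session_cache:
        return _session_cache[model_path]

    path = Path(model_path)
    if path.exists():
        # Option 1: a pre-trained model was saved earlier, so load it.
        # Only unpickle files you created yourself; pickle can run code.
        with path.open("rb") as f:
            model = pickle.load(f)
    else:
        # Option 2: no saved model, so train now (the slow path).
        model = train_fn()

    _session_cache[model_path] = model
    return model
```

With this shape, get_model("flight_delay_model.pkl", train_fn) calls train_fn only when no pickle file exists and nothing has been cached yet.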

Running the Dashboard

streamlit run app.py

The dashboard will open in your browser at http://localhost:8501

Project Structure

├── app.py                      # Main Streamlit dashboard
├── utils.py                    # Data loading and ML utilities
├── train_model.py              # Standalone model training script
├── flights.parquet             # Optimized flight data
├── airlines.csv                # Airline reference data
├── airports.csv                # Airport reference data
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Data Science Lifecycle

This project demonstrates a complete data science workflow:

Problem Definition: Predict flight delay probability
Data Collection: 5.8M flight records from US DOT via Kaggle
Data Preprocessing: CSV to Parquet optimization, data cleaning
EDA: Interactive visualizations across multiple dimensions
Modeling: Random Forest classifier with 8 features
Evaluation: Accuracy, Precision, Recall, F1-Score
Deployment: Interactive Streamlit dashboard

Model Details

Algorithm: Random Forest Classifier
Features: Airline, Origin/Destination Airports, Month, Day of Week, Day, Scheduled Departure, Distance
Target: Binary classification (delayed by more than 15 minutes vs. on time)
Training: 80/20 train-test split with stratification
Performance: See ML Prediction tab for live metrics
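
The details above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code in train_model.py: the column names follow the Kaggle 2015 flight delays dataset as best recalled, the choice of ARRIVAL_DELAY as the delay column is a guess, and encoding the string columns as integer category codes is just one simple option.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Assumed column names for the 8 features (3 categorical, 5 numeric).
CATEGORICAL = ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"]
NUMERIC = ["MONTH", "DAY", "DAY_OF_WEEK", "SCHEDULED_DEPARTURE", "DISTANCE"]


def train_delay_model(df: pd.DataFrame, delay_col: str = "ARRIVAL_DELAY"):
    """Fit a Random Forest on the 8 features and return (model, metrics)."""
    df = df.dropna(subset=CATEGORICAL + NUMERIC + [delay_col])
    X = df[NUMERIC].copy()
    for col in CATEGORICAL:
        # Encode each string column as integer category codes.
        X[col] = df[col].astype(str).astype("category").cat.codes
    # Binary target: 1 if delayed by more than 15 minutes, else 0.
    y = (df[delay_col] > 15).astype(int)

    # 80/20 split that preserves the delayed/on-time ratio in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    # n_jobs=-1 trains the trees on all available CPU cores.
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    model.fit(X_train, y_train)

    pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    return model, metrics
```

One caveat of this encoding: the category-to-code mapping must be saved and reapplied at prediction time, otherwise the same airline or airport could map to a different integer.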

Technical Optimizations

Parquet Format: 85% size reduction with gzip compression
Categorical Dtypes: Efficient memory usage for string columns
Model Caching: A pre-trained model is saved to disk and reused across sessions; a model trained on the fly is cached for the session
Pre-computed Metrics: Dashboard metrics calculated once and reused
Parallel Processing: Multi-core Random Forest training
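
To illustrate the categorical dtype point, the sketch below compares the memory footprint of a repetitive string column before and after conversion. The helper name compare_memory and the sample data are made up for the example; actual savings depend on how many distinct values a column has.

```python
from typing import Tuple

import pandas as pd


def compare_memory(column: pd.Series) -> Tuple[int, int]:
    """Return (bytes as stored, bytes after converting to category)."""
    as_is = column.memory_usage(deep=True)
    as_category = column.astype("category").memory_usage(deep=True)
    return as_is, as_category


# Illustrative data: four airline codes repeated many times.
airlines = pd.Series(["AA", "DL", "UA", "WN"] * 25_000)
before, after = compare_memory(airlines)
print(f"as strings: {before:,} bytes, as category: {after:,} bytes")
```

A category column stores each distinct string once plus a small integer code per row, which is why columns such as airline or airport codes tend to shrink the most.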

Requirements

Python 3.8+
streamlit
plotly
pandas
scikit-learn

License

This project is for educational purposes and demonstrates data science best practices.
