An interactive Streamlit dashboard for analyzing US domestic flight delays with machine learning predictions.
- Delay Analysis: Comprehensive breakdown of delay types and causes
- Time Analysis: Temporal patterns in flight volumes and delays
- Airline Analysis: Performance metrics by carrier
- Airport Analysis: On-time performance by airport
- Deep Dive (EDA): Exploratory data analysis with detailed insights
- ML Prediction: Interactive flight delay probability prediction
- About: Complete data science lifecycle documentation
- Clone the repository or download the files
- Install dependencies:
pip install -r requirements.txt

The dataset contains 5.8 million US domestic flight records from 2015, sourced from the US Department of Transportation via Kaggle.
The dashboard uses optimized Parquet format for fast loading:
- Original: ~500MB CSV file
- Optimized: 74MB Parquet file (85% reduction)
If you have the original flights.csv, it will be loaded automatically. For better performance, convert it to Parquet format using the included script.
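As a rough illustration of the loading and conversion behavior described above, here is a minimal sketch. The helper names `pick_data_file` and `convert_csv_to_parquet` are hypothetical and are not taken from `utils.py` or the included script. The preference order (Parquet first, CSV as fallback) is an assumption, and writing Parquet with pandas requires a Parquet engine such as pyarrow.

```python
from pathlib import Path


def pick_data_file(data_dir="."):
    """Return flights.parquet if it exists, otherwise fall back to flights.csv.

    Hypothetical helper: the Parquet-first preference order is an assumption.
    """
    data_dir = Path(data_dir)
    for name in ("flights.parquet", "flights.csv"):
        candidate = data_dir / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"No flights.parquet or flights.csv in {data_dir}")


def convert_csv_to_parquet(csv_path, parquet_path):
    """One-off conversion of the raw CSV into a gzip-compressed Parquet file."""
    # Imported here so the file-selection helper above works without pandas.
    import pandas as pd

    # low_memory=False parses each column in one pass, avoiding mixed dtypes.
    df = pd.read_csv(csv_path, low_memory=False)

    # Store string columns as categoricals to cut memory use.
    for col in df.columns:
        if pd.api.types.is_string_dtype(df[col]):
            df[col] = df[col].astype("category")

    df.to_parquet(parquet_path, compression="gzip", index=False)
```

Under these assumptions, a one-time call such as `convert_csv_to_parquet("flights.csv", "flights.parquet")` would produce the optimized file, after which `pick_data_file()` returns the Parquet path.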
For faster dashboard loading, pre-train the model once:
python train_model.py

This will:
- Train a Random Forest classifier on the flight data
- Evaluate the model and display metrics
- Save the trained model to flight_delay_model.pkl
The dashboard will automatically load this pre-trained model for instant predictions.
If you don't pre-train the model, the dashboard trains it automatically the first time the ML Prediction tab loads. Training takes about 1-2 minutes, and the result is cached for the rest of the session.
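The load-or-train behavior described above could look roughly like the following sketch. `load_or_train` and its `train_fn` argument are hypothetical names, not functions from this repository. In a Streamlit app the call would typically be wrapped in a caching decorator such as `st.cache_resource` so the fallback training runs only once per session, though how the dashboard actually caches the model is an assumption here.

```python
import pickle
from pathlib import Path


def load_or_train(model_path, train_fn):
    """Load a pickled model if one exists, otherwise train a fresh one.

    train_fn is a hypothetical zero-argument callable that returns a fitted
    model. Returns (model, source), where source is "loaded" or "trained".
    """
    path = Path(model_path)
    if path.exists():
        # Only unpickle files you created yourself: pickle can run arbitrary code.
        with path.open("rb") as f:
            return pickle.load(f), "loaded"
    # No saved model found: train in memory (the slow path).
    return train_fn(), "trained"
```

A call such as `load_or_train("flight_delay_model.pkl", train_fn)` takes the fast path once `train_model.py` has been run.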
streamlit run app.py

The dashboard will open in your browser at http://localhost:8501
├── app.py # Main Streamlit dashboard
├── utils.py # Data loading and ML utilities
├── train_model.py # Standalone model training script
├── flights.parquet # Optimized flight data
├── airlines.csv # Airline reference data
├── airports.csv # Airport reference data
├── requirements.txt # Python dependencies
└── README.md # This file
This project demonstrates a complete data science workflow:
- Problem Definition: Predict flight delay probability
- Data Collection: 5.8M flight records from US DOT via Kaggle
- Data Preprocessing: CSV to Parquet optimization, data cleaning
- EDA: Interactive visualizations across multiple dimensions
- Modeling: Random Forest classifier with 8 features
- Evaluation: Accuracy, Precision, Recall, F1-Score
- Deployment: Interactive Streamlit dashboard
- Algorithm: Random Forest Classifier
- Features: Airline, Origin/Destination Airports, Month, Day of Week, Day, Scheduled Departure, Distance
- Target: Binary classification (Delayed >15 min vs On-Time)
- Training: 80/20 train-test split with stratification
- Performance: See ML Prediction tab for live metrics
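The model setup listed above can be sketched as follows. This is an illustration on synthetic data, not the code in `train_model.py`: the function name `train_delay_model`, the `n_estimators` value, the random seed, and the eight random numeric features are all assumptions. Only the 15-minute threshold, the 80/20 stratified split, the Random Forest classifier, and the four metrics come from the description.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

DELAY_THRESHOLD_MIN = 15  # a flight counts as delayed above this many minutes


def train_delay_model(X, delay_minutes, seed=42):
    """Fit a Random Forest on a binary delayed/on-time target and score it."""
    y = (np.asarray(delay_minutes) > DELAY_THRESHOLD_MIN).astype(int)

    # 80/20 split, stratified so both sets keep the same delayed/on-time ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )

    # n_jobs=-1 uses all available cores; n_estimators=50 is an arbitrary choice.
    model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=seed)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    return model, metrics


# Synthetic stand-in for the eight encoded flight features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
delay_demo = 10 + 20 * X_demo[:, 0] + rng.normal(scale=5, size=1000)

model, metrics = train_delay_model(X_demo, delay_demo)
```

A delay probability for new rows is then available through `model.predict_proba(X)[:, 1]`, the probability of the "delayed" class.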
- Parquet Format: 85% size reduction with gzip compression
- Categorical Dtypes: Efficient memory usage for string columns
- Model Caching: Train once, use across sessions
- Pre-computed Metrics: Dashboard metrics calculated once and reused
- Parallel Processing: Multi-core Random Forest training
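To make the categorical dtype point concrete, the sketch below compares the memory footprint of a low-cardinality string column before and after conversion. The carrier codes and row count are made up for illustration and are not drawn from the dataset.

```python
import pandas as pd

# 100,000 rows with only four distinct values, like an airline code column.
airlines = pd.Series(["AA", "DL", "UA", "WN"] * 25_000, name="AIRLINE")
as_category = airlines.astype("category")

# deep=True counts the bytes of the string values themselves.
string_bytes = airlines.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"strings:  {string_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The categorical version stores each distinct value once plus a small integer code per row, which is why it suits columns such as airline or airport codes.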
- Python 3.8+
- streamlit
- plotly
- pandas
- scikit-learn
This project is for educational purposes and demonstrates data science best practices.