This repository contains the final project for the IST 707 Applied Machine Learning course at Syracuse University. The project aims to forecast pothole development in Syracuse, NY, from factors such as weather, pavement ratings, and reported maintenance requests.
Marina Mitiaeva, mmitiaev@syr.edu
Cathryn Lee Shelton, clshelto@syr.edu
Abhi Chakraborty, abchakra@syr.edu
Edward Joseph Cogan II, ejcogani@syr.edu
Pothole development occurs due to:
- Poor paving materials.
- Extreme temperature changes.
- Traffic load over time.
Together, these factors cause road deterioration, reducing road safety and increasing maintenance costs.
The goal of this project is to build a predictive model that:
- Forecasts the count of potholes.
- Helps city maintenance departments plan maintenance operations efficiently.
- Reduces costs and improves road safety.
We utilized several public datasets relevant to road conditions, weather, and maintenance requests:
- Pavement rating data provided by the Syracuse Metropolitan Transportation Council across multiple years.
- Weather data from NASA, capturing climate conditions such as temperature fluctuations and precipitation that affect road quality.
- Public maintenance requests, including pothole reports, submitted by Syracuse citizens via SeeClickFix.
- Detailed street information from Syracuse's open data portal, providing road classifications and usage patterns that help identify pothole-prone areas.
Data collected from 2021-2023.
EDA (Exploratory Data Analysis):
- Correlation matrices and visual insights.
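A minimal sketch of this EDA step. The dataframe below is synthetic stand-in data with placeholder column names (`avg_temp`, `pavement_rating`, `pothole_count`, etc.); the real merged dataset and features live in the project notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data; in the project this would be the merged
# weather / pavement / maintenance-request dataset loaded from data/.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "avg_temp": rng.normal(10, 8, 300),        # placeholder feature names
    "precipitation": rng.gamma(2.0, 1.5, 300),
    "pavement_rating": rng.integers(1, 11, 300),
    "pothole_count": rng.poisson(2.0, 300),
})

# Correlation matrix over numeric features.
corr = df.corr(numeric_only=True)
print(corr)

# Simple heatmap of the correlations.
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
plt.tight_layout()
plt.show()
```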
Model:
- Linear Regression (Baseline)
- Loss function: Mean Squared Error (MSE)
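A minimal sketch of the baseline and its MSE loss. The synthetic `X`/`y` below are stand-ins for the project's preprocessed features and pothole counts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data; in the project, X/y come from the preprocessed datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # e.g. weather and pavement features
y = X @ np.array([0.4, -0.2, 0.1, 0.0, 0.3]) + rng.normal(scale=0.3, size=500)

# Baseline: ordinary least-squares linear regression.
baseline = LinearRegression().fit(X, y)
preds = baseline.predict(X)

# MSE is the loss used to compare all models in this project.
print("Baseline MSE:", mean_squared_error(y, preds))
```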
Data Preprocessing:
- Numerical Features: Imputed with the mean and scaled.
- Categorical Features: Imputed with the most frequent value, one-hot encoded.
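A sketch of this preprocessing with scikit-learn; the column names below are placeholders, not the project's actual feature names.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column lists; replace with the project's actual features.
numeric_cols = ["avg_temp", "precipitation", "pavement_rating"]
categorical_cols = ["road_class", "neighborhood"]

preprocessor = ColumnTransformer(
    transformers=[
        # Numerical features: mean imputation, then standard scaling.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical features: most-frequent imputation, then one-hot encoding.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ]
)
# preprocessor is typically chained with a regressor in a single Pipeline.
```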
Data Split:
- Train: 60%
- Validation: 20%
- Test: 20%
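The 60/20/20 split can be reproduced with two `train_test_split` calls, as in this sketch (synthetic data stands in for the preprocessed features and pothole counts):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in; in the project X/y come from the preprocessed data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.poisson(lam=2.0, size=1000)  # pothole counts are non-negative integers

# Hold out 20% for testing, then take 25% of the remaining 80% for validation,
# which yields a 60/20/20 train/validation/test split overall.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```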
- /data/: Contains raw and processed datasets.
- /notebooks/: Jupyter notebook with code and experiments.
- /models/: Saved model weights and checkpoints.
- /predictions/: Generated predictions and model outputs for analysis and evaluation.
- /presentation/: Final project presentation in PPT format.
Our analysis compared the Mean Squared Error (MSE) across multiple models, incorporating various feature engineering and transformation techniques. The key takeaways from the results are:
- The baseline model (linear regression on all features) achieved an MSE of 0.1451, while feature engineering improved performance slightly (0.1299).
- Polynomial transformations combined with different regression techniques (ridge, lasso, elastic net, random forest, gradient boosting, stacking, and voting) led to varying degrees of improvement, with the best-performing models reaching roughly 0.1189-0.1251 MSE.
- Stacking, by contrast, showed a higher MSE (0.1894), indicating potential overfitting or poor generalization.
- The best model on validation data achieved an MSE of 0.1189, and the final model tested on unseen data reached an MSE of 0.0339, demonstrating strong predictive performance.
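For reference, a sketch of how such a comparison can be run. The models, hyperparameters, and synthetic data below are illustrative stand-ins for the project's preprocessed pothole dataset, so the resulting numbers will not match those above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with some nonlinear structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))
y = 0.5 * X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.3, size=800)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Candidate models: baseline plus polynomial features with regularized
# regressions and tree ensembles (stacking/voting omitted for brevity).
models = {
    "linear": LinearRegression(),
    "poly+ridge": make_pipeline(PolynomialFeatures(2), StandardScaler(), Ridge(alpha=1.0)),
    "poly+lasso": make_pipeline(PolynomialFeatures(2), StandardScaler(), Lasso(alpha=0.01, max_iter=10000)),
    "poly+elasticnet": make_pipeline(PolynomialFeatures(2), StandardScaler(), ElasticNet(alpha=0.01, max_iter=10000)),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name:>18}: validation MSE = {mse:.4f}")
```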
To set up the environment, install the dependencies with:

```
pip install -r requirements.txt
```