Jupyter notebook for end-to-end MPG prediction: EDA, data cleaning (missing/categorical), feature scaling, train/test split, and scikit-learn regression with metrics (R², MAE, RMSE).
job28/Autompg-predictor-regression
AUTO-MPG: LINEAR REGRESSION (JUPYTER NOTEBOOK)
================================================

Project Summary
---------------
This project builds a simple regression model to predict a car’s fuel efficiency (MPG) using the classic Auto MPG dataset. The workflow is implemented in a single notebook:

  Regression_AutoMpg.ipynb

It covers:
1) Data loading and basic cleaning
2) Exploratory analysis (correlations and scatter-matrix)
3) Feature engineering (adding squared terms)
4) Train/test split
5) Linear Regression model training and evaluation (R² and RMSE)
6) Plot export for quick visualization

Repository Structure (expected)
-------------------------------
.
├─ Regression_AutoMpg.ipynb            ← Main analysis notebook
├─ data/
│  └─ auto-mpg.csv                     ← Dataset file (you add this)
└─ plots/
   └─ Regression_autompg_Scatter.png   ← Generated by the notebook

Dataset
-------
Name: Auto MPG
Source: UCI Machine Learning Repository (originally from StatLib)
Target column: mpg

The notebook expects a CSV at:
  data/auto-mpg.csv
with these columns in this exact order (no header row in source is fine as the notebook assigns names):
  mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, origin, car_name

Note: The original UCI data may contain “?” for horsepower. Ensure your CSV has numeric values (convert or drop rows with “?”) before running. The notebook drops the text columns `origin` and `car_name` and engineers polynomial features for a few numeric fields.

Environment & Requirements
--------------------------
Python 3.9+ recommended.

Core libraries used in the notebook:
- pandas
- numpy
- scikit-learn
- matplotlib

Quick Setup (virtual environment)
---------------------------------
Linux / macOS
1) python -m venv .venv
2) source .venv/bin/activate
3) pip install pandas numpy scikit-learn matplotlib

Windows (PowerShell)
1) python -m venv .venv
2) .venv\Scripts\Activate.ps1
3) pip install pandas numpy scikit-learn matplotlib

Preparing Folders & Data
------------------------
1) Create folders if missing:
   mkdir -p data plots
2) Place your dataset at:
   data/auto-mpg.csv
   Make sure the columns match the list given above and non-numeric entries (e.g., “?”) are handled.

How to Run
----------
Option A: Jupyter
1) jupyter notebook
2) Open `Regression_AutoMpg.ipynb`
3) Run all cells (Kernel → Restart & Run All)

Option B: VS Code / other IDE
- Open the notebook and run all cells from the UI.

What the Notebook Does
----------------------
1) Reads the dataset and assigns column names
2) Drops non-numeric text columns: `origin`, `car_name`
3) Exploratory analysis:
   - Correlation matrix
   - Scatter-matrix (saved to `plots/Regression_autompg_Scatter.png`)
4) Feature engineering:
   - Adds squared terms for selected numeric features (e.g., horsepower, displacement, weight)
5) Train/test split (scikit-learn `train_test_split`, random_state=1)
6) Fits `LinearRegression`
7) Reports:
   - R² on the (full) dataset
   - RMSE on the test set

Outputs You Should See
----------------------
- A scatter-matrix plot saved to:
  plots/Regression_autompg_Scatter.png
- Printed metrics in the notebook output, including:
  - R squared: <value>
  - RMSE: <value>
(Exact values depend on your cleaned dataset.)

Reproducing Results
-------------------
- Ensure your CSV is clean (no “?” / non-numeric in numeric columns).
- Run all cells in order.
- The plot and metrics will be generated automatically.

Common Pitfalls & Tips
----------------------
- If you get parsing or dtype errors, check for non-numeric values in `horsepower` and other numeric columns.
  Convert them with pandas (e.g., `pd.to_numeric(..., errors="coerce")`) and drop rows with NaNs if necessary.
- If `plots/` does not exist, create it before running, or the save call will fail.
- Results will change if you alter the random seed, features, or data cleaning choices.

Extending the Project
---------------------
- Try adding more polynomial/interaction terms and compare RMSE.
- Standardize/normalize features and see if it helps (especially for regularized models).
- Evaluate alternative models (Ridge, Lasso, RandomForestRegressor, Gradient Boosting).
- Cross-validate with KFold and compare performance.

Credits & Attribution
---------------------
- Dataset: Auto MPG, UCI Machine Learning Repository.
- Libraries: pandas, numpy, scikit-learn, matplotlib.

License
-------
This project is provided for educational purposes. If you plan to distribute, consider adding a LICENSE file (e.g., MIT) to clarify usage terms.

Contact
-------
For questions or issues, please open an issue in the repository or contact the maintainer.
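
Example sketch: cleaning `horsepower`. The cleaning step mentioned under Dataset and Common Pitfalls & Tips could look roughly like the code below. It is not taken from the notebook: the helper name `load_clean` is hypothetical, the three sample rows are made up for illustration, and the in-memory string only stands in for data/auto-mpg.csv.

```python
import io

import pandas as pd

# Column order as listed in the Dataset section of this README.
COLUMNS = [
    "mpg", "cylinders", "displacement", "horsepower", "weight",
    "acceleration", "model_year", "origin", "car_name",
]

# Made-up rows for illustration only (not real Auto MPG records).
# The second row carries the "?" placeholder in the horsepower column.
SAMPLE_CSV = """\
18.0,8,300.0,130,3500,12.0,70,1,car a
25.0,4,100.0,?,2000,19.0,71,1,car b
22.0,6,200.0,95,2800,15.5,70,1,car c
"""


def load_clean(source):
    """Read a header-less CSV, coerce horsepower to numeric, drop bad rows."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    # "?" (or any other non-numeric text) becomes NaN instead of raising.
    df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
    # Drop the rows whose horsepower could not be parsed.
    return df.dropna(subset=["horsepower"]).reset_index(drop=True)


clean = load_clean(io.StringIO(SAMPLE_CSV))
print(clean[["mpg", "horsepower"]])
```

With a real file, `load_clean("data/auto-mpg.csv")` would be the equivalent call, assuming the file is comma-separated and has no header row.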
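
Example sketch: squared terms, split, fit, and metrics. Steps 4 to 7 of "What the Notebook Does" can be approximated as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the `_sq` column suffix, the helper names, and `test_size=0.2` are hypothetical (the README only fixes `random_state=1`), and the data frame is synthetic so the block runs without the CSV.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def add_squared_terms(df, columns):
    """Return a copy of df with a '<col>_sq' column for each listed column."""
    out = df.copy()
    for col in columns:
        out[f"{col}_sq"] = out[col] ** 2
    return out


def fit_and_report(df, target="mpg", test_size=0.2, random_state=1):
    """Fit LinearRegression; return the model, R² on all rows, RMSE on test rows."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    r2_full = model.score(X, y)  # R² on the full dataset, as the README describes
    rmse_test = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    return model, r2_full, rmse_test


# Synthetic stand-in data (assumption): NOT the Auto MPG dataset.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "horsepower": rng.uniform(50, 200, n),
    "displacement": rng.uniform(70, 450, n),
    "weight": rng.uniform(1600, 5000, n),
})
demo["mpg"] = (
    45 - 0.08 * demo["horsepower"] - 0.004 * demo["weight"] + rng.normal(0, 1.0, n)
)

features = add_squared_terms(demo, ["horsepower", "displacement", "weight"])
model, r2_full, rmse_test = fit_and_report(features)
print(f"R squared: {r2_full:.3f}")
print(f"RMSE: {rmse_test:.3f}")
```

Because the R² here is computed on all rows, it includes the training data and will usually look more optimistic than a test-only score; the test-set RMSE is the more honest of the two numbers.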
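
Example sketch: saving the scatter-matrix safely. The pitfall about a missing `plots/` folder can be avoided by creating the folder from code before saving. The helper below is a hypothetical illustration using `pandas.plotting.scatter_matrix`; the notebook may build its plot differently. The `Agg` backend and the temporary directory are assumptions made so the sketch runs headless and leaves no files behind. In the notebook the target would be plots/Regression_autompg_Scatter.png.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix


def save_scatter_matrix(df, out_path):
    """Create the parent folder if needed, then save a scatter-matrix PNG."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # avoids the save failure
    axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")
    fig = axes[0, 0].get_figure()
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
    return out_path


# Synthetic stand-in data (assumption): NOT the Auto MPG dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "mpg": rng.uniform(10, 45, 50),
    "horsepower": rng.uniform(50, 200, 50),
    "weight": rng.uniform(1600, 5000, 50),
})

target = Path(tempfile.mkdtemp()) / "plots" / "Regression_autompg_Scatter.png"
saved = save_scatter_matrix(demo, target)
print(f"Saved scatter-matrix to {saved}")
```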
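
Example sketch: K-fold comparison of models. For the cross-validation idea listed under "Extending the Project", one possible starting point is shown below. It is only a sketch: `cv_rmse` is a hypothetical helper, five folds and `alpha=1.0` are arbitrary choices, and the data is synthetic. Ridge is wrapped in a pipeline with `StandardScaler` because regularized models are sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_rmse(model, X, y, n_splits=5, seed=1):
    """Mean RMSE across shuffled K-fold splits (lower is better)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scikit-learn reports the negated RMSE so that higher is always better.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return float(-scores.mean())


# Synthetic stand-in data (assumption): NOT the Auto MPG dataset.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(50, 200, n),     # plays the role of horsepower
    rng.uniform(70, 450, n),     # plays the role of displacement
    rng.uniform(1600, 5000, n),  # plays the role of weight
])
y = 45 - 0.08 * X[:, 0] - 0.004 * X[:, 2] + rng.normal(0, 1.0, n)

candidates = {
    "linear": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}
results = {name: cv_rmse(model, X, y) for name, model in candidates.items()}
for name, rmse in results.items():
    print(f"{name}: CV RMSE = {rmse:.3f}")
```

Scaling inside the pipeline means the scaler is refit on each training fold, which keeps information from the held-out fold out of the preprocessing step.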