The goal is to build a robust regression model capable of predicting the continuous variable SalePrice based on 79 explanatory variables describing almost every aspect of residential homes.
- Target Variable Analysis: Identified positive skewness in `SalePrice` and applied a log transformation (`np.log1p`) to normalize the distribution for better model performance.
- Advanced Data Preprocessing:
  - Handled missing values using context-aware strategies (e.g., filling `LotFrontage` with the median of the neighborhood).
  - Conducted feature encoding using Label Encoding for ordinal data and One-Hot Encoding for nominal data.
- Feature Engineering: Scaled numerical features using `StandardScaler` to ensure uniform influence on the model.
- Model Comparison:
  - Built a baseline Linear Regression model.
  - Implemented an advanced XGBoost Regressor to handle non-linear relationships and improve accuracy.
- Evaluation Metrics: Evaluated models using RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R-squared.
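
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the data is synthetic (a stand-in for `train.csv`), only a handful of columns are used, the ordinal ranking in `quality_rank` is hypothetical, and scikit-learn's `GradientBoostingRegressor` stands in for the XGBoost Regressor so the sketch needs only NumPy, pandas, and scikit-learn. An explicit mapping is used for the ordinal column because `LabelEncoder` assigns codes in sorted (alphabetical) order, which would not preserve the quality ranking.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv (assumption: NOT the real data).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], size=n),
    "LotFrontage": rng.normal(70, 15, size=n),
    "GrLivArea": rng.normal(1500, 400, size=n).clip(400),
    "ExterQual": rng.choice(["Fa", "TA", "Gd", "Ex"], size=n),
})
df.loc[rng.choice(n, size=40, replace=False), "LotFrontage"] = np.nan

# Hypothetical ordinal ranking, used both to build the target and to encode.
quality_rank = {"Fa": 0, "TA": 1, "Gd": 2, "Ex": 3}
df["SalePrice"] = np.exp(
    11
    + 0.0004 * df["GrLivArea"]
    + 0.1 * df["ExterQual"].map(quality_rank)
    + rng.normal(0, 0.1, size=n)
)

# Context-aware imputation: fill LotFrontage with its neighborhood median.
df["LotFrontage"] = df["LotFrontage"].fillna(
    df.groupby("Neighborhood")["LotFrontage"].transform("median")
)

# Ordinal feature -> integer codes that keep the quality order.
df["ExterQual"] = df["ExterQual"].map(quality_rank)

# Log-transform the skewed target.
y = np.log1p(df["SalePrice"])
X = df.drop(columns="SalePrice")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale numeric columns, one-hot encode the nominal column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["LotFrontage", "GrLivArea", "ExterQual"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])

models = {
    "LinearRegression": LinearRegression(),
    # Stand-in for the XGBoost Regressor (assumption, see lead-in).
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    pipe = make_pipeline(preprocess, model)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    # Metrics are computed on the log1p scale of SalePrice.
    results[name] = {
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "MAE": float(mean_absolute_error(y_test, pred)),
        "R2": float(r2_score(y_test, pred)),
    }
    print(name, results[name])
```

Because the target is log-transformed, the reported RMSE and MAE are in log units; applying `np.expm1` to the predictions would bring them back to the original price scale. The synthetic target here is linear in the features by construction, so the relative ranking of the two models in this sketch says nothing about the project's real results.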
- Skewness Correction: The log transformation reduced the skewness of `SalePrice` from 1.88, bringing the distribution closer to normal.
- Model Performance: The XGBoost model significantly outperformed the baseline, capturing complex feature interactions that the linear model missed.
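
As a rough illustration of the skewness correction, the snippet below measures skewness before and after `np.log1p` on a synthetic right-skewed price series. The log-normal sample and its parameters are assumptions made for this example, so the numbers it prints will not match the 1.88 reported for the real `SalePrice` column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (assumption: stand-in for the SalePrice column).
rng = np.random.default_rng(42)
prices = pd.Series(
    rng.lognormal(mean=12.0, sigma=0.4, size=5000), name="SalePrice"
)

skew_before = prices.skew()           # sample skewness of raw prices
skew_after = np.log1p(prices).skew()  # sample skewness after log1p

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

On the real data the same two lines, applied to the `SalePrice` column loaded from `train.csv`, would give the before and after values for this project.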
Project-House_Price_Prediction/
├── Project_House_Price_Prediction.ipynb   # Full ML Pipeline
├── train.csv                              # Training Data
├── test.csv                               # Testing Data
└── README.md                              # Project documentation