This project investigates the relationship between smartphone features and their prices using various regression techniques. It aims to answer the question: Which features influence mobile phone pricing the most, and how accurately can we predict price using those features?
View the notebook for an in-depth review of techniques.
Install the required packages:
pip install -r requirements.txt
Or install individually:
pip install numpy pandas matplotlib seaborn scikit-learn statsmodels kaggle jupyter ipykernel
Explore and compare multiple regression models to determine the most effective one for predicting mobile phone prices.
- Load and describe the dataset
- Understand structure, datatypes, and missing values
- Visualize the distribution of the target variable (Price)
- Analyze individual features
- Explore feature-target relationships
- Check for multicollinearity using VIF and correlation heatmaps
- Handle missing or inconsistent values
- Normalize / transform variables as needed
- Split into training and test sets
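The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic (made-up values using a few of the dataset's column names), and the 80/20 split ratio, the `random_state`, and the `log_Price` column name are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle data (values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "RAM": rng.choice([2, 4, 6, 8], size=n),
    "ROM": rng.choice([32, 64, 128, 256], size=n),
    "Battery_Power": rng.integers(2000, 6000, size=n),
})
# Right-skewed synthetic price driven by RAM and ROM plus noise.
df["Price"] = np.exp(7 + 0.1 * df["RAM"] + 0.004 * df["ROM"] + rng.normal(0, 0.3, n))

# Log-transform the target to reduce right skew (column name is hypothetical).
df["log_Price"] = np.log(df["Price"])

X = df[["RAM", "ROM", "Battery_Power"]]
y = df["log_Price"]

# Hold out 20% of rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, so no test statistics leak in.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler after the split, rather than on the full dataset, keeps the test set untouched until evaluation.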
Models Explored:
- 🔹 Ordinary Least Squares (OLS) Regression
- 🔹 Random Forest Regressor
- 🔹 Ridge Regression (to mitigate multicollinearity)
Evaluation Metrics:
- 📉 Mean Absolute Error (MAE)
- 📉 Mean Squared Error (MSE)
- 📈 R-squared (R²)
- 📊 Variance Inflation Factor (VIF) for multicollinearity detection
- 📉 Actual vs Predicted scatter plots
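The three error metrics can be computed with Scikit-learn as in the short sketch below. The `y_true` and `y_pred` arrays are tiny made-up examples chosen so the arithmetic is easy to check by hand; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values (not project results).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error| = 0.5
mse = mean_squared_error(y_true, y_pred)   # mean of error^2 = 0.375
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot, about 0.882

print(f"MAE={mae:.3f}  MSE={mse:.3f}  R2={r2:.3f}")
```

If the model is trained on a log-transformed target, these metrics are on the log scale unless predictions are converted back (for example with `np.exp`) before scoring.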
- Dataset: 807 mobile phones described by 8 columns: seven features (Ratings, RAM, ROM, Mobile_Size, Primary_Cam, Selfi_Cam, Battery_Power) and the target (Price)
- Data Preprocessing: Applied a log transformation to Price to stabilize its variance and reduce the right skew of its distribution
- Multicollinearity: High VIF values detected in Ratings (40.5), Battery_Power (18.4), and Primary_Cam (18.0)
- Model Performance:
  - OLS Regression: R² = 0.67
  - Random Forest: R² = 0.95 (best performance)
- Key Predictors: Ratings (0.69), ROM (0.61), and Battery_Power (0.55) show the strongest correlations with Price
- Surprising Findings: Primary_Cam shows a negative correlation (-0.27) with Price
- Feature selection and dimensionality reduction (PCA)
- Model deployment via a Flask app or Streamlit dashboard
- Integration with real-time pricing APIs for prediction
- NumPy - Numerical computations and array operations
- Pandas - Data manipulation and analysis
- Matplotlib - Data visualization and plotting
- Seaborn - Statistical data visualization
- Scikit-learn - Machine learning algorithms and metrics
- Statsmodels - Statistical modeling and analysis
- Variance Inflation Factor (VIF) - Multicollinearity detection
- Correlation Analysis - Feature relationship analysis
- Log Transformation - Skew reduction and variance stabilization
- Train-Test Split - Model validation
- Ordinary Least Squares (OLS) - Linear regression with statistical inference
- Random Forest Regressor - Ensemble learning method
- Ridge Regression - Regularized linear regression
- StandardScaler - Feature standardization (zero mean, unit variance)
- Dataset: Mobile Phone Price Prediction by Ganjerlawrence on Kaggle
- Visualization powered by Matplotlib and Seaborn
- Models built with Scikit-learn and Statsmodels
This project is open-source under the MIT License.