This project focuses on implementing and comparing various linear regression techniques in R. Developed as a graduate-level assignment for the Data Science & Data Analysis course at the University of Salerno, it aims to identify the best predictive model for a given dataset by minimizing test-set error.
The work demonstrates proficiency in R programming, statistical analysis, and reproducible research workflows.
🔗 This project is part of a series of regression analysis case studies. For another similar project, see regression-model-comparison-R.
Given a dataset with 60 observations and 30 predictors (X1–X30), the goal is to:
- Assess correlation and multicollinearity among predictors.
- Implement OLS manually, without using `lm()` or `summary()` (see the sketch after this list).
- Compare the manual results with R's built-in functions.
- Apply model selection strategies (stepwise, Ridge, Lasso) to minimize test MSE.
- Identify statistically significant predictors.
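A minimal sketch of the manual OLS step, assuming the training data lives in a data frame called `train` with response column `Y` (both names are assumptions, not taken from the repository script):

```r
# Manual OLS via matrix algebra -- illustrative sketch only
X <- as.matrix(cbind(Intercept = 1, train[, paste0("X", 1:30)]))  # design matrix with intercept
y <- train$Y

beta_hat  <- drop(solve(t(X) %*% X) %*% t(X) %*% y)   # (X'X)^-1 X'y
residuals <- y - X %*% beta_hat
n <- nrow(X); p <- ncol(X)

sigma2_hat <- sum(residuals^2) / (n - p)                            # residual variance estimate
se_beta    <- sqrt(diag(sigma2_hat * solve(t(X) %*% X)))            # standard errors
t_stat     <- beta_hat / se_beta                                    # t statistics
p_values   <- 2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)   # two-sided p-values

round(cbind(Estimate = beta_hat, StdError = se_beta, t = t_stat, p = p_values), 4)
```

These values can then be checked against `summary(lm(Y ~ ., data = train))` to validate the manual computation.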
All code comments and the report are written in Italian, as this project was originally developed in an academic setting in Italy. Nonetheless, the structure, organization, and methodology follow international best practices in data science and statistical modeling.
- 📊 Correlation analysis and multicollinearity check using `VIF` (see the sketch after this list)
- 🧮 Manual OLS implementation (matrix algebra, p-values without `summary()`)
- ⚙️ Model selection techniques:
  - Forward selection (Cp, BIC, CV)
  - Backward elimination (Cp, BIC, CV)
  - Hybrid selection
  - Ridge regression
  - Lasso regression
- 🧪 5-fold cross-validation
- 📉 MSE computation on the test set to compare all models
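As a rough illustration of the multicollinearity check and the forward stepwise step (again assuming a `train` data frame with response `Y`; the actual script may differ):

```r
library(car)    # vif()
library(leaps)  # regsubsets()

# Variance Inflation Factors on the full OLS fit: large values flag
# predictors that are nearly collinear with the others
vif_values <- vif(lm(Y ~ ., data = train))
head(sort(vif_values, decreasing = TRUE))

# Forward stepwise selection over all 30 predictors,
# with the subset size chosen by Mallows' Cp or BIC
regfit_fwd  <- regsubsets(Y ~ ., data = train, nvmax = 30, method = "forward")
fwd_summary <- summary(regfit_fwd)
which.min(fwd_summary$cp)                      # subset size minimizing Cp
which.min(fwd_summary$bic)                     # subset size minimizing BIC
coef(regfit_fwd, which.min(fwd_summary$bic))   # coefficients of the BIC-chosen model
```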
- Data split: 80% training / 20% testing
- Models fitted on the training set, evaluated on the test set
- All MSE values collected and the best model selected
- Final model's predictors and coefficients reported (see the sketch after this list)
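A minimal sketch of this workflow, assuming the full data frame is called `dataset` with response column `Y` (the actual seed, split, and object names in the script may differ):

```r
library(glmnet)

set.seed(42)                                   # for a reproducible split
n_obs     <- nrow(dataset)
train_idx <- sample(seq_len(n_obs), size = floor(0.8 * n_obs))
train     <- dataset[train_idx, ]
test      <- dataset[-train_idx, ]

x_train <- as.matrix(train[, setdiff(names(train), "Y")])
x_test  <- as.matrix(test[,  setdiff(names(test),  "Y")])

# Ridge (alpha = 0) and Lasso (alpha = 1), lambda chosen by 5-fold CV
cv_ridge <- cv.glmnet(x_train, train$Y, alpha = 0, nfolds = 5)
cv_lasso <- cv.glmnet(x_train, train$Y, alpha = 1, nfolds = 5)

# Test MSE for each candidate model on the same held-out set
test_mse <- function(pred) mean((test$Y - pred)^2)

mse_results <- c(
  ols   = test_mse(predict(lm(Y ~ ., data = train), newdata = test)),
  ridge = test_mse(predict(cv_ridge, newx = x_test, s = "lambda.min")),
  lasso = test_mse(predict(cv_lasso, newx = x_test, s = "lambda.min"))
)
sort(mse_results)   # the smallest test MSE identifies the best candidate
```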
📦 linear-regression-from-scratch-R/
├── 📁 analysis/
│   └── regression_analysis.R        # Complete R script
│
├── 📁 data/
│   └── Regression2024.csv           # Dataset used for analysis
│
├── 📁 graphs/
│   ├── bwd_adjr2.png                # Adjusted R² for backward stepwise
│   ├── bwd_bic.png                  # BIC plot for backward stepwise
│   ├── bwd_cp.png                   # Cp plot for backward stepwise
│   ├── bwd_cv_error.png             # CV error for backward stepwise
│   ├── corrplot_matrice.png         # Correlation matrix visualization
│   ├── cv_forward.png               # Cross-validation error (forward stepwise)
│   └── hyb_cv_error.png             # CV error for hybrid selection
│
├── 📁 instructions/
│   ├── project_brief_ENGLISH.pdf    # Assignment prompt (English)
│   └── project_brief_ITALIAN.pdf    # Assignment prompt (Italian)
│
├── 📁 report/
│   └── raw_output_analysis.txt      # Raw analysis output (log, console results)
│
├── LICENSE                          # License file
└── README.md                        # Project overview (this file)
- Language: R
- Libraries: `glmnet`, `car`, `leaps`, `boot`, `MASS`, `corrplot`, `reshape2`
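To reproduce the analysis locally, the required packages can be installed in one step:

```r
install.packages(c("glmnet", "car", "leaps", "boot", "MASS", "corrplot", "reshape2"))
```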
This project was created as part of the Data Science & Data Analysis course (2024) for the Master's Degree in Computer Engineering at the University of Salerno.
- Demonstrates practical knowledge of regression theory and implementation
- Hands-on experience with statistical computing in R
- Capable of end-to-end modeling: from raw matrix algebra to model selection
- Emphasizes explainability and statistical rigor
✏️ Got feedback or want to contribute? Feel free to open an Issue or submit a Pull Request!
regression, model-selection, linear-regression, ridge-regression, lasso, best-subset-selection, stepwise-selection, cross-validation, R, data-science, statistics, predictive-modeling, machine-learning, feature-selection, multicollinearity, MSE, OLS, glmnet, BIC, AIC
This project is licensed under the MIT License, a permissive open-source license that allows anyone to use, modify, and distribute the software freely, as long as credit is given and the original license is included.
In plain terms: use it, build on it, just don't blame us if something breaks.
⭐ Like what you see? Consider giving the project a star!