📈 Hands-on regression analysis project in R using a dataset with 30 predictors. Includes manual OLS implementation without lm(), p-value computation, and comparison with built-in functions. Applies stepwise selection (AIC/BIC), Ridge, and Lasso to minimize test error and identify key predictors.

francescopiocirillo/linear-regression-from-scratch-R

📈 Linear Regression Techniques in R

Made with R License: MIT

This project implements and compares several linear regression techniques in R. It was developed as a graduate-level assignment for the Data Science & Data Analysis course at the University of Salerno. The objective is to identify the best predictive model for a given dataset by minimizing test set error.

The project demonstrates proficiency in R programming, statistical analysis, and reproducible research workflows.

📂 This project is part of a series of regression analysis case studies. For another similar project, see regression-model-comparison-R.

📌 Objective

Given a dataset with 60 observations and 30 predictors (X1–X30), the goal is to:

  1. Assess correlation and multicollinearity among predictors.
  2. Implement OLS manually, without using lm() or summary().
  3. Compare manual results with built-in R functions.
  4. Apply model selection strategies (stepwise, Ridge, Lasso) to minimize test MSE.
  5. Identify statistically significant predictors.
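
As an illustration of steps 2 and 3, here is a minimal sketch of OLS through the normal equations, with p-values computed by hand and then checked against lm(). It is not the code from regression_analysis.R: the simulated data, the reduced number of predictors, and all variable names are assumptions made for the example.

```r
# Simulated stand-in data (hypothetical; the project reads data/Regression2024.csv)
set.seed(1)
n <- 60
p <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
y <- 1 + 2 * X[, 1] - 1.5 * X[, 3] + rnorm(n)

# Design matrix with an explicit intercept column
Xd <- cbind(Intercept = 1, X)

# Normal equations: beta_hat = (X'X)^{-1} X'y
XtX_inv  <- solve(crossprod(Xd))
beta_hat <- drop(XtX_inv %*% crossprod(Xd, y))

# Residual variance estimate with n - (p + 1) degrees of freedom
res    <- y - drop(Xd %*% beta_hat)
df     <- n - ncol(Xd)
sigma2 <- sum(res^2) / df

# Standard errors, t statistics, and two-sided p-values
se     <- sqrt(diag(XtX_inv) * sigma2)
t_stat <- beta_hat / se
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

manual <- cbind(Estimate = beta_hat, `Std. Error` = se,
                `t value` = t_stat, `Pr(>|t|)` = p_val)

# Cross-check against the built-in fit (row names differ, values should agree)
builtin <- coef(summary(lm(y ~ X)))
all.equal(unname(manual), unname(builtin))
```

Inverting X'X explicitly keeps the sketch close to the textbook formula; lm() uses a QR decomposition instead, which is numerically more stable when predictors are strongly correlated.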

๐ŸŒ Language Note

All code comments and the report are written in Italian, as this project was originally developed in an academic setting in Italy. Nonetheless, the structure, organization, and methodology follow international best practices in data science and statistical modeling.

🛠️ Techniques Implemented

  • 📉 Correlation analysis and multicollinearity check using VIF
  • 🧮 Manual OLS implementation (matrix algebra, p-values without summary())
  • ⚖️ Model selection techniques:
    • Forward selection (Cp, BIC, CV)
    • Backward elimination (Cp, BIC, CV)
    • Hybrid selection
    • Ridge regression
    • Lasso regression
  • 🧪 5-fold Cross-validation
  • 📊 MSE computation on test set to compare all models
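
The Ridge and Lasso items combined with 5-fold cross-validation could look roughly like the sketch below. It uses cv.glmnet from the glmnet package listed under Technologies; the simulated data and the variable names are assumptions for illustration and are not taken from the project script.

```r
library(glmnet)

# Simulated stand-in data (hypothetical): 60 observations, 30 predictors,
# of which only the first three actually influence the response
set.seed(1)
n <- 60
p <- 30
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
y <- drop(X[, 1:3] %*% c(2, -1.5, 1)) + rnorm(n)

# alpha = 0 fits Ridge, alpha = 1 fits Lasso;
# nfolds = 5 chooses lambda by 5-fold cross-validation
cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 5)
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 5)

# Lambda values with the lowest cross-validated error
cv_ridge$lambda.min
cv_lasso$lambda.min

# Lasso coefficients at lambda.min; predictors with a nonzero
# coefficient are the ones the Lasso keeps in the model
b <- as.matrix(coef(cv_lasso, s = "lambda.min"))[, 1]
selected <- setdiff(names(b)[b != 0], "(Intercept)")
selected
```

Ridge shrinks all coefficients but keeps every predictor, while the Lasso can set coefficients exactly to zero, which is why only the Lasso fit is used here to read off a set of selected predictors.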

📈 Methodology

  • Data split: 80% training / 20% testing
  • Models fitted on training set, evaluated on test set
  • All MSE values collected and best model selected
  • Final model's predictors and coefficients reported
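
The workflow above can be sketched in base R as follows. This is a hedged illustration, not the project's script: the data are simulated, the helper mse() is hypothetical, and forward selection is done with stats::step() and a BIC penalty rather than with the leaps functions the project lists.

```r
# Simulated stand-in data (hypothetical): 60 observations, 30 predictors
set.seed(1)
n <- 60
p <- 30
dat <- as.data.frame(matrix(rnorm(n * p), n, p))
names(dat) <- paste0("X", 1:p)
dat$Y <- 2 * dat$X1 - 1.5 * dat$X2 + dat$X3 + rnorm(n)

# 80% training / 20% test split
train_idx <- sample(n, size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Hypothetical helper: mean squared prediction error on new data
mse <- function(fit, newdata) mean((newdata$Y - predict(fit, newdata))^2)

# Candidate 1: full OLS model fitted on the training set only
full <- lm(Y ~ ., data = train)

# Candidate 2: forward selection with a BIC penalty (k = log(n_train))
null <- lm(Y ~ 1, data = train)
fwd  <- step(null, scope = formula(full), direction = "forward",
             k = log(nrow(train)), trace = 0)

# Collect the test MSE of every candidate and pick the smallest
test_mse <- c(full_ols = mse(full, test), forward_bic = mse(fwd, test))
test_mse
names(which.min(test_mse))

# Predictors and coefficients of the selected forward model
coef(fwd)
```

The test rows are never used for fitting or selection, so the collected MSE values estimate out-of-sample error for each candidate. With only 12 test observations the estimates are noisy, which is worth keeping in mind when two models score closely.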

📂 Repository Structure

📦 linear-regression-from-scratch-R/
├── 📁 analysis/
│   └── regression_analysis.R              # Complete R script
│
├── 📁 data/
│   └── Regression2024.csv                 # Dataset used for analysis
│
├── 📁 graphs/
│   ├── bwd_adjr2.png                      # Adjusted R² for backward stepwise
│   ├── bwd_bic.png                        # BIC plot for backward stepwise
│   ├── bwd_cp.png                         # Cp plot for backward stepwise
│   ├── bwd_cv_error.png                   # CV error for backward stepwise
│   ├── corrplot_matrice.png               # Correlation matrix visualization
│   ├── cv_forward.png                     # Cross-validation error (forward stepwise)
│   └── hyb_cv_error.png                   # CV error for hybrid selection
│
├── 📁 instructions/
│   ├── project_brief_ENGLISH.pdf          # Assignment prompt (English)
│   └── project_brief_ITALIAN.pdf          # Assignment prompt (Italian)
│
├── 📁 report/
│   └── raw_output_analysis.txt            # Raw analysis output (log, console results)
│
├── LICENSE                                # License file
└── README.md                              # Project overview (this file)

📊 Technologies

  • Language: R
  • Libraries: glmnet, car, leaps, boot, MASS, corrplot, reshape2

🎓 About the Project

This project was created as part of the Data Science & Data Analysis course (2024) for the Master's Degree in Computer Engineering at the University of Salerno.

👥 Team – University of Salerno

🔍 Key Highlights

  • Demonstrates practical knowledge of regression theory and implementation
  • Shows hands-on experience with statistical computing in R
  • Covers end-to-end modeling, from raw matrix algebra to model selection
  • Emphasizes explainability and statistical rigor

📬 Contacts

✉️ Got feedback or want to contribute? Feel free to open an Issue or submit a Pull Request!

📈 SEO Tags

regression, model-selection, linear-regression, ridge-regression, lasso, best-subset-selection, stepwise-selection, cross-validation, R, data-science, statistics, predictive-modeling, machine-learning, feature-selection, multicollinearity, MSE, OLS, glmnet, BIC, AIC

📄 License

This project is licensed under the MIT License, a permissive open-source license that allows anyone to use, modify, and distribute the software freely, as long as credit is given and the original license is included.

In plain terms: use it, build on it, just don't blame us if something breaks.

⭐ Like what you see? Consider giving the project a star!
