Shriyaak/RegressionModels

In this repository, I have explored various regression algorithms.

SIMPLE LINEAR REGRESSION:

In this project, I begin by exploring Simple Linear Regression, which is a foundational regression technique used to model the relationship between a single independent variable (X1) and a dependent variable (y). The goal of this model is to find the best-fitting line that predicts y based on the value of X1.
The general linear regression equation is:
y = β0 + β1x1 + β2x2 + ⋯ + βpxp + ϵ
where,
y = dependent variable (target/output)
β0 = intercept (the value of y when all xi = 0)
x1, x2, ..., xp = independent variables (features/predictors)
β1, β2, ..., βp = coefficients of the independent variables (weights showing the impact of each predictor on y)
ϵ = error term (captures the variability in y not explained by the predictors)
Simple Linear Regression is the special case with a single predictor, so the model reduces to y = β0 + β1x1 + ϵ.

Y
|
|            +
|        +
|    +
|+_______________ X1

Objective: The primary goal of Simple Linear Regression is to find the values of the coefficients β0 and β1 that minimize the sum of squared errors (SSE), i.e. the sum of the squared differences between the actual values yi and the predicted values y^i:

SSE = ∑ (i = 1 to n) (yi − y^i)^2

where,
yi = actual value of the response variable
y^i = predicted value (y^i = β0 + β1xi1 + ⋯ + βpxip)
The best-fit line is the one for which the sum of the squared residuals is minimised.
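As a quick illustration of the SSE formula, here is a minimal Python sketch (the numbers are made up for illustration, not taken from the actual dataset):

```python
import numpy as np

# Hypothetical actual and predicted salary values, just to illustrate the SSE formula
y_actual = np.array([39343.0, 46205.0, 37731.0, 43525.0])
y_pred   = np.array([40000.0, 45000.0, 39000.0, 42000.0])

residuals = y_actual - y_pred        # yi - y^i for each observation
sse = np.sum(residuals ** 2)         # SSE = sum of squared residuals
print(sse)
```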

Data Used:
YearsExperience (X1): The independent variable representing years of experience.
Salary (y): The dependent variable representing the salary of an individual.
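Below is a minimal scikit-learn sketch of fitting this model; the file name Salary_Data.csv and the train/test split settings are assumptions, not necessarily what the notebook in this repo uses:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed file and column names; adjust to match the actual dataset
data = pd.read_csv("Salary_Data.csv")
X = data[["YearsExperience"]]   # independent variable (X1)
y = data["Salary"]              # dependent variable (y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)          # estimates beta0 (intercept_) and beta1 (coef_)

print(regressor.intercept_, regressor.coef_)
y_pred = regressor.predict(X_test)       # predicted salaries for the test set
```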


MULTIPLE LINEAR REGRESSION:


Next, I explored Multiple Linear Regression to predict company profit based on several independent variables: R&D Spend, Administration Spend, Marketing Spend, and State. The goal is to analyse which factors contribute most to profit and to determine which companies are performing better, not just based on maximum profit, but also by considering how each company allocates resources.
The general equation for Multiple Linear Regression is:
y = β0 +β1 x1 +β2 x2 + ⋯ +βp xp + ϵ

To find the best model, we use Backward Elimination (this method starts from the "all-in" model, i.e. with every predictor included):
Step 1: Select a significance level to stay in the model, e.g. α = 0.05
Step 2: Fit the full model with all possible predictors
Step 3: Consider the predictor with the highest p-value; if P > α, go to Step 4, otherwise FINISH
Step 4: Remove that predictor
Step 5: Fit the model without this variable
Step 6: Go back to Step 3 and repeat until every remaining predictor has P < α, then FINISH
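Here is a minimal sketch of this procedure using statsmodels OLS p-values; the file name 50_Startups.csv, the exact column names, and the one-hot encoding of State are assumptions about how the data is stored:

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names; State is one-hot encoded because it is categorical
data = pd.read_csv("50_Startups.csv")
X = pd.get_dummies(data[["R&D Spend", "Administration", "Marketing Spend", "State"]],
                   columns=["State"], drop_first=True, dtype=float)
y = data["Profit"]

X = sm.add_constant(X)   # adds the intercept term beta0
alpha = 0.05             # significance level to stay in the model

# Backward elimination: repeatedly drop the predictor with the highest p-value
while True:
    model = sm.OLS(y, X).fit()
    p_values = model.pvalues.drop("const")        # p-value of each remaining predictor
    if p_values.empty or p_values.max() <= alpha:
        break                                     # every remaining predictor is significant
    X = X.drop(columns=[p_values.idxmax()])       # remove the least significant predictor

print(model.summary())
```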

Data Used:
The dataset includes:
Independent Variables (X1, X2, X3, X4): R&D Spend, Administration Spend, Marketing Spend, and State.
Dependent Variable (y): Profit, the profit of the company, which we aim to predict from the independent variables.


POLYNOMIAL LINEAR REGRESSION:

Polynomial Linear Regression is an extension of simple linear regression that allows us to model non-linear relationships between the dependent variable (y) and the independent variables (x). In simple linear regression, the relationship between the independent variable and the dependent variable is represented by a straight line. However, when the relationship between the variables is more complex and not linear, we use polynomial regression to fit a curved line to the data.
y = b0 + b1x1 + b2(x1)^2 + ⋯ + bn(x1)^n

Y 
|             +        
|            +          
|           +           
|         +            
|      +                
|__+_______________ x 
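A minimal scikit-learn sketch of polynomial regression is shown below, using toy data; the polynomial degree of 4 is just an illustrative choice:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy non-linear data, just to illustrate the technique
X = np.arange(1, 11, dtype=float).reshape(-1, 1)    # single feature x1
y = 0.5 * X.ravel() ** 2 + 3 * X.ravel() + 2

# Build the polynomial terms x1, x1^2, ..., x1^n (here n = 4, an arbitrary choice)
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# Fit an ordinary linear regression on the polynomial terms
model = LinearRegression()
model.fit(X_poly, y)

y_pred = model.predict(poly.transform([[6.5]]))      # predict for a new x1 value
print(y_pred)
```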


SUPPORT VECTOR REGRESSION: (linear-SVR)

In Support Vector Regression (SVR), much like Simple Linear Regression (SLR), we consider a regression line. However, instead of just a single line, SVR introduces a "tube" around the line, referred to as the ε-insensitive tube (where ε = epsilon). The tube extends a distance of ε above and below the line, measured vertically along the y-axis (not perpendicular to the line). The purpose of this tube is to define a margin of error within which we disregard any deviations from the regression line.

Key Concepts:
ε-insensitive Tube:

  • Think of this tube as a margin of tolerance for the model. Any data points that fall inside the tube are considered "close enough" to the regression line, and their errors are ignored.
  • Only points that fall outside the tube are treated as contributing to the error, and the magnitude of this error is calculated as the distance between the point and the nearest boundary of the tube (not the regression line itself).

Slack Variables (ξᵢ and ξᵢ*):
For points outside the tube, errors are represented by slack variables:
ξᵢ* for points below the tube.
ξᵢ for points above the tube.

Objective Function:
The goal of SVR is to minimize the following function:
(1/2) * ||w||^2 + C * ∑(ξi + ξi*) -> min
(The relationship SVR models is implicit: the data can be transformed so that complex, non-linear dependencies are captured without an explicit formula for the output.)
where,
||w||^2 : regularization term that controls the model's complexity
C: A constant that balances the trade-off between fitting the data and maintaining a simpler model.
ξi, ξi*: Slack variables representing errors for points outside the ε-insensitive tube.
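Below is a minimal scikit-learn sketch of a linear SVR on toy data; feature scaling is included because SVR is sensitive to the scale of the inputs, and the epsilon and C values are illustrative, not tuned:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# Toy data: a single feature and a roughly linear target
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + np.random.RandomState(0).normal(0, 1, 10)

# SVR is sensitive to feature scale, so standardize both X and y
sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

# Linear SVR with an epsilon-insensitive tube; C balances flatness against errors
regressor = SVR(kernel="linear", epsilon=0.1, C=1.0)
regressor.fit(X_scaled, y_scaled)

# Predict for a new point and transform back to the original target scale
x_new = sc_X.transform([[6.5]])
y_pred = sc_y.inverse_transform(regressor.predict(x_new).reshape(-1, 1))
print(y_pred)
```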


DECISION TREE REGRESSION:

 
 Regression Tree 
    X2
    |      +  +
    | +  +   +  +
    | + + + + 
    | +  +     +
    |_____________ X1  
   /
  /
 
Y 
  • We have two independent variables, X1 and X2.
  • What we are predicting is a third, dependent variable (Y is the third dimension).

So once you run the decision tree algorithm in the regression sense of it, your scatter plot will be split into segments, and each of these segments is called a leaf.

  • The final leaves are called terminal leaves.
 
 SCATTER PLOT OF THE DATA: 
     X2 split1  split3
    | +  |    +|   +
    |  + |  +  |+ +
    | +  |+   +|
    | + +| + + |
170 |    |-----|----- split2 
    | +  |+     +
    |____|________ X1 
         20    50
  • The algorithm decides where to split the data by checking whether a split actually improves how well the data is grouped. For classification trees this is measured with information entropy; for regression trees the equivalent check is how much a split reduces the variance (squared error) of Y within each group. If a split makes the groups more homogeneous, it is considered useful.
  • The splitting process continues until the improvement (information gain) becomes too small. When the additional value from a split drops below a certain threshold, the algorithm stops making further splits.
 
 Decision Tree: 

So our 1st split happened at X1 = 20:

                 X1 < 20 
             yes  /    \  no 
                 /      \

Split 2 happens at X2 = 170 and only applies to points with X1 ≥ 20, so:

                  X1 < 20 
             yes  /    \  no 
                 /      \ 
         300.05       X2 < 170
                      yes /  \ no 
                         /    \
                     65.7    X1 < 50 
                         yes /  \ no 
                            /    \
                        -64.01   0.7 } average y-values 
  • We check which group (leaf) the new point belongs to, based on its features.
  • Each group already has some known Y-values.
  • We find the average of those Y-values.
  • The new point is given this average as its predicted Y-value
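A minimal scikit-learn sketch of decision tree regression on toy data is shown below; the feature values and targets are made up, just to show how a new point receives the average Y of its leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: two independent variables X1, X2 and a dependent variable Y
X = np.array([[10, 100], [15, 150], [25, 120], [30, 200], [40, 180], [55, 160]], dtype=float)
y = np.array([300.0, 310.0, 70.0, -60.0, -65.0, 1.0])

# The default criterion picks splits that reduce the squared error of Y within each leaf
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)

# A new point falls into one leaf and receives the average Y of the training points in that leaf
new_point = np.array([[35.0, 150.0]])
print(regressor.predict(new_point))
```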


RANDOM FOREST REGRESSION:

Step 1: Pick K data points at random from the training set.
Step 2: Build the decision tree associated with these K data points.
Step 3: Choose the number Ntree of trees you want to build and repeat Steps 1 & 2.
Step 4: For a new data point, make each one of your Ntree trees predict the value of Y for the data point in question, and assign the new data point the average across all of the predicted Y values.
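A minimal scikit-learn sketch of random forest regression on toy data is shown below; n_estimators plays the role of Ntree, and the value 100 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data standing in for the real dataset
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000, 150000,
              200000, 300000, 500000, 1000000], dtype=float)

# n_estimators is Ntree, the number of decision trees built on random subsets of the data
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X, y)

# Each of the 100 trees predicts a Y value; the forest returns their average
print(regressor.predict([[6.5]]))
```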