Regression

Here is a set of regression algorithms: Logistic, Ridge, and Lasso regression, all implemented in Python.

Introduction

"Regression" is an experimental project. Through self-study on the principle of Regression, I finally completed the realization of the following two models, which are implemented in Python. Some suggestions and questions from the experiment were also put forward later.

Experiment Conclusions:

    1. Logistic regression: in the logistic regression experiment, null_deviance and residual_deviance play a role similar to the residual sum of squares in linear regression, where null_deviance is the residual_deviance of the empty (intercept-only) model. The R-squared analogue built from them measures how much of the variation in the dependent variable is explained by the independent variables in the model. The experiment showed that the choice of classification threshold is very important: thresholds of 0.4 and 0.5 produced accuracy results that differed by 0.4. Finally, the time complexity is O(n·p + p). See the threshold-sensitivity sketch after this list.
    2. Ridge regression: the adjustment parameter selected by the minimum-MSE rule is 0.01, with an MSE of 92017.9. The correlation coefficient between the predicted and actual values is fairly high, at 0.74, but the R-squared is lower, at 0.51. Fitting separately with an adjustment parameter of 0 (plain linear regression) gives similar results, indicating that ridge regression has little effect here. Ridge regression is motivated (among other reasons) by the instability of least squares under strong collinearity, and the correlation matrix of this data set shows little collinearity, so the limited improvement is in line with expectations. The coefficient-path plot shows that almost all coefficients are compressed toward zero as the lambda value increases; only one variable first increases and then decreases. However, even when the coefficients are compressed close to 0, every variable is still retained, which is a disadvantage of ridge regression. Finally, the time complexity of ridge regression is the same as that of least squares. See the lambda-selection sketch after this list.
    3. Lasso regression: the plot of the lasso coefficients against the adjustment parameter shows that they can shrink all the way to 0, so compared with ridge regression, lasso screens out variables. On the evaluation metrics, the mean squared error of lasso is slightly higher than that of ridge regression, but the R-squared values are almost equal. Therefore, when there are many variables, lasso can fit a model with an effect similar to ridge regression using fewer of them, which reduces model complexity and makes the model easier to interpret. See the coordinate-descent sketch after this list.
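A minimal sketch of the threshold sensitivity and deviance measures from conclusion 1, fitting logistic regression by gradient ascent. The synthetic data, learning rate, and iteration count are assumptions for illustration, not the repository's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit by gradient ascent on the log-likelihood; each step costs O(n*p).
Xb = np.hstack([np.ones((len(X), 1)), X])  # add intercept column
w = np.zeros(Xb.shape[1])
for _ in range(500):
    w += 0.1 * Xb.T @ (y - sigmoid(Xb @ w)) / len(y)

p = sigmoid(Xb @ w)

def deviance(y, p, eps=1e-12):
    # -2 * log-likelihood; plays the role of RSS in linear regression.
    return -2 * np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

print("null_deviance:    ", deviance(y, np.full_like(y, y.mean())))
print("residual_deviance:", deviance(y, p))
for threshold in (0.4, 0.5):  # accuracy depends on the chosen cutoff
    print(f"threshold={threshold}: accuracy={np.mean((p >= threshold) == y):.3f}")
```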
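A sketch of selecting the ridge adjustment parameter by the minimum-MSE rule from conclusion 2, using the closed-form ridge solution. The data, candidate lambda list, and holdout split are assumptions; the 0.01 / 92017.9 figures come from the original data set, which is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Minimum-MSE rule on a holdout split; lam=0 recovers plain least squares.
X_tr, X_te, y_tr, y_te = X[:70], X[70:], y[:70], y[70:]
best = min(
    ((lam, np.mean((X_te @ ridge_fit(X_tr, y_tr, lam) - y_te) ** 2))
     for lam in [0.0, 0.01, 0.1, 1.0, 10.0]),
    key=lambda t: t[1],
)
print("best lambda, MSE:", best)
```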
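A sketch of lasso fitted by coordinate descent (the method named in the improvements section below), showing that some coefficients shrink exactly to 0 as the adjustment parameter grows. The data and lambda grid are made up for illustration:

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent: cycle over coordinates, applying the
    # soft-thresholding operator to each partial-residual fit.
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]  # partial residual excluding j
            w[j] = soft_threshold(X[:, j] @ r, n * lam) / col_sq[j]
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)
for lam in (0.01, 0.1, 1.0):
    print(lam, np.round(lasso_cd(X, y, lam), 3))  # some coefs hit exactly 0
```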

Model Improvements & Problems:

    1. Parameter selection: all three models involve parameter selection. Several aspects matter. 1) How strongly a parameter affects model performance: this depends on both the data set and the model itself. For example, logistic regression is very sensitive to the classification threshold, so it should be chosen carefully, while ridge regression (on the data set used here) is not sensitive to the adjustment parameter, so a coarser grid can be searched. 2) The evaluation metric used for selection: common choices are the mean squared error, the adjusted R-squared, AIC, and BIC. AIC and BIC deserve particular attention in the lasso model when there are many variables, because both penalize an excessive number of variables. 3) How the metric is computed: techniques such as cross-validation make the metric more reliable (see the cross-validation sketch after this list).
    2. Standardization of variables: different models respond differently to variable scale. Linear regression is unaffected by it, but the coefficient estimates of ridge regression depend on both the adjustment parameter and the variable scales, so variables are generally standardized before fitting ridge regression (see the standardization sketch after this list). However, I am still unsure how standardizing the independent and dependent variables together, versus standardizing only some of them, affects model performance, and how standardizing the dependent variable affects the evaluation metrics; I hope to explore this later.
    3. Coefficient estimation method: in this experiment, gradient ascent, coordinate descent, and the normal equations are used to find the best coefficient estimates for the three models. The general idea is to find a coefficient vector that minimizes the cost function. The drawback of the normal equations is that they fail when X'X is not invertible, so they are often replaced by gradient descent (see the sketch after this list).
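A sketch of the cross-validation idea from point 1: estimating the MSE of each candidate adjustment parameter by k-fold cross-validation rather than a single split. The fold count, data, and lambda grid are assumptions:

```python
import numpy as np

def cv_mse(X, y, lam, k=5):
    # K-fold cross-validated MSE of ridge regression for one lambda;
    # averaging over folds gives a more reliable estimate than one split.
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        tr = np.setdiff1d(idx, fold)
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=100)
print({lam: round(cv_mse(X, y, lam), 3) for lam in [0.0, 0.01, 0.1, 1.0, 10.0]})
```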
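A sketch of point 2: standardizing the predictors before ridge regression, since a single penalty lambda otherwise hits large-scale and small-scale columns unevenly. The data and column scales are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])  # very different scales
y = X @ np.array([2.0, 0.03, 50.0]) + rng.normal(size=100)

# Standardize predictors to zero mean and unit variance; center y.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_c = y - y.mean()

lam = 0.1
w_std = np.linalg.solve(X_std.T @ X_std + lam * np.eye(3), X_std.T @ y_c)
w_raw = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y_c)

print("standardized fit:        ", np.round(w_std, 3))
print("raw fit, rescaled to std:", np.round(w_raw * X.std(axis=0), 3))
# The two disagree: the ridge penalty is not scale-equivariant, which is
# why variables are standardized before fitting.
```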
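A sketch of point 3: the normal equations failing when X'X is singular (forced here by duplicating a column), while gradient descent on the squared-error cost still reaches a minimizer. The data, step size, and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
X = np.hstack([X, X[:, :1]])  # duplicate a column, so X'X is singular
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)

# Normal equations: (X'X)^-1 X'y fails because X'X is not invertible here.
try:
    w = np.linalg.solve(X.T @ X, X.T @ y)
except np.linalg.LinAlgError:
    print("normal equations failed: X'X is singular")

# Gradient descent still converges to a minimizer of the cost; from a zero
# start, the two duplicated columns end up sharing the coefficient equally.
w = np.zeros(X.shape[1])
for _ in range(2000):
    w -= 0.01 * X.T @ (X @ w - y) / len(y)
print("gradient descent solution:", np.round(w, 3))
```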
