The aim of this project is to develop a multiple linear regression model that predicts car prices from the available independent variables. The model lets management understand how prices vary with those variables, so they can adjust car design and business strategy to meet target price levels.
The numerical variables are plotted against each other using a seaborn pair plot to observe the relationships among the features:
- The distribution of prices is not normal and a log transformation could improve the model performance
- Compression ratio has outliers in the data
- There seem to be a lot of correlated variables which might lead to the problem of multicollinearity
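
The three observations above can also be checked numerically rather than by eye. The following is a minimal sketch on synthetic data; the column names (`enginesize`, `horsepower`, `compressionratio`, `price`), the 1.5 × IQR outlier rule, and the 0.8 correlation threshold are assumptions made for illustration, not values taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are illustrative assumptions.
rng = np.random.default_rng(0)
n = 200
engine_size = rng.normal(130, 40, n).clip(60, 320)
df = pd.DataFrame({
    "enginesize": engine_size,
    # Deliberately correlated with enginesize.
    "horsepower": engine_size * 0.8 + rng.normal(0, 5, n),
    # Mostly around 9, with a small cluster of extreme values near 22.
    "compressionratio": np.r_[rng.normal(9, 0.5, n - 10), rng.normal(22, 0.5, 10)],
    # Log-normal, hence right-skewed, target.
    "price": np.exp(rng.normal(9.3, 0.5, n)),
})

# 1. Skewness of the target before and after a log transform.
skew_raw = df["price"].skew()
skew_log = np.log(df["price"]).skew()

# 2. Outliers flagged by the 1.5 * IQR rule.
def iqr_outliers(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

n_outliers = int(iqr_outliers(df["compressionratio"]).sum())

# 3. Feature pairs whose absolute correlation exceeds the threshold.
corr = df.drop(columns="price").corr().abs()
cols = list(corr.columns)
pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.8
]

print(f"price skew: raw={skew_raw:.2f}, log={skew_log:.2f}")
print(f"compressionratio outliers: {n_outliers}")
print("highly correlated pairs:", pairs)
```

A skewness closer to zero after the log transform supports transforming the target, and any pair listed in `pairs` is a candidate source of multicollinearity.
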
- The target variable, price, is skewed, so a log transformation is applied to bring its distribution closer to normal.
- Categorical variables are converted to numerical dummy variables.
- The data is normalized and split into train and test sets.
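
The preparation steps above could look roughly like the sketch below. It runs on synthetic data, and the column names, the 70/30 split, and the choice of `MinMaxScaler` are assumptions, since the text does not state which scaler or split ratio was used. The sketch also splits before scaling so that the scaler is fitted on the train set only, which is a design choice of this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; column names are illustrative assumptions.
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "enginesize": rng.normal(130, 40, n),
    "curbweight": rng.normal(2500, 500, n),
    "fueltype": rng.choice(["gas", "diesel"], n),
    "carbody": rng.choice(["sedan", "hatchback", "wagon"], n),
    "price": np.exp(rng.normal(9.3, 0.5, n)),
})

# Log-transform the skewed target and drop the raw price.
df["log_price"] = np.log(df["price"])
df = df.drop(columns="price")

# Convert categoricals to dummy variables; drop_first removes one level per
# variable so the dummies are not perfectly collinear with the intercept.
df = pd.get_dummies(df, columns=["fueltype", "carbody"], drop_first=True, dtype=int)

# Split, then scale the numeric predictors using train-set statistics only.
train, test = train_test_split(df, train_size=0.7, random_state=42)
train, test = train.copy(), test.copy()

num_cols = ["enginesize", "curbweight"]
scaler = MinMaxScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

X_train, y_train = train.drop(columns="log_price"), train["log_price"]
X_test, y_test = test.drop(columns="log_price"), test["log_price"]
```

Because the model is fitted to `log_price`, its predictions need `np.exp` applied to return to the original price scale.
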
The heat map of the feature correlations is shown below.
The model was built using the following steps:
- The linear regression model summary on the train data is observed along with VIF (Variance inflation factor)
- As there are many variables to eliminate using p-values and VIF, RFE (recursive feature elimination) is used to select the most significant features with the least correlation with other features.
- Both manual and automated feature elimination are employed to tune the model on the train set.
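
As an illustration of the two elimination tools named above, the sketch below computes VIF values and runs RFE on synthetic data. The feature names `x1` to `x4` and the choice of keeping two features are hypothetical. VIF is computed here as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 − R²) from regressing each feature on the others with an intercept; the project itself may have used a library helper instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data with hypothetical feature names.
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)                  # unrelated to the target
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})
y = 3.0 * x1 + 2.0 * x3 + rng.normal(scale=0.5, size=n)

def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    inv_corr = np.linalg.inv(frame.corr().to_numpy())
    return pd.Series(np.diag(inv_corr), index=frame.columns)

vif_values = vif(X)
print(vif_values.round(1))

# Automated step: RFE repeatedly fits the model and drops the weakest feature.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
selected = list(X.columns[rfe.support_])
print("RFE kept:", selected)

# Manual step: re-check VIF on the surviving features.
print(vif(X[selected]).round(1))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity, so features above that level are the natural candidates for manual removal after RFE.
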
As the test set R-squared is not similar to the train set R-squared, the following changes can be implemented to improve model performance:
- Drop collinear variables
- Explore and drop outliers in the data set
- Transform features with skewed distributions
- Apply principal component analysis (PCA) as an alternative to feature elimination to improve model performance
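
The last suggestion could be tried along the following lines. This is a sketch on synthetic, deliberately collinear data rather than the project's data set; the 95% explained-variance threshold and the 70/30 split are assumptions. It compares train and test R-squared after replacing the correlated features with principal components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three latent factors, each observed three times with noise,
# giving nine strongly collinear features.
rng = np.random.default_rng(3)
n = 300
base = rng.normal(size=(n, 3))
X = np.hstack([base + rng.normal(scale=0.1, size=(n, 3)) for _ in range(3)])
y = base @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Scale, keep enough components to explain 95% of the variance, then regress.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LinearRegression(),
)
model.fit(X_train, y_train)

r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
n_components = model.named_steps["pca"].n_components_

print(f"components kept: {n_components}")
print(f"R-squared: train={r2_train:.3f}, test={r2_test:.3f}")
```

The trade-off is interpretability: principal components are combinations of the original features, so the coefficients no longer map directly onto individual design variables.
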