Step 1: Import Libraries

import sys
!{sys.executable} -m pip install --upgrade pip --user
!{sys.executable} -m pip install xlrd
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install seaborn
!{sys.executable} -m pip install scikit-learn
import shutil
import pandas as pd
import numpy as np
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
import os
import requests
import sklearn as sc
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

Downloading House Data

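# NOTE: the download URL and the archive path are left empty here; fill them in before running.
r = requests.get('', allow_redirects=True)
with open('', 'wb') as f:
    f.write(r.content)
with ZipFile('', 'r') as zipObj:
    zipObj.extractall()  # assumed step: unpack the archive so DATA/house_dataset.csv exists
df = pd.read_csv('DATA/house_dataset.csv').iloc[:,2:]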

Step 2: Exploratory Data Analysis (EDA)

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 6 2010 WD Normal NaN
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Gar2 12500 6 2010 WD Normal NaN
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv NaN 0 3 2010 WD Normal NaN
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 6 2010 WD Normal NaN
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 0 NaN NaN NaN 0 1 2010 WD Normal NaN

5 rows × 81 columns

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
numeric_features = df.select_dtypes(include=[np.number])
categorical_features = df.select_dtypes(include="object")
print("No. of numerical features ",len(numeric_features.columns))
print("No. of categorical features ",len(categorical_features.columns))
No. of numerical features  38
No. of categorical features  43


Let’s take a broader look at the 10 features most strongly correlated with the sale price:

k = 10 #number of variables for heatmap
corrmat = df.corr()
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
f, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(df[cols].corr(), vmax=.8, square=True);
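To make the selection step concrete, here is a minimal sketch on a small synthetic frame (the column names `a`, `b`, `c`, and `target` are made up for illustration). It shows what `corrmat.nlargest(k, ...)` returns: the target itself comes first with correlation 1.0, followed by the most positively correlated columns. Note that `nlargest` ranks by signed correlation, so a strongly negative correlation would not be picked up this way.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "target" depends strongly on "a", weakly on "b", not on "c".
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
    "c": rng.normal(size=200),
})
toy["target"] = 3.0 * toy["a"] + 0.5 * toy["b"] + rng.normal(scale=0.1, size=200)

# Pairwise Pearson correlations between all columns.
toy_corr = toy.corr()

# Rows are sorted by their correlation with "target"; the target itself
# (correlation 1.0) always comes first, so k=3 yields the target plus two features.
top_cols = toy_corr.nlargest(3, "target")["target"].index
print(list(top_cols))
```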


cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath','GarageArea','1stFlrSF']


Do we have missing data?

total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
Total Percent
PoolQC 2909 0.996574
MiscFeature 2814 0.964029
Alley 2721 0.932169
Fence 2348 0.804385
SalePrice 1459 0.499829
FireplaceQu 1420 0.486468
LotFrontage 486 0.166495
GarageQual 159 0.054471
GarageYrBlt 159 0.054471
GarageFinish 159 0.054471
GarageCond 159 0.054471
GarageType 157 0.053786
BsmtExposure 82 0.028092
BsmtCond 82 0.028092
BsmtQual 81 0.027749
BsmtFinType2 80 0.027407
BsmtFinType1 79 0.027064
MasVnrType 24 0.008222
MasVnrArea 23 0.007879
MSZoning 4 0.001370
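The same table can be reproduced on a tiny frame to check how the two columns are computed. This is a sketch with made-up data; it uses `isnull().mean()`, which gives the same fraction as `isnull().sum() / isnull().count()`. The `Percent` column is a fraction between 0 and 1, not a percentage.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): "x" has 2 missing values, "y" has 1, "z" has none.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": ["a", "b", None, "d"],
    "z": [1, 2, 3, 4],
})

total = toy.isnull().sum()       # number of missing entries per column
percent = toy.isnull().mean()    # fraction of missing entries per column

missing = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
missing = missing.sort_values("Total", ascending=False)
print(missing)
```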
clean_data.quantile([0.0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1.0])
SalePrice OverallQual GrLivArea GarageCars TotalBsmtSF FullBath GarageArea 1stFlrSF
0.00 34900.00 1.0 334.00 0.0 0.00 0.0 0.00 334.00
0.01 61815.97 3.0 675.42 0.0 0.00 1.0 0.00 520.00
0.05 88000.00 4.0 861.00 0.0 455.25 1.0 0.00 665.90
0.10 106475.00 5.0 923.80 1.0 600.00 1.0 240.00 744.80
0.25 129975.00 5.0 1126.00 1.0 793.00 1.0 320.00 876.00
0.50 163000.00 6.0 1444.00 2.0 989.50 2.0 480.00 1082.00
0.75 214000.00 7.0 1743.50 2.0 1302.00 2.0 576.00 1387.50
0.90 278000.00 8.0 2153.20 3.0 1614.00 2.0 758.00 1675.00
0.95 326100.00 8.0 2464.20 3.0 1776.15 2.0 856.15 1830.10
0.99 442567.01 10.0 2935.72 3.0 2198.30 3.0 1019.49 2288.02
1.00 755000.00 10.0 5642.00 5.0 6110.00 4.0 1488.00 5095.00
low = .01
high = .99

quant_df = clean_data.quantile([low, high])
SalePrice OverallQual GrLivArea GarageCars TotalBsmtSF FullBath GarageArea 1stFlrSF
0.01 61815.97 3.0 675.42 0.0 0.0 1.0 0.00 520.00
0.99 442567.01 10.0 2935.72 3.0 2198.3 3.0 1019.49 2288.02
clean_data = clean_data.loc[(clean_data["GrLivArea"] < quant_df.loc[high, "GrLivArea"]) &
                            (clean_data["TotalBsmtSF"] > quant_df.loc[low, "TotalBsmtSF"]) &
                            (clean_data["TotalBsmtSF"] < quant_df.loc[high, "TotalBsmtSF"]) &
                            (clean_data["GarageArea"] > quant_df.loc[low, "GarageArea"]) &
                            (clean_data["GarageArea"] < quant_df.loc[high, "GarageArea"]) &
                            (clean_data["1stFlrSF"] < quant_df.loc[high, "1stFlrSF"]) &
                            (clean_data["SalePrice"] > quant_df.loc[low, "SalePrice"]) &
                            (clean_data["SalePrice"] < quant_df.loc[high, "SalePrice"])]
X = clean_data.loc[:, ["OverallQual","GrLivArea","GarageCars","TotalBsmtSF","FullBath","GarageArea","1stFlrSF"]]
y = np.log(clean_data["SalePrice"])
sns.heatmap(clean_data.corr(), linewidth=1, annot=True)
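The quantile filter above can be written as a small reusable function. The sketch below is a simplified variant, not the notebook's exact filter: the helper name `trim_by_quantiles` is hypothetical, and it applies both the lower and the upper bound to every listed column, whereas the notebook applies only an upper bound to some of them. Because the comparisons are strict, a column whose 1% quantile is 0 (as for `TotalBsmtSF` and `GarageArea` in the table above) loses every row with the value 0.

```python
import numpy as np
import pandas as pd

def trim_by_quantiles(frame, columns, low=0.01, high=0.99):
    """Keep rows whose values lie strictly between the low and high quantiles
    of every column in `columns` (hypothetical helper, not from the notebook)."""
    bounds = frame[columns].quantile([low, high])
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        mask &= (frame[col] > bounds.loc[low, col]) & (frame[col] < bounds.loc[high, col])
    return frame.loc[mask]

# Toy data (hypothetical): 100 rows with one extreme price.
toy = pd.DataFrame({
    "area": np.arange(1.0, 101.0),
    "price": np.arange(1.0, 101.0) * 10.0,
})
toy.loc[99, "price"] = 1e6

trimmed = trim_by_quantiles(toy, ["area", "price"])
print(len(toy), len(trimmed))
```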


Preparing the data

Feature scaling

Before fitting, we standardize each feature with the following formula:

$$x'= \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature, both estimated from the data.

X = (X - X.mean()) / X.std()
X = np.c_[np.ones(X.shape[0]), X] 
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 50)
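As a quick sanity check of the standardization formula, the sketch below applies it to a toy frame (made-up values) and confirms that every column ends up with mean 0 and standard deviation 1. Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default. The last lines mirror the `np.c_` step, which prepends a column of ones to the feature matrix.

```python
import numpy as np
import pandas as pd

# Toy features (hypothetical values).
toy = pd.DataFrame({
    "area": [50.0, 80.0, 110.0, 140.0],
    "rooms": [2.0, 3.0, 3.0, 4.0],
})

# Column-wise standardization: subtract the mean, divide by the sample std.
scaled = (toy - toy.mean()) / toy.std()
print(scaled.mean().round(12))
print(scaled.std())

# Prepend a column of ones, as in the notebook.
design = np.c_[np.ones(scaled.shape[0]), scaled]
print(design.shape)
```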

Linear Regression

Simple Linear Regression

Simple linear regression uses the traditional slope-intercept form, where the slope $b$ and the intercept $a$ are the coefficients we try to “learn” so that the model produces the most accurate predictions. $X$ represents our input data and $Y$ is our prediction.

$$Y = bX + a$$
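For a single input, the least-squares values of $a$ and $b$ have a closed form: $b$ is the covariance of $X$ and $Y$ divided by the variance of $X$, and $a = \bar{Y} - b\bar{X}$. A minimal sketch on made-up, noise-free data, where the fit should recover the line exactly:

```python
import numpy as np

# Toy data (hypothetical): points on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by sum of squared deviations of x.
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: the fitted line passes through the point of means.
a = y_mean - b * x_mean

print(b, a)
```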

Multivariable Regression

A more complex, multivariable linear equation might look like this, where the $w_i$ are the coefficients, or weights, that our model will try to learn, and $w_0$ is the intercept.

$$ Y(x_1,x_2,x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$$

The variables $x_1, x_2, x_3$ represent the attributes, or distinct pieces of information, we have about each observation.
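To illustrate how the weights can be recovered, the sketch below solves the least-squares problem directly with `np.linalg.lstsq` on synthetic data. The data and the "true" weights are made up, and this is only one way to solve the problem, shown to make the equation concrete. A column of ones is prepended so that $w_0$ is estimated together with $w_1, w_2, w_3$.

```python
import numpy as np

# Synthetic data (hypothetical): 100 observations, 3 attributes, no noise.
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 3))
true_w = np.array([0.5, -1.0, 2.0])
true_w0 = 4.0
target = features @ true_w + true_w0

# Design matrix with a leading column of ones for the intercept w_0.
design = np.c_[np.ones(features.shape[0]), features]

# Least-squares solution of design @ weights ≈ target.
weights, *_ = np.linalg.lstsq(design, target, rcond=None)
w0, w = weights[0], weights[1:]

print(w0, w)
```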

lm = LinearRegression().fit(x_train, y_train)
print("Coefficients: ", lm.coef_)

result = lm.predict(x_test)
Coefficients:  [ 0.          0.13065804  0.11642741  0.03006242  0.07691202  0.00802758
  0.03020288 -0.00485464]
plt.scatter(y_test, result)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")


X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.797
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                     725.0
Date:                Wed, 16 Dec 2020   Prob (F-statistic):               0.00
Time:                        01:25:48   Log-Likelihood:                 590.35
No. Observations:                1298   AIC:                            -1165.
Df Residuals:                    1290   BIC:                            -1123.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
const         12.0472      0.004   2818.002      0.000      12.039      12.056
x1             0.1430      0.006     23.817      0.000       0.131       0.155
x2             0.1062      0.006     17.373      0.000       0.094       0.118
x3             0.0293      0.008      3.504      0.000       0.013       0.046
x4             0.0752      0.010      7.718      0.000       0.056       0.094
x5             0.0077      0.006      1.320      0.187      -0.004       0.019
x6             0.0270      0.008      3.381      0.001       0.011       0.043
x7            -0.0005      0.010     -0.054      0.957      -0.020       0.019
Omnibus:                      205.633   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              515.814
Skew:                          -0.857   Prob(JB):                    9.82e-113
Kurtosis:                       5.570   Cond. No.                         6.30
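Because the target is `np.log(SalePrice)` and the features are standardized, each coefficient can be read as a multiplicative effect on the price: a one-standard-deviation increase in a feature multiplies the predicted price by $e^{\text{coef}}$. The sketch below does this arithmetic for the `x1` coefficient from the summary above, assuming `x1` is the first column of `X` (`OverallQual`) and the other features are held fixed.

```python
import numpy as np

# Coefficient of x1 copied from the OLS summary above.
coef_x1 = 0.1430

# On a log target, adding coef to log(price) multiplies price by exp(coef).
multiplier = np.exp(coef_x1)
percent_change = (multiplier - 1.0) * 100.0

print(round(multiplier, 4), round(percent_change, 1))
```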


