import sys
!{sys.executable} -m pip install --upgrade pip --user
!{sys.executable} -m pip install xlrd
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install seaborn
!{sys.executable} -m pip install scikit-learn
import shutil
import pandas as pd
import numpy as np
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
import os
import requests
import sklearn as sc
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
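r = requests.get('https://www.dropbox.com/s/1fcws6aaodry54n/partii.zip?dl=1', allow_redirects=True)
r.raise_for_status()  # stop early if the download failed
with open('partii.zip', 'wb') as f:
    f.write(r.content)
with ZipFile('partii.zip', 'r') as zipObj:
    zipObj.extractall('DATA')
shutil.rmtree('DATA/__MACOSX', ignore_errors=True)  # drop macOS archive metadata if present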
df = pd.read_csv('DATA/house_dataset.csv').iloc[:,2:]
df.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | NaN |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | NaN |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | NaN |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal | NaN |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal | NaN |
5 rows × 81 columns
df['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
numeric_features = df.select_dtypes(include=[np.number])
categorical_features = df.select_dtypes(include="object")
print("No. of numerical features:", len(numeric_features.columns))
print("No. of categorical features:", len(categorical_features.columns))
No. of numerical features: 38
No. of categorical features: 43
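# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay draws the same kind of figure
sns.histplot(df['SalePrice'].dropna(), kde=True, stat="density");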
Let’s take a broader look at the 10 features most correlated with the sale price:
k = 10  # number of variables for the heatmap
corrmat = df.select_dtypes(include=[np.number]).corr()  # correlation is only defined for numeric columns
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
f, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(corrmat.loc[cols, cols], vmax=.8, square=True);
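To make the selection step concrete, here is a minimal sketch of picking the top-k correlated columns on a toy DataFrame. The helper name `top_correlated` and the toy column names are made up for illustration; only `select_dtypes`, `corr` and `nlargest` are the pandas calls used in this notebook.

```python
import numpy as np
import pandas as pd


def top_correlated(frame, target, k):
    """Return the k numeric columns most correlated with `target` (target included)."""
    numeric = frame.select_dtypes(include=[np.number])
    corr_with_target = numeric.corr()[target]
    # nlargest ranks by signed correlation, so strongly negative columns are not picked
    return corr_with_target.nlargest(k).index.tolist()


rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "price": x,
    "strong": x + rng.normal(scale=0.1, size=200),  # nearly a copy of price
    "weak": x + rng.normal(scale=5.0, size=200),    # mostly noise
    "noise": rng.normal(size=200),                  # unrelated
    "label": ["a", "b"] * 100,                      # non-numeric, ignored
})

top2 = top_correlated(toy, "price", 2)
print(top2)
```

Note that the target column always comes first, because its correlation with itself is 1. Ranking by absolute value instead would also surface strongly negative relationships.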
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath','GarageArea','1stFlrSF']
sns.pairplot(df[cols])
plt.show()
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
| | Total | Percent |
---|---|---|
PoolQC | 2909 | 0.996574 |
MiscFeature | 2814 | 0.964029 |
Alley | 2721 | 0.932169 |
Fence | 2348 | 0.804385 |
SalePrice | 1459 | 0.499829 |
FireplaceQu | 1420 | 0.486468 |
LotFrontage | 486 | 0.166495 |
GarageQual | 159 | 0.054471 |
GarageYrBlt | 159 | 0.054471 |
GarageFinish | 159 | 0.054471 |
GarageCond | 159 | 0.054471 |
GarageType | 157 | 0.053786 |
BsmtExposure | 82 | 0.028092 |
BsmtCond | 82 | 0.028092 |
BsmtQual | 81 | 0.027749 |
BsmtFinType2 | 80 | 0.027407 |
BsmtFinType1 | 79 | 0.027064 |
MasVnrType | 24 | 0.008222 |
MasVnrArea | 23 | 0.007879 |
MSZoning | 4 | 0.001370 |
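The same missing-value ratio can be computed in one step with `isnull().mean()`. The sketch below shows this on a toy DataFrame and flags columns above a cut-off; the 0.5 threshold and the column names are assumptions for illustration, not a rule used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share of missing entries
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.5  # hypothetical cut-off, chosen only for this example
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()
print(missing_fraction)
print(sparse_columns)
```

In the table above, `SalePrice` is missing for about half of the rows. This is consistent with the file containing rows without a known price, as the `NaN` values in `df.head()` suggest, so it should not be read as a data-quality problem of the same kind as `PoolQC`.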
clean_data=df[cols]
clean_data.quantile([0.0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1.0])
| | SalePrice | OverallQual | GrLivArea | GarageCars | TotalBsmtSF | FullBath | GarageArea | 1stFlrSF |
---|---|---|---|---|---|---|---|---|
0.00 | 34900.00 | 1.0 | 334.00 | 0.0 | 0.00 | 0.0 | 0.00 | 334.00 |
0.01 | 61815.97 | 3.0 | 675.42 | 0.0 | 0.00 | 1.0 | 0.00 | 520.00 |
0.05 | 88000.00 | 4.0 | 861.00 | 0.0 | 455.25 | 1.0 | 0.00 | 665.90 |
0.10 | 106475.00 | 5.0 | 923.80 | 1.0 | 600.00 | 1.0 | 240.00 | 744.80 |
0.25 | 129975.00 | 5.0 | 1126.00 | 1.0 | 793.00 | 1.0 | 320.00 | 876.00 |
0.50 | 163000.00 | 6.0 | 1444.00 | 2.0 | 989.50 | 2.0 | 480.00 | 1082.00 |
0.75 | 214000.00 | 7.0 | 1743.50 | 2.0 | 1302.00 | 2.0 | 576.00 | 1387.50 |
0.90 | 278000.00 | 8.0 | 2153.20 | 3.0 | 1614.00 | 2.0 | 758.00 | 1675.00 |
0.95 | 326100.00 | 8.0 | 2464.20 | 3.0 | 1776.15 | 2.0 | 856.15 | 1830.10 |
0.99 | 442567.01 | 10.0 | 2935.72 | 3.0 | 2198.30 | 3.0 | 1019.49 | 2288.02 |
1.00 | 755000.00 | 10.0 | 5642.00 | 5.0 | 6110.00 | 4.0 | 1488.00 | 5095.00 |
low = .01
high = .99
quant_df = clean_data.quantile([low, high])
quant_df.head()
| | SalePrice | OverallQual | GrLivArea | GarageCars | TotalBsmtSF | FullBath | GarageArea | 1stFlrSF |
---|---|---|---|---|---|---|---|---|
0.01 | 61815.97 | 3.0 | 675.42 | 0.0 | 0.0 | 1.0 | 0.00 | 520.00 |
0.99 | 442567.01 | 10.0 | 2935.72 | 3.0 | 2198.3 | 3.0 | 1019.49 | 2288.02 |
clean_data = clean_data.loc[
    (clean_data["GrLivArea"] < quant_df.loc[high, "GrLivArea"])
    & (clean_data["TotalBsmtSF"] > quant_df.loc[low, "TotalBsmtSF"])
    & (clean_data["TotalBsmtSF"] < quant_df.loc[high, "TotalBsmtSF"])
    & (clean_data["GarageArea"] > quant_df.loc[low, "GarageArea"])
    & (clean_data["GarageArea"] < quant_df.loc[high, "GarageArea"])
    & (clean_data["1stFlrSF"] < quant_df.loc[high, "1stFlrSF"])
    & (clean_data["SalePrice"] > quant_df.loc[low, "SalePrice"])
    & (clean_data["SalePrice"] < quant_df.loc[high, "SalePrice"])
]
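The filter above can be generalized into a small helper. This is only a sketch under one assumption: it applies both the lower and the upper bound to every listed column, whereas the filter above applies only the upper bound to `GrLivArea` and `1stFlrSF`. The name `trim_between_quantiles` is made up for this example.

```python
import pandas as pd


def trim_between_quantiles(frame, columns, low=0.01, high=0.99):
    """Keep rows whose value in every listed column lies strictly between the two quantiles."""
    bounds = frame[columns].quantile([low, high])
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        mask &= (frame[col] > bounds.loc[low, col]) & (frame[col] < bounds.loc[high, col])
    return frame.loc[mask]


toy = pd.DataFrame({"area": list(range(1, 101)), "price": list(range(100, 0, -1))})
trimmed = trim_between_quantiles(toy, ["area", "price"])
print(len(trimmed), trimmed["area"].min(), trimmed["area"].max())
```

Because the comparison is strict, a column whose 1% quantile is 0 loses every row with value 0. In the quantile table above this is the case for `TotalBsmtSF` and `GarageArea`, so houses without a basement or a garage are removed by the filter. Rows with a missing `SalePrice` are removed as well, since comparisons with `NaN` are always false.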
X = clean_data.loc[:, ["OverallQual","GrLivArea","GarageCars","TotalBsmtSF","FullBath","GarageArea","1stFlrSF"]]
y = np.log(clean_data["SalePrice"])
plt.figure(figsize=(30,10))
sns.heatmap(clean_data.corr(), linewidth=1, annot=True)
plt.show()
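We will do a little preprocessing on our data by standardizing each feature with the following formula:
$$x' = \frac{x - \mu}{\sigma}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of that feature.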
X = (X - X.mean()) / X.std()
X = np.c_[np.ones(X.shape[0]), X]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 50)
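As a quick sanity check of the preprocessing step, the sketch below standardizes a toy DataFrame the same way and confirms that every column ends up with mean 0 and standard deviation 1. The toy values and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 150.0],
    "rooms": [2.0, 3.0, 4.0, 5.0],
})

# Same formula as above; pandas .std() uses the sample standard deviation (ddof=1)
standardized = (toy - toy.mean()) / toy.std()

# Prepend a column of ones, as done with np.c_ above
design = np.c_[np.ones(standardized.shape[0]), standardized]

print(standardized.mean().round(12).tolist())
print(standardized.std().tolist())
print(design.shape)
```

One consequence worth knowing: `LinearRegression` fits its own intercept by default, so the extra column of ones is redundant for scikit-learn. That is why the first entry of `lm.coef_` in the output below is 0.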
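Simple linear regression uses the traditional slope-intercept form, where $m$ (the slope) and $b$ (the intercept) are the parameters the model learns, $x$ is the input, and $y$ is the prediction:
$$y = mx + b$$
A more complex, multi-variable linear equation might look like this, where $w$ represents the coefficients, or weights, our model will try to learn:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
The variables $x_1, \dots, x_n$ are the features of each observation (here, the seven standardized predictors in `X`), and $w_0$ is the intercept.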
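The multi-variable form described above is just a weighted sum of the features plus an intercept. The following sketch, on synthetic noise-free data, shows that ordinary least squares recovers the weights that generated the targets. All names and numbers here are invented for the example; it uses `np.linalg.lstsq` rather than scikit-learn to keep the arithmetic visible.

```python
import numpy as np


def predict(features, weights, bias):
    """Weighted sum of the features plus an intercept."""
    return features @ weights + bias


rng = np.random.default_rng(1)
features = rng.normal(size=(50, 3))
true_weights = np.array([2.0, -1.0, 0.5])  # made-up weights
true_bias = 4.0                            # made-up intercept
targets = predict(features, true_weights, true_bias)

# Least squares on [1, x1, x2, x3]: the first coefficient plays the role of the intercept
design = np.c_[np.ones(len(features)), features]
solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
learned_bias, learned_weights = solution[0], solution[1:]

print(learned_bias, learned_weights)
```

With real data the targets contain noise, so the learned weights only approximate the underlying relationship.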
lm = LinearRegression()
lm.fit(x_train, y_train)
print("Coefficients: ", lm.coef_)
result = lm.predict(x_test)
Coefficients: [ 0. 0.13065804 0.11642741 0.03006242 0.07691202 0.00802758
0.03020288 -0.00485464]
plt.scatter(y_test, result)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.show()
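A scatter plot gives a visual impression of the fit; a couple of numbers can complement it. The sketch below computes R² and RMSE with plain NumPy on made-up values. Since the target in this notebook is the logarithm of the price, it also shows converting back with `np.exp` to get an error in price units. The helper names and the sample prices are assumptions for illustration.

```python
import numpy as np


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    residual = np.sum((actual - predicted) ** 2)
    total = np.sum((actual - np.mean(actual)) ** 2)
    return 1.0 - residual / total


def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up log-prices and predictions that are off by 0.05 on the log scale
log_actual = np.log(np.array([100000.0, 150000.0, 200000.0, 250000.0]))
log_predicted = log_actual + np.array([0.05, -0.05, 0.05, -0.05])

score = r_squared(log_actual, log_predicted)
log_error = rmse(log_actual, log_predicted)
price_error = rmse(np.exp(log_actual), np.exp(log_predicted))  # back on the price scale

print(score, log_error, price_error)
```

With the notebook's variables, the same helpers could be called as `r_squared(y_test, result)` and `rmse(np.exp(y_test), np.exp(result))`.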
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.797
Model: OLS Adj. R-squared: 0.796
Method: Least Squares F-statistic: 725.0
Date: Wed, 16 Dec 2020 Prob (F-statistic): 0.00
Time: 01:25:48 Log-Likelihood: 590.35
No. Observations: 1298 AIC: -1165.
Df Residuals: 1290 BIC: -1123.
Df Model: 7
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 12.0472 0.004 2818.002 0.000 12.039 12.056
x1 0.1430 0.006 23.817 0.000 0.131 0.155
x2 0.1062 0.006 17.373 0.000 0.094 0.118
x3 0.0293 0.008 3.504 0.000 0.013 0.046
x4 0.0752 0.010 7.718 0.000 0.056 0.094
x5 0.0077 0.006 1.320 0.187 -0.004 0.019
x6 0.0270 0.008 3.381 0.001 0.011 0.043
x7 -0.0005 0.010 -0.054 0.957 -0.020 0.019
==============================================================================
Omnibus: 205.633 Durbin-Watson: 2.029
Prob(Omnibus): 0.000 Jarque-Bera (JB): 515.814
Skew: -0.857 Prob(JB): 9.82e-113
Kurtosis: 5.570 Cond. No. 6.30
==============================================================================
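To read the coefficients in the summary, recall that the target is the logarithm of the price and the features are standardized. A coefficient $w$ therefore means that a one standard deviation increase in the feature multiplies the predicted price by $e^{w}$. The sketch below converts two coefficients from the table into percentage changes. The mapping of `x1` to `OverallQual` and `x7` to `1stFlrSF` is inferred from the column order of `X` and is an assumption.

```python
import math


def percent_change_per_std(coefficient):
    """Percentage change in predicted price for a one-std increase in a standardized feature."""
    return (math.exp(coefficient) - 1.0) * 100.0


# Coefficients copied from the OLS summary; the feature names are inferred from the column order of X
overall_qual = percent_change_per_std(0.1430)   # x1, assumed to be OverallQual
first_floor = percent_change_per_std(-0.0005)   # x7, assumed to be 1stFlrSF

print(round(overall_qual, 2), round(first_floor, 3))
```

The second value is close to zero, which matches its p-value of 0.957 in the summary: once the other features are in the model, `x7` adds almost nothing. The same holds for `x5` (p = 0.187).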