
Commit 11ca594

Created using Colaboratory
1 parent fc07942 commit 11ca594

File tree

1 file changed

+121
-28
lines changed


07_Simple_and_Multiple_Regression.ipynb

Lines changed: 121 additions & 28 deletions
@@ -6,7 +6,7 @@
66
"name": "07 Simple and Multiple Regression.ipynb",
77
"provenance": [],
88
"collapsed_sections": [],
9-
"authorship_tag": "ABX9TyP3daddsjDhXZ17CoSmuNyw",
9+
"authorship_tag": "ABX9TyOAJPmXXYrhwvtGuE3Xfsil",
1010
"include_colab_link": true
1111
},
1212
"kernelspec": {
@@ -25,14 +25,109 @@
2525
"<a href=\"https://colab.research.google.com/github/sandipanpaul21/Machine-Learning-in-Python-Code/blob/master/07_Simple_and_Multiple_Regression.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
2626
]
2727
},
28+
{
29+
"cell_type": "markdown",
30+
"metadata": {
31+
"id": "O1w94qbJnRld"
32+
},
33+
"source": [
34+
"**LINEAR REGRESSION**\n",
35+
"\n",
36+
"Linear regression is studied as a model for understanding the relationship between input and output numerical variables.\n",
37+
"\n",
38+
"**Linear regression is a linear model**, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, that y can be calculated from a linear combination of the input variables (x).\n",
39+
"\n",
40+
"When there is a single input variable (x), the method is referred to as **simple linear regression**. When there are multiple input variables, literature from statistics often refers to the method as **multiple linear regression**\n",
41+
"\n",
42+
"Different techniques can be used to prepare or train the linear regression equation from data, the most common of which is called **Ordinary Least Squares**. It is common to therefore refer to a model prepared this way as **Ordinary Least Squares Linear Regression** or just **Least Squares Regression**."
43+
]
44+
},
45+
{
46+
"cell_type": "markdown",
47+
"metadata": {
48+
"id": "a7ibydXynwkK"
49+
},
50+
"source": [
51+
"The representation is a linear equation that combines a specific set of **input values (x)** the solution to which is the predicted output for that set of **output values (y)**. As such, both the input values (x) and the output value are numeric.\n",
52+
"\n",
53+
"The linear equation assigns one scale factor to each input value or column, called a **coefficient** and represented by the capital Greek letter **Beta (B)**. One additional coefficient is also added, giving the line an additional degree of freedom (e.g. moving up and down on a two-dimensional plot) and is often called the **intercept or the bias coefficient.**\n",
54+
"\n",
55+
"For example, in a simple regression problem (a single x and a single y), the form of the model would be:\n",
56+
"\n",
57+
"y = B0 + B1 * x1\n",
58+
"\n",
59+
"In higher dimensions when we have more than one input (x), the line is called a **plane or a hyper-plane**. The representation therefore is the form of the equation and the specific values used for the coefficients (e.g. B0 and B1 in the above example).\n",
60+
"\n",
61+
"It is common to talk about the complexity of a regression model like linear regression. This refers to the **number of coefficients used in the model.**\n",
62+
"\n",
63+
"When a **coefficient becomes zero,** it effectively removes the influence of the input variable on the model and therefore from the prediction made from the model (0 * x = 0). This becomes relevant if you look at **regularization methods** that change the learning algorithm to reduce the complexity of regression models by putting pressure on the absolute size of the coefficients, driving some to zero."
64+
]
65+
},
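To make the representation above concrete, here is a minimal sketch (an editor's illustration, not code from the notebook) that evaluates y = B0 + B1 * x1 + B2 * x2 for made-up coefficient values, and shows how a zero coefficient removes an input's influence:

```python
import numpy as np

# Hypothetical coefficients: intercept (bias) B0 plus one coefficient per input.
B0 = 1.5                    # intercept / bias coefficient
B = np.array([0.8, 0.0])    # B1, B2 -- B2 = 0, so x2 has no influence on the prediction

# Two made-up observations, each with two input values (x1, x2).
X = np.array([[2.0, 10.0],
              [4.0, -3.0]])

# The prediction is a linear combination of the inputs plus the intercept.
y_pred = B0 + X @ B
print(y_pred)  # [3.1 4.7] -- unchanged whatever x2 is, because B2 = 0
```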
66+
{
67+
"cell_type": "markdown",
68+
"metadata": {
69+
"id": "R7lnhXAzoleK"
70+
},
71+
"source": [
72+
"**Simple Linear Regression**\n",
73+
"\n",
74+
"With simple linear regression when we have a *single input*, we can use statistics to estimate the coefficients\n",
75+
"\n",
76+
"**Ordinary Least Squares**\n",
77+
" - When we have *more than one input* we can use Ordinary Least Squares to estimate the values of the coefficients.\n",
78+
" - The *Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals.* This means that given a regression line through the data we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together. This is the quantity that ordinary least squares seeks to minimize.\n",
79+
"\n",
80+
"**Regularization**\n",
81+
" - There are extensions of the training of the linear model called regularization methods. \n",
82+
" - These seek to *both minimize the sum of the squared error of the model on the training data (using ordinary least squares) but also to reduce the complexity of the model (like the number or absolute size of the sum of all coefficients in the model).*\n",
83+
" - Two popular examples of regularization procedures for linear regression are:\n",
84+
" 1. **Lasso Regression:** where Ordinary Least Squares is modified to also minimize the absolute sum of the coefficients (called L1 regularization).\n",
85+
" 2. **Ridge Regression:** where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).\n",
86+
" - *These methods are effective to use when there is collinearity in your input values and ordinary least squares would overfit the training data.*\n",
87+
"\n",
88+
"\n",
89+
"**Data Preparation for Linear Regression Model**\n",
90+
"1. **Linear Assumption.** \n",
91+
" - Linear regression assumes that the relationship between your input and output is linear. It does not support anything else. \n",
92+
" - This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).\n",
93+
"2. **Remove Noise.**\n",
94+
" - Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. \n",
95+
" - This is most important for the output variable and you want to remove outliers in the output variable (y) if possible.\n",
96+
"3. **Remove Collinearity.** \n",
97+
" - Linear regression will over-fit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and removing the most correlated.\n",
98+
"4. **Gaussian Distributions.** \n",
99+
"- Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. \n",
100+
"- You may get some benefit using transforms (e.g. log or BoxCox) on you variables to make their distribution more Gaussian looking.\n",
101+
"5. **Rescale Inputs:** \n",
102+
" - Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization."
103+
]
104+
},
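The estimation and regularization ideas above can be sketched in a few lines. The snippet below is an illustrative sketch on synthetic data (not part of the notebook, and the alpha values are arbitrary): it estimates the simple-regression coefficients with the usual closed-form statistics, then shows how Ridge (L2) and Lasso (L1) shrink the coefficients that plain OLS would fit.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# --- Simple linear regression: estimate B0 and B1 with basic statistics ---
x = rng.normal(size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=100)   # true B0 = 3, B1 = 2, plus noise

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"B0 ~ {b0:.2f}, B1 ~ {b1:.2f}")   # should land close to 3 and 2

# --- Regularization: Ridge (L2) and Lasso (L1) shrink the coefficients ---
# Five inputs, but only the first two actually drive the output.
X = rng.normal(size=(100, 5))
y_multi = 1.0 + 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y_multi)
ridge = Ridge(alpha=10.0).fit(X, y_multi)   # alpha values chosen arbitrarily
lasso = Lasso(alpha=0.5).fit(X, y_multi)    # L1 tends to push small coefficients to exactly 0

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))
```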
105+
{
106+
"cell_type": "markdown",
107+
"metadata": {
108+
"id": "FUOE9uSRrEca"
109+
},
110+
"source": [
111+
"**Assumptions of Linear Regression**\n",
112+
"\n",
113+
"Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:\n",
114+
"1. Linear relationship\n",
115+
"2. Multivariate normality\n",
116+
"3. No or little multicollinearity\n",
117+
"4. No auto-correlation\n",
118+
"5. Homoscedasticity\n",
119+
"\n",
120+
"A note about sample size. In Linear regression the sample size rule of thumb is that the regression analysis requires at least 20 cases per independent variable in the analysis."
121+
]
122+
},
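Two of these assumptions are commonly checked in code. The sketch below (an editor's illustration on synthetic data, not from the notebook) uses variance inflation factors to flag multicollinearity and the Durbin-Watson statistic to flag autocorrelation in the residuals.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)

# Synthetic predictors: x3 is almost a copy of x1, so multicollinearity is built in.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = 2.0 + 1.5 * x1 - 1.0 * x2 + rng.normal(size=200)

# Multicollinearity check: a VIF above roughly 10 is a common warning threshold.
X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, "VIF =", round(variance_inflation_factor(X_const.values, i), 1))

# Autocorrelation check: a Durbin-Watson statistic near 2 means little autocorrelation.
residuals = sm.OLS(y, X_const).fit().resid
print("Durbin-Watson:", round(durbin_watson(residuals), 2))
```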
28123
{
29124
"cell_type": "code",
30125
"metadata": {
31126
"id": "VGe8PLvZqOpy",
32127
"colab": {
33128
"base_uri": "https://localhost:8080/"
34129
},
35-
"outputId": "fbd45f76-d99a-4595-9d33-f4f1b66bcacc"
130+
"outputId": "6e549da6-7c08-40df-8d75-5759a479ae7a"
36131
},
37132
"source": [
38133
"# Libraries \n",
@@ -68,7 +163,7 @@
68163
"base_uri": "https://localhost:8080/",
69164
"height": 204
70165
},
71-
"outputId": "9678cb52-bc86-4dab-b4f7-746dcb0074a6"
166+
"outputId": "d26013be-da7a-4fcb-9edc-b79aeeb7e58e"
72167
},
73168
"source": [
74169
"# Boston Dataset for Regression\n",
@@ -231,7 +326,7 @@
231326
"colab": {
232327
"base_uri": "https://localhost:8080/"
233328
},
234-
"outputId": "f29260b9-a025-41dc-c064-2159b10c02f0"
329+
"outputId": "baeb118a-b574-4953-8911-1149cb253f64"
235330
},
236331
"source": [
237332
"# Dataset overall Information\n",
@@ -275,7 +370,7 @@
275370
"colab": {
276371
"base_uri": "https://localhost:8080/"
277372
},
278-
"outputId": "88107e5a-48bc-4dbf-d7bd-76565c9507be"
373+
"outputId": "0034a431-4553-456d-f8c9-a9757d35fa19"
279374
},
280375
"source": [
281376
"# Let set the BASE MODEL on which we will improve\n",
@@ -312,7 +407,7 @@
312407
"base_uri": "https://localhost:8080/"
313408
},
314409
"id": "SrUy95nr3NOo",
315-
"outputId": "cf4d6d11-dc54-4d18-e303-5875b45141ee"
410+
"outputId": "51914598-0e9b-442b-ddc3-d8d59079aae2"
316411
},
317412
"source": [
318413
"# Add constant to the dataframe\n",
@@ -343,7 +438,7 @@
343438
"base_uri": "https://localhost:8080/",
344439
"height": 709
345440
},
346-
"outputId": "58797f03-78ba-4194-e02a-50193118017f"
441+
"outputId": "84174b49-11de-4af1-d88a-3a49a3efb08d"
347442
},
348443
"source": [
349444
"# Base Model\n",
@@ -371,10 +466,10 @@
371466
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 89.01</td> \n",
372467
"</tr>\n",
373468
"<tr>\n",
374-
" <th>Date:</th> <td>Tue, 23 Nov 2021</td> <th> Prob (F-statistic):</th> <td>4.90e-115</td>\n",
469+
" <th>Date:</th> <td>Tue, 07 Dec 2021</td> <th> Prob (F-statistic):</th> <td>4.90e-115</td>\n",
375470
"</tr>\n",
376471
"<tr>\n",
377-
" <th>Time:</th> <td>14:19:50</td> <th> Log-Likelihood: </th> <td> -1548.6</td> \n",
472+
" <th>Time:</th> <td>06:55:54</td> <th> Log-Likelihood: </th> <td> -1548.6</td> \n",
378473
"</tr>\n",
379474
"<tr>\n",
380475
" <th>No. Observations:</th> <td> 506</td> <th> AIC: </th> <td> 3123.</td> \n",
@@ -456,8 +551,8 @@
456551
"Dep. Variable: HOUSEPRICE R-squared: 0.684\n",
457552
"Model: OLS Adj. R-squared: 0.677\n",
458553
"Method: Least Squares F-statistic: 89.01\n",
459-
"Date: Tue, 23 Nov 2021 Prob (F-statistic): 4.90e-115\n",
460-
"Time: 14:19:50 Log-Likelihood: -1548.6\n",
554+
"Date: Tue, 07 Dec 2021 Prob (F-statistic): 4.90e-115\n",
555+
"Time: 06:55:54 Log-Likelihood: -1548.6\n",
461556
"No. Observations: 506 AIC: 3123.\n",
462557
"Df Residuals: 493 BIC: 3178.\n",
463558
"Df Model: 12 \n",
@@ -506,39 +601,37 @@
506601
"**OUTPUT EXPLAINATION**\n",
507602
"\n",
508603
"**Omnibus/Prob(Omnibus)**\n",
509-
"- Omnibus describes the normalcy of the distribution of our residuals using skew and kurtosis as measurements. A 0 would indicate perfect normalcy. \n",
510-
"- Prob(Omnibus) is a statistical test measuring the probability the residuals are normally distributed. 1 would indicate perfectly normal distribution. \n",
511-
"- Skew is a measurement of symmetry in our data, with 0 being perfect symmetry. \n",
512-
"- Kurtosis measures the peakiness of our data, or its concentration around 0 in a normal curve. Higher kurtosis implies fewer outliers.\n",
513-
"- The Prob(Omnibus) indicates the probability that the residuals are normally distributed. \n",
514-
"- We hope to see something close to 1 here. \n",
604+
"- **Prob(Omnibus)** is a statistical test measuring the *probability the residuals are normally distributed. 1 would indicate perfectly normal distribution.* \n",
605+
"- **Omnibus** describes the normalcy of the *distribution of our residuals* using skew and kurtosis as measurements. A *0 would indicate perfect normalcy.* \n",
606+
"- The **Prob(Omnibus)** indicates the probability that the residuals are normally distributed. We hope to see something close to 1 here. \n",
515607
"- In this case Omnibus = 267 (way higher than 1) and Prob(Omnibus) = 0 which (normally = 1)is low. So the data is not normal, not ideal. \n",
516608
"\n",
517-
"**Skew -** \n",
518-
"- Measure of data symmetry.\n",
609+
"**Skew** \n",
610+
"- Skew is a measurement of symmetry in our data, with 0 being perfect symmetry. \n",
519611
"- We want to see something close to zero, indicating the residual distribution is normal. \n",
520612
"- Note that this value also drives the Omnibus\n",
521613
"- In this case, Skewness = 2.1, way higher than 0 so error/residual is skewwed\n",
522614
"\n",
523-
"**Kurtosis -** \n",
524-
"- Measure of \"peakiness\", or curvature of the data. \n",
525-
"- Kurtosis of the normal distribution is 3.0.\n",
615+
"**Kurtosis** \n",
616+
"- Kurtosis measures the *peakiness of our data.*\n",
617+
"- *Kurtosis of the normal distribution is 3.0*\n",
526618
"- In this case, Kurtosis = 13.14 which is way too higher\n",
527619
"\n",
528-
"**Durbin-Watson -**\n",
529-
"- Durbin-Watson is a measurement of homoscedasticity, or an even distribution of errors throughout our data. - Heteroscedasticity would imply an uneven distribution, for example as the data point grows higher the relative error grows higher. Ideal homoscedasticity will lie between 1 and 2. \n",
620+
"**Durbin-Watson**\n",
621+
"- Durbin-Watson is a *measurement of homoscedasticity, or an even distribution of errors throughout our data.* \n",
622+
"- Heteroscedasticity would imply an uneven distribution, for example as the data point grows higher the relative error grows higher. *Ideal homoscedasticity will lie between 1 and 2.* \n",
530623
"- In this case, Durbin-Watson = 0.78 is close, but within limits.\n",
531624
"\n",
532625
"**Jarque-Bera (JB)/Prob(JB) -** \n",
533-
"- Jarque-Bera (JB) and Prob(JB) are alternate methods of measuring the same value as Omnibus and Prob(Omnibus) using skewness and kurtosis. We use these values to confirm each other\n",
626+
"- Jarque-Bera (JB) and Prob(JB) are *alternate methods of measuring the same value as Omnibus and Prob(Omnibus)* using skewness and kurtosis. We use these values to confirm each other\n",
534627
"- It is like the Omnibus test in that it tests both skew and kurtosis. \n",
535628
"- It is also performed for the distribution analysis of the regression errors.\n",
536-
"- A large value of JB test indicates that the errors are not normally distributed.\n",
629+
"- A *large value of JB test indicates that the errors are not normally distributed.*\n",
537630
"- In this case, JB = 2542 which is way too higher so error are not normally distributed\n",
538631
"\n",
539632
"**Condition Number -**\n",
540633
"- Condition number is a measurement of the sensitivity of our model as compared to the size of changes in the data it is analyzing. \n",
541-
"- Multicollinearity is strongly implied by a high condition number. \n",
634+
"- *Multicollinearity is strongly implied by a high condition number.* \n",
542635
"- Multicollinearity a term to describe two or more independent variables that are strongly related to each other and are falsely affecting our predicted variable by redundancy.\n",
543636
"- When we have multicollinearity, we can expect much higher fluctuations to small changes in the data hence, we hope to see a relatively small number, something below 30. \n",
544637
"- In this case, Condition Number = well above 30, so multicollinearity present\n",
@@ -547,7 +640,7 @@
547640
"- R Squared is the measurement of how much of the independent variable is explained by changed in our dependent variable\n",
548641
"- In percentage terms, 0.68 would mean our model explains 68% of the dependent variable\n",
549642
"- Adjusted R-squared is important for analyzing multiple dependent variables’ efficacy on the model. \n",
550-
"- Linear regression has the quality that your model’s R-squared value will never go down with additional variables, only equal or higher. \n",
643+
"- R-squared value will never go down with additional variables, only equal or higher. \n",
551644
"- Therefore, your model could look more accurate with multiple variables even if they are poorly contributing. \n",
552645
"- The adjusted R-squared penalizes the R-squared formula based on the number of variables, therefore a lower adjusted score may be telling you some variables are not contributing to your model’s R-squared properly.\n",
553646
"- Both measures model performance and Possible values range from 0.0 to 1.0. \n",
