This project studies regression on data whose underlying function varies from linear to highly nonlinear, using methods of different flexibility. From class, we learned that the expected test mean-squared error can be decomposed as
$$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\epsilon),$$
where $\hat{f}$ is the fitted model, $(x_0, y_0)$ is a test observation, and $\mathrm{Var}(\epsilon)$ is the irreducible error coming from the noise.
Data Generation: I will be generating data using the following models. Let $Y = f(X) + \epsilon$, where $\epsilon$ is random noise, and consider the following choices for the true function $f$:
- A linear function: $f(x) = ax + b$;
- A quadratic function: $f(x) = ax^2 + bx + c$;
- A nonlinear function, e.g. $f(x) = x\sin(x)$ or $f(x) = \frac{1}{1 + 25x^2}$.
NOTE: I will pick a domain and decide how many data points to produce. I will split my data randomly into a training set and a test set so that validation can be performed. For the random noise, I will control the mean and variance, choosing them arbitrarily as I see fit.
Regression Models: I will use three regression methods to fit the data.
- A simple linear regression $Y = \beta_0 + \beta_1X$ $(df = 2)$;
- A smoothing cubic spline with effective degrees of freedom 5;
- A smoothing cubic spline with effective degrees of freedom of approximately 25. (NOTE: for the smoothing splines I will adjust the smoothing parameter to control the level of flexibility, i.e. the effective degrees of freedom.)
Research Objective: I will use my data and methods to investigate the following topics.
- The performance of methods of different flexibility on data whose underlying function ranges from linear to nonlinear.
- How do the $MSE$, $Bias$, and $Var$ vary across methods of different flexibility?
- How does the variance of the noise affect the performance of the different methods (similar to (1), but with a different noise level characterized by its variance)? I will fit the nonlinear data with different noise levels using the three methods, plot the fitted models, and calculate the training $MSE$ and test $MSE$.
To generate data, I will first establish the true population models from which I will simulate observations and then make predictions using the various regression techniques. Here are the three functions I am choosing for this project, which vary widely in complexity:
Linear Model: $f(x) = ax + b$
Quadratic Model: $f(x) = ax^2 + bx + c$
Non-linear Model: a more complex function such as $f(x) = x\sin(x)$
The reason I am using multiple models of varying complexity is that I want to observe exactly how our regression methods react to different patterns in the data. Depending on its flexibility and bias, a spline regression with a high degree of freedom, for example, might react much more strongly to a complex non-linear function than a simple linear regression would.
We fix our noise to be Gaussian, with a mean and variance that we control as described above.
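To make the setup concrete, here is a minimal data-generation sketch. The coefficients, the domain $[0, 10]$, the sample size, and the noise parameters are illustrative assumptions rather than the exact values used in the project, and $x\sin(x)$ stands in for the nonlinear choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# General forms of the three candidate true functions; the coefficients
# below are placeholders chosen for illustration only.
def f_linear(x, a=2.0, b=1.0):
    return a * x + b

def f_quadratic(x, a=1.0, b=-2.0, c=0.5):
    return a * x ** 2 + b * x + c

def f_nonlinear(x):
    return x * np.sin(x)

def generate_data(f, n=100, lo=0.0, hi=10.0, noise_mean=0.0, noise_sd=1.0):
    """Sample n x-values on [lo, hi] and add Gaussian noise to f(x).
    The domain, sample size, and noise parameters are assumptions."""
    x = np.sort(rng.uniform(lo, hi, n))  # sorted x simplifies spline fitting later
    y = f(x) + rng.normal(noise_mean, noise_sd, n)
    return x, y

x, y = generate_data(f_nonlinear)
```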
Now that we have our data, let's evaluate how the various regression methods perform when fitting it. As a reminder, I will use the following methods to evaluate which one fits the data best and which one fits the true model best:
- A simple linear regression $Y = \beta_0 + \beta_1X$ $(df = 2)$;
- A smoothing cubic spline with effective degrees of freedom 5;
- A smoothing cubic spline with effective degrees of freedom of approximately 25. (NOTE: for the smoothing splines I will adjust the smoothing parameter to control the level of flexibility, i.e. the effective degrees of freedom.)
We have our data, but we now need to split it into a training dataset and a testing dataset. I am following this approach because it enables me to both fit a model on the training data and validate its performance on testing data it has never seen.
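A minimal sketch of the random split, assuming a 70/30 ratio (the actual proportion is not stated in the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data in the same form as the generation sketch above.
x = np.sort(rng.uniform(0.0, 10.0, 100))
y = x * np.sin(x) + rng.normal(0.0, 1.0, 100)

# Random 70/30 train/test split (the split fraction is an assumption).
test_idx = rng.choice(len(x), size=int(0.3 * len(x)), replace=False)
is_test = np.zeros(len(x), dtype=bool)
is_test[test_idx] = True
x_train, y_train = x[~is_test], y[~is_test]
x_test, y_test = x[is_test], y[is_test]
```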
Now, I will calculate the smoothness values that correspond to degrees of freedom 2, 5, and 25 respectively. Degrees of freedom are related to the number of piecewise segments used to build the cubic spline; roughly, a spline with 5 degrees of freedom is made up of about 4 separate piecewise segments. For the smoothing splines used here, the effective degrees of freedom are controlled by the smoothing parameter rather than an explicit knot count. Note that as the degrees of freedom increase, the flexibility of our model increases.
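The text does not show how the smoothness values are found, so here is one possible sketch: for a fixed smoothing parameter, csaps is a linear smoother, so its effective degrees of freedom equal the trace of the smoother matrix, which can be estimated by smoothing each standard basis vector; a bisection search then finds the smooth value matching a target df. The helper names (effective_df, smooth_for_df) are hypothetical, and x is assumed to be sorted and unique.

```python
import numpy as np
from csaps import csaps

def effective_df(x, smooth):
    """Effective degrees of freedom of the csaps smoother: the trace of the
    smoother matrix S, estimated by smoothing each standard basis vector
    and accumulating the diagonal entries of S."""
    n = len(x)
    trace = 0.0
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        trace += csaps(x, e, x, smooth=smooth)[j]
    return trace

def smooth_for_df(x, target_df, tol=1e-2):
    """Bisection search for the smoothing parameter whose effective df is
    approximately target_df (df grows monotonically with smooth on [0, 1])."""
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-10:
        mid = 0.5 * (lo + hi)
        df = effective_df(x, mid)
        if abs(df - target_df) < tol:
            return mid
        if df < target_df:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```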
Now that we have the smoothness values associated with our target degrees of freedom, we can input our data into the csaps() function to obtain predicted y values for each corresponding model.
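A sketch of the fitting step, reusing the hypothetical smooth_for_df() helper above; the training data and evaluation grid are illustrative.

```python
import numpy as np
from csaps import csaps

# Illustrative training data and evaluation grid.
rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0.0, 10.0, 70))
y_train = x_train * np.sin(x_train) + rng.normal(0.0, 1.0, 70)
x_grid = np.linspace(x_train.min(), x_train.max(), 200)

# Predicted y values on the grid for each target flexibility level.
fits = {}
for df in (2, 5, 25):
    smooth = smooth_for_df(x_train, target_df=df)
    fits[df] = csaps(x_train, y_train, x_grid, smooth=smooth)
```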
Now, I will use matplotlib to plot the true function, the training and testing data, and the predicted curves from each of the three models.
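A plotting sketch continuing from the objects defined above (x_train, y_train, x_grid, fits); the line styles are arbitrary, except that the most flexible fit is drawn as a red dotted line to match the discussion below.

```python
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.plot(x_grid, x_grid * np.sin(x_grid), "k-", label="true function")
plt.scatter(x_train, y_train, s=15, alpha=0.5, label="training data")
plt.plot(x_grid, fits[2], "b--", label="df = 2")
plt.plot(x_grid, fits[5], "g-.", label="df = 5")
plt.plot(x_grid, fits[25], "r:", label="df ≈ 25")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Smoothing spline fits of varying flexibility")
plt.legend()
plt.show()
```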
In the linear model, overfitting to the training data becomes extremely apparent as the degrees of freedom increase. The true model is linear, and the $df = 2$ prediction, which is itself a straight line, follows the true model almost exactly. However, overfitting shows up in the red dotted line: the model bends and curves to follow slight variations in the training data caused by the Gaussian noise, which pulls it away from the testing data and makes it a poor representation of reality.
In the quadratic model, we see much the same phenomenon, except that the predicted linear model ($df = 2$) for this true quadratic function underfits quite severely.
In the complex non-linear model, the fits with higher degrees of freedom actually resemble reality much more closely than the cubic spline fits with smaller degrees of freedom. The $df = 2$ and $df = 5$ models badly underfit reality here, but the red dotted line tracks the true function much more faithfully.
Observing the plots gives a qualitative picture of how well each method fits; to compare them quantitatively, I will calculate the $MSE$, $Bias$, and $Variance$ of each model. In order to calculate the $Bias$ and $Variance$, we need to fit each model repeatedly on different samples of the data, which requires a resampling procedure.
We will perform Bootstrap Sampling as our form of resampling. Bootstrap Sampling is the repeated random selection of sample data to estimate a population parameter. In layman's terms, Bootstrap Sampling involves drawing a subset of our sample many times with replacement (meaning that an observation selected in one subset can reappear in the same or a later subset), so that we can refit the model on each resampled dataset and study how its predictions vary.
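A minimal bootstrap sketch for one of the spline models. The number of bootstrap samples, the smoothing value, and the data are assumptions; because csaps requires strictly increasing, unique x values, duplicated x's in a resample are collapsed by averaging their y's.

```python
import numpy as np
from csaps import csaps

rng = np.random.default_rng(3)

# Illustrative training/test data in the same form as the earlier sketches.
x_train = np.sort(rng.uniform(0.0, 10.0, 70))
y_train = x_train * np.sin(x_train) + rng.normal(0.0, 1.0, 70)
x_test = np.sort(rng.uniform(0.0, 10.0, 30))
y_test = x_test * np.sin(x_test) + rng.normal(0.0, 1.0, 30)

B = 200          # number of bootstrap samples (an assumption)
smooth = 0.9     # placeholder smoothing parameter for one flexibility level
boot_preds = np.empty((B, len(x_test)))

for b in range(B):
    # Resample the training pairs with replacement.
    idx = rng.integers(0, len(x_train), len(x_train))
    xb, yb = x_train[idx], y_train[idx]
    # Collapse duplicated x values (csaps needs unique, increasing x).
    xb_u, inv = np.unique(xb, return_inverse=True)
    yb_u = np.bincount(inv, weights=yb) / np.bincount(inv)
    boot_preds[b] = csaps(xb_u, yb_u, x_test, smooth=smooth)
```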
The Mean-Squared Error at a test point $x_0$ is estimated as
$$MSE(x_0) = \frac{1}{B}\sum_{b=1}^{B}\big(y_0 - \hat{f}_b(x_0)\big)^2,$$
where $\hat{f}_b$ is the model fitted to the $b$-th bootstrap sample and $B$ is the number of bootstrap samples.
The $Bias$ at a test point is estimated as
$$Bias(x_0) = \bar{\hat{f}}(x_0) - f(x_0),$$
where $\bar{\hat{f}}(x_0) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x_0)$ is the average bootstrap prediction and $f(x_0)$ is the true function value.
The $Var$ at a test point is estimated as
$$Var(x_0) = \frac{1}{B}\sum_{b=1}^{B}\big(\hat{f}_b(x_0) - \bar{\hat{f}}(x_0)\big)^2,$$
where the spread of the bootstrap predictions around their average measures how much the fit changes from sample to sample.
For the total test dataset, the average of these values over the individual test points $x_i$ is reported.
Note that throughout this section, whenever I refer to Bias, I am actually talking about the squared bias $Bias^2$ of the predictions.
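Using the bootstrap predictions from the sketch above (boot_preds, x_test, y_test), the pointwise and aggregate quantities can be computed roughly as follows; the true function $x\sin(x)$ is again an assumption.

```python
import numpy as np

f_test = x_test * np.sin(x_test)        # true f at the test points (assumed form)
mean_pred = boot_preds.mean(axis=0)     # average bootstrap prediction at each x

# Pointwise estimates at each test input.
mse_pointwise = ((boot_preds - y_test) ** 2).mean(axis=0)
bias_sq_pointwise = (mean_pred - f_test) ** 2
var_pointwise = boot_preds.var(axis=0)

# Aggregate over the whole test set by averaging the pointwise values.
test_mse = mse_pointwise.mean()
test_bias_sq = bias_sq_pointwise.mean()
test_var = var_pointwise.mean()
```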
I want the x-axis of the upcoming plots to show flexibility as degrees of freedom rather than as smoothness values, so I will generate a grid of smoothness values and then translate each one into its corresponding degrees of freedom.
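A short sketch of that translation, reusing the hypothetical effective_df() helper from earlier (x_train and the grid endpoints are assumptions):

```python
import numpy as np

# Grid of smoothness values mapped to their effective degrees of freedom,
# so the plots below can use df on the x-axis.
smooth_grid = np.linspace(0.0, 0.999, 30)
df_grid = np.array([effective_df(x_train, s) for s in smooth_grid])
```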
To observe how the MSE, Bias, and Variance of each model change as the degrees of freedom increase, I will plot these quantities against the degrees of freedom below.
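A plotting sketch, assuming hypothetical arrays train_mse, train_bias_sq, and train_var that hold the training-set quantities for each value in df_grid (computed, for example, by repeating the bootstrap procedure above at every smoothness level):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 4))
plt.plot(df_grid, train_mse, label="training MSE")
plt.plot(df_grid, train_bias_sq, label="training Bias$^2$")
plt.plot(df_grid, train_var, label="training Variance")
plt.xlabel("effective degrees of freedom")
plt.ylabel("value")
plt.title("Training error components vs. flexibility")
plt.legend()
plt.show()
```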
In every one of these plots, the training MSE and Bias decrease as the degrees of freedom increase. This makes sense: the higher the flexibility of our model, the lower the Bias and MSE become on the training data. Increasing the degrees of freedom lets the model hug the training data more and more closely, so in each case the fitted curve tracks those data values more tightly.
I do not expect this behavior to carry over to the MSE, Bias, and Variance on our testing data: fitting the training data ever more closely can amount to severe overfitting, which will show up as worse performance on the testing data.
Below, I will plot the testing MSE, Bias, and Variance against the degrees of freedom to check how the behavior changes with this increase in flexibility.
Within the true linear model, the testing $MSE$ and $Variance$ grow as the degrees of freedom increase, which matches the overfitting we saw in the fitted plots: the extra flexibility buys nothing when the truth is linear.
Within the true quadratic model, we see a similar trend here but with one notable difference. The initial $df = 2$ fit has a very large testing $MSE$ and $Bias$, because the straight-line model badly underfits the quadratic shape.
Within the true complex nonlinear model, a similar trend unveils itself as well, but the linear $df = 2$ fit (and, to a lesser extent, the $df = 5$ spline) carries a very large $Bias$ and testing $MSE$, since these inflexible fits severely underfit the nonlinear function.
Now, I want to observe the differences between the Training and Testing $MSE$ for each model as the degrees of freedom change.
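A comparison sketch, assuming hypothetical train_mse and test_mse arrays aligned with df_grid as above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 4))
plt.plot(df_grid, train_mse, label="training MSE")
plt.plot(df_grid, test_mse, label="testing MSE")
plt.xlabel("effective degrees of freedom")
plt.ylabel("MSE")
plt.title("Training vs. testing MSE")
plt.legend()
plt.show()
```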
Within the true linear model, testing $MSE$ sits close to the training $MSE$ at low degrees of freedom but pulls away from it as the degrees of freedom increase, which is the signature of overfitting.
Within the true quadratic model, the same trend occurs for the most part, except it is clear that the lowest degrees of freedom lead to underfitting, because of the absurdly large testing (and training) $MSE$ at $df = 2$.
Within the true non-linear model, both the training and testing $MSE$ are very large for the inflexible fits and drop sharply as the degrees of freedom increase, consistent with the underfitting we saw in the fitted plots.
Specifically on the non-linear model, I am now changing the Gaussian noise parameter to understand how an increase in the noise variance in our data affects the training and testing $MSE$ of each method.
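One way to organize this experiment is sketched below; the noise standard deviations, domain, sample size, and smooth values are all assumptions, and run_noise_experiment is a hypothetical helper.

```python
import numpy as np
from csaps import csaps

rng = np.random.default_rng(4)

def run_noise_experiment(noise_sd, n=100, smooth_levels=(0.0, 0.6, 0.999)):
    """Generate nonlinear data with the given noise level, fit a spline at
    each flexibility level, and return (training MSE, testing MSE) pairs."""
    x = np.sort(rng.uniform(0.0, 10.0, n))
    y = x * np.sin(x) + rng.normal(0.0, noise_sd, n)
    test_idx = rng.choice(n, size=int(0.3 * n), replace=False)
    is_test = np.zeros(n, dtype=bool)
    is_test[test_idx] = True
    results = {}
    for smooth in smooth_levels:
        pred_train = csaps(x[~is_test], y[~is_test], x[~is_test], smooth=smooth)
        pred_test = csaps(x[~is_test], y[~is_test], x[is_test], smooth=smooth)
        results[smooth] = (np.mean((y[~is_test] - pred_train) ** 2),
                           np.mean((y[is_test] - pred_test) ** 2))
    return results

for sd in (0.5, 1.0, 2.0):   # noise standard deviations to compare (assumed)
    print(f"noise sd = {sd}: {run_noise_experiment(sd)}")
```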
With minimal Gaussian noise in the data, we see that the Training and Testing $MSE$ remain small and close to one another, and the flexible fits track the true function well.
As the level of Gaussian noise increases, however, it seems that the testing $MSE$ climbs well above the training $MSE$ for the most flexible spline, which begins to fit the noise itself rather than the underlying function.