This programming exercise instruction was originally developed and written by Prof. Andrew Ng as part of his machine learning course on the Coursera platform. I have adapted the instruction for the R language so that its users, including myself, can also take and benefit from the course.
In this exercise, you will implement logistic regression and apply it to two different datasets. Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics. To get started with the exercise, you will need to download the starter code and unzip its contents to the directory where you wish to complete the exercise. If needed, use the `setwd()` command in R to change to this directory before starting this exercise.
Files included in this exercise:

- `ex2.R` - R script that steps you through the exercise
- `ex2_reg.R` - R script for the later parts of the exercise
- `ex2data1.txt` - Training set for the first half of the exercise
- `ex2data2.txt` - Training set for the second half of the exercise
- `submit.R` - Submission script that sends your solutions to our servers
- `mapFeature.R` - Function to generate polynomial features
- `plotDecisionBoundary.R` - Function to plot classifier's decision boundary
- ⋆ `plotData.R` - Function to plot 2D classification data
- ⋆ `sigmoid.R` - Sigmoid function
- ⋆ `costFunction.R` - Logistic regression cost function
- ⋆ `predict.R` - Logistic regression prediction function
- ⋆ `costFunctionReg.R` - Regularized logistic regression cost function

⋆ indicates files you will need to complete
Throughout the exercise, you will be using the scripts `ex2.R` and `ex2_reg.R`. These scripts set up the dataset for the problems and make calls to functions that you will write. You do not need to modify either of them. You are only required to modify functions in other files, by following the instructions in this assignment.
The exercises in this course use R, a high-level programming language well suited to numerical computation. If you do not have R installed, please download a Windows installer from the R Project website. RStudio is a free and open-source R integrated development environment (IDE) that makes R script development a bit easier than R's own basic GUI. You may start from the `.Rproj` file (an RStudio project file) in each exercise directory. At the R command line, typing `help` followed by a function name displays documentation for that function. For example, `help('plot')` or simply `?plot` will bring up help information for plotting. Further documentation for R functions can be found at the R documentation pages.
In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on two exams and the admissions decision. Your task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams. This outline and the framework code in `ex2.R` will guide you through the exercise.
Before starting to implement any learning algorithm, it is always good to visualize the data if possible. In the first part of `ex2.R`, the code will load the data and display it on a 2-dimensional plot by calling the function `plotData`. You will now complete the code in `plotData` so that it displays a figure like Figure 1, where the axes are the two exam scores, and the positive and negative examples are shown with different markers.
Figure 1: Scatter plot of training data
To help you get more familiar with plotting, we have left `plotData.R` empty so you can try to implement it yourself. However, this is an optional (ungraded) exercise. We also provide our implementation below so you can copy it or refer to it. If you choose to copy our example, make sure you learn what each of its commands is doing by consulting the R documentation.
```r
# pch 3 is a plus sign, pch 21 is a fillable circle
symbols <- c(3, 21)
# convert y (0/1) to a factor so it can index the two point characters
yfac <- factor(y)
plot(X[, 1], X[, 2], pch = symbols[yfac],
     bg = "yellow", lwd = 1.3, xlab = axLables[1], ylab = axLables[2])
# reverse the symbol order so the legend entries match legLabels
legend("topright", legLabels, pch = rev(symbols), pt.bg = "yellow")
```
Before you start with the actual cost function, recall that the logistic regression hypothesis is defined as:

$$h_\theta(x) = g(\theta^T x)$$

where the function $g$ is the sigmoid function. The sigmoid function is defined as:

$$g(z) = \frac{1}{1 + e^{-z}}$$
Your first step is to implement this function in `sigmoid.R` so it can be called by the rest of your program. When you are finished, try testing a few values by calling `sigmoid(x)` at the R command line. For large positive values of x, the sigmoid should be close to 1, while for large negative values, the sigmoid should be close to 0. Evaluating `sigmoid(0)` should give you exactly 0.5. Your code should also work with vectors and matrices. For a matrix, your function should perform the sigmoid function on every element. You can submit your solution for grading by typing `submit()` at the R command line. The submission script will prompt you for your login e-mail and submission token and ask you which files you want to submit. You can obtain a submission token from the web page for the assignment.
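For example, once your vectorized `sigmoid` is in place, a quick sanity check at the command line might look like this:

```r
sigmoid(0)      # exactly 0.5
sigmoid(100)    # very close to 1
sigmoid(-100)   # very close to 0
sigmoid(matrix(c(-1, 0, 1, 2), 2, 2))  # applied element-wise to a matrix
```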
You should now submit your solutions.
Now you will implement the cost function and gradient for logistic regression. Complete the code in `costFunction.R` to return the cost and gradient. Recall that the cost function in logistic regression is

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and the gradient of the cost is a vector of the same length as $\theta$, where the $j$-th element (for $j = 0, 1, \ldots, n$) is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Note that while this gradient looks identical to the linear regression gradient, the formula is actually different because linear and logistic regression have different definitions of $h_\theta(x)$. Once you are done, `ex2.R` will call your `costFunction` using the initial parameters of $\theta$. You should see that the cost is about 0.693.
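As an aside, both quantities can be computed without explicit loops. The following is a minimal vectorized sketch (not necessarily the graded solution), assuming `X` already contains the intercept column and `sigmoid` is your completed function:

```r
m <- length(y)                # number of training examples
h <- sigmoid(X %*% theta)     # hypothesis evaluated on every example
J <- (1 / m) * sum(-y * log(h) - (1 - y) * log(1 - h))  # cost J(theta)
grad <- (1 / m) * (t(X) %*% (h - y))                    # gradient vector
```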
You should now submit your solutions.
In the previous assignment, you found the optimal parameters of a linear regression model by implementing gradient descent. You wrote a cost function and calculated its gradient, then took a gradient descent step accordingly. This time, instead of taking gradient descent steps, you will use an R built-in function called `optim`. R's `optim` is an optimization solver that finds the minimum of an unconstrained [2] function. For logistic regression, you want to optimize the cost function $J(\theta)$ with parameters $\theta$.

Concretely, you are going to use `optim` to find the best parameters $\theta$ for the logistic regression cost function, given a fixed dataset (of X and y values). You will pass to `optim` the following inputs:
- The initial values of the parameters $\theta$ we are trying to optimize.
- Two functions that, when given the training set and a particular $\theta$, compute the logistic regression cost and gradient, respectively, with respect to $\theta$ for the dataset (X, y).
In `ex2.R`, we already have code written to call `optim` with the correct arguments.

```r
# costFunction(X, y) and grad(X, y) each return a function of theta (a
# closure), which optim then minimizes starting from initial_theta
optimRes <- optim(par = initial_theta, fn = costFunction(X, y), gr = grad(X, y),
                  method = "BFGS", control = list(maxit = 400))
theta <- optimRes$par    # the optimal parameters
cost <- optimRes$value   # the cost at the optimum
```
In this code snippet, we first define the options to be used with `optim`. Specifically, we set the `gr` parameter to a function to tell `optim` to use the gradient when minimizing the function. If it is unset, `optim` uses a finite-difference approximation to compute the gradient. The method BFGS is a quasi-Newton method (also known as a variable metric algorithm) that uses both the function values and the gradients to build up a picture of the surface to be optimized. Furthermore, we set the `maxit` option to 400, so that `optim` will run for at most 400 steps before it terminates.
If you have completed the `costFunction` correctly, `optim` will converge on the right optimization parameters and return the final values of the cost and $\theta$. Notice that by using `optim`, you did not have to write any loops yourself or set a learning rate like you did for gradient descent. This is all done by `optim`: you only needed to provide a function calculating the cost and another function for the gradient.
Once `optim` completes, `ex2.R` will call your `costFunction` function using the optimal parameters of $\theta$. You should see that the cost is about 0.203. This final $\theta$ value will then be used to plot the decision boundary on the training data, resulting in a figure similar to Figure 2. We also encourage you to look at the code in `plotDecisionBoundary.R` to see how to plot such a boundary using the $\theta$ values.
After learning the parameters, you can use the model to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should expect to see an admission probability of 0.776. Another way to evaluate the quality of the parameters we have found is to see how well the learned model predicts on our training set.
Figure 2: Training data with decision boundary
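To make that first prediction concrete, here is one way to compute it yourself (a sketch, assuming `theta` holds the parameters returned by `optim`):

```r
# probability of admission for exam scores 45 and 85;
# the leading 1 is the intercept term
prob <- sigmoid(c(1, 45, 85) %*% theta)  # expect about 0.776
```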
In this part, your task is to complete the code in `predict.R`. The `predict` function will produce 1 or 0 predictions given a dataset and a learned parameter vector $\theta$. After you have completed the code in `predict.R`, the `ex2.R` script will proceed to report the training accuracy of your classifier by computing the percentage of examples it got correct.
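The accuracy computation itself is a one-liner; here is a sketch, assuming your `predict` returns a vector of 0s and 1s and takes the arguments `predict(theta, X)`:

```r
p <- predict(theta, X)          # 0/1 predictions on the training set
accuracy <- mean(p == y) * 100  # percentage of correctly classified examples
```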
You should now submit your solutions.
In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly. Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.
You will use another script, `ex2_reg.R`, to complete this portion of the exercise.
Similar to the previous parts of this exercise, `plotData` is used to generate a figure like Figure 3, where the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.
Figure 3: Plot of training data
Figure 3 shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression is only able to find a linear decision boundary.
One way to fit the data better is to create more features from each data point. In the provided function `mapFeature.R`, we will map the features into all polynomial terms of $x_1$ and $x_2$ up to the sixth power.
As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimension feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot. While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next parts of the exercise, you will implement regularized logistic regression to fit the data and also see for yourself how regularization can help combat the overfitting problem.
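A minimal sketch of such a feature mapping is shown below; the provided `mapFeature.R` may differ in its details, but the idea is the same:

```r
# map two features into all polynomial terms (x1^(i-j) * x2^j) of total
# degree up to 6; the result has 28 columns, including the leading
# intercept column of ones
mapFeature <- function(x1, x2, degree = 6) {
  out <- matrix(1, length(x1), 1)
  for (i in 1:degree) {
    for (j in 0:i) {
      out <- cbind(out, (x1^(i - j)) * (x2^j))
    }
  }
  out
}
```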
Now you will implement code to compute the cost function and gradient for regularized logistic regression. Complete the code in `costFunctionReg.R` to return the cost and gradient. Recall that the regularized cost function in logistic regression is

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Note that you should not regularize the parameter $\theta_0$. In R, recall that indexing starts from 1; hence, you should not be regularizing the `theta[1]` parameter (which corresponds to $\theta_0$) in the code. The gradient of the cost function is a vector where the $j$-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \qquad \text{for } j \geq 1$$
Once you are done, `ex2_reg.R` will call your `costFunctionReg` function using the initial value of $\theta$ (initialized to all zeros). You should see that the cost is about 0.693.
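Again, a minimal vectorized sketch (not necessarily the graded solution), assuming `X` is the mapped feature matrix and `lambda` is the regularization parameter:

```r
m <- length(y)
h <- sigmoid(X %*% theta)
reg_theta <- theta
reg_theta[1] <- 0  # exclude theta[1] (i.e., theta_0) from regularization
J <- (1 / m) * sum(-y * log(h) - (1 - y) * log(1 - h)) +
  (lambda / (2 * m)) * sum(reg_theta^2)
grad <- (1 / m) * (t(X) %*% (h - y)) + (lambda / m) * reg_theta
```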
You should now submit your solutions.
Similar to the previous parts, you will use `optim` to learn the optimal parameters $\theta$. If you have completed the cost and gradient for regularized logistic regression (`costFunctionReg.R`) correctly, you should be able to step through the next part of `ex2_reg.R` to learn the parameters $\theta$ using `optim`.
To help you visualize the model learned by this classifier, we have provided the function `plotDecisionBoundary.R`, which plots the (non-linear) decision boundary that separates the positive and negative examples. In `plotDecisionBoundary.R`, we plot the non-linear decision boundary by computing the classifier's predictions on an evenly spaced grid and then drawing a contour plot of where the predictions change from $y = 0$ to $y = 1$. After learning the parameters $\theta$, the next step in `ex2_reg.R` will plot a decision boundary similar to Figure 4.
In this part of the exercise, you will get to try out different regularization parameters $\lambda$ for the dataset to understand how regularization prevents overfitting. Notice the changes in the decision boundary as you vary $\lambda$. With a small $\lambda$, you should find that the classifier gets almost every training example correct, but draws a very complicated boundary, thus overfitting the data (Figure 5). This is not a good decision boundary: for example, it predicts that a point at x = (−0.25, 1.5) is accepted (y = 1), which seems to be an incorrect decision given the training set. With a larger $\lambda$, you should see a plot that shows a simpler decision boundary which still separates the positives and negatives fairly well. However, if $\lambda$ is set to too high a value, you will not get a good fit and the decision boundary will not follow the data so well, thus underfitting the data (Figure 6).
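A sketch of how you might run this experiment, assuming the same closure style as before, with a hypothetical `costFunctionReg(X, y, lambda)` that returns a function of `theta`:

```r
# re-learn the parameters for several regularization strengths and compare
# the resulting decision boundaries
for (lambda in c(0, 1, 100)) {
  res <- optim(par = initial_theta, fn = costFunctionReg(X, y, lambda),
               method = "BFGS", control = list(maxit = 400))
  # plot the boundary for res$par here, e.g., via plotDecisionBoundary
}
```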
You do not need to submit any solutions for these optional (ungraded) exercises.
Figure 4: Training data with decision boundary ($\lambda = 1$)
Figure 5: No regularization (Overfitting) ($\lambda = 0$)
Figure 6: Too much regularization (Underfitting) ($\lambda = 100$)
After completing various parts of the assignment, be sure to use the `submit` function to submit your solutions to our servers. The following is a breakdown of how each part of this exercise is scored.
| Part | Submitted File | Points |
|---|---|---|
| Sigmoid function | `sigmoid.R` | 5 points |
| Compute cost for logistic regression | `costFunction.R` | 30 points |
| Gradient for logistic regression | `costFunction.R` | 30 points |
| Predict function | `predict.R` | 5 points |
| Compute cost for regularized LR | `costFunctionReg.R` | 15 points |
| Gradient for regularized LR | `costFunctionReg.R` | 15 points |
| Total Points | | 100 points |
You are allowed to submit your solutions multiple times, and we will take only the highest score into consideration.
1. Beware that the `fmincg` and `fminunc` optimization functions in MATLAB take a single function as input and compute the cost and gradient simultaneously. In contrast, the cost and gradient functions must be supplied to R's `optim` or `lbfgsb3` individually.
2. Constraints in optimization often refer to constraints on the parameters, for example, constraints that bound the possible values $\theta$ can take (e.g., $\theta \leq 1$). Logistic regression does not have such constraints, since $\theta$ is allowed to take any real value.