I wrote this code for my own edification and for my portfolio. The main goal was to create very clear implementations of an MLP, forward propagation, backward propagation and an actor-critic training algorithm.
I also wanted to write up a thorough derivation of the back propagation algorithm. That derivation is provided below in the back propagation section.
```julia
import Pkg; Pkg.add("Plots")
```
Use the actor-critic algorithm to train an MLP on a cartpole environment and then render an animation:
```julia
include("src/demo/cartpole_actor_critic.jl");
actor_network = train_cartpole(1e5);
render_cartpole(actor_network, 200)
```
After 10000 training iterations: *(animation)*

After 1000000 training iterations: *(animation)*
- A Multilayer Perceptron (MLP) is a type of neural network with certain evaluation rules.
- An MLP contains $L$ layers and $W = L - 1$ sets of weights.
- Each layer is a vector of real numbers. We represent the $i^{\text{th}}$ element of the $k^{\text{th}}$ layer as $l_i^{(k)}$. The length (number of neurons) of the $k^{\text{th}}$ layer is given by $|l^{(k)}|$.
- Each set of weights is a matrix of real numbers. We represent the $(i,j)^{\text{th}}$ element of the $k^{\text{th}}$ set of weights as $w_{ij}^{(k)}$.
- For simplified computation, biases are integrated into the weight matrices. This is accomplished by adding a zeroth column to each weight matrix that contains the biases associated with the corresponding layer. Additionally, a zeroth element is added to each layer vector that's always equal to 1. Specifically, for all $1 \le m \le L$ and $1 \le n \le W$ we have $$l_0^{(m)} = 1 \qquad w_{i0}^{(n)} = b_i$$ where $b_i$ is the bias of the $i^{\text{th}}$ neuron fed by the $n^{\text{th}}$ set of weights.
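As a concrete sketch of this convention (all values below are made-up examples), the bias column and the constant element turn the affine map $Wl + b$ into a single matrix-vector product:

```julia
# Illustrative example of the bias convention: prepend a bias column to the
# weight matrix and a constant 1 to the layer vector, so that the affine map
# W*l .+ b becomes a single matrix-vector product. All values are made up.
W = [0.5 -0.2; 0.1 0.3]                  # weights between two 2-neuron layers
b = [0.7, -0.4]                          # biases for the next layer
W_with_bias = hcat(b, W)                 # zeroth column holds the biases
l = [1.0, 2.0]                           # current layer values
l_with_bias = vcat(1.0, l)               # zeroth element is always 1
next_layer = W_with_bias * l_with_bias   # equals W * l .+ b
```

Here `next_layer` equals `W * l .+ b`, so downstream code never has to treat the bias as a separate term.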
The simplest MLP we can have contains just 2 layers and 1 set of weights. Given the first layer (inputs), we compute the values of the second layer via

$$l_i^{(2)} = \sum_{j=0}^{|l^{(1)}|} w_{ij}^{(1)} l_j^{(1)}$$

Note that the sum starts at $j = 0$; the $j = 0$ term is the bias, since $l_0^{(1)} = 1$ and $w_{i0}^{(1)} = b_i$.
We could also represent this expression as matrix-vector multiplication. For example,

$$l^{(2)} = w^{(1)} l^{(1)}$$
For a general MLP the values of the hidden layers are computed by applying an activation function $\sigma$ to the weighted sums.

Hidden layers ($2 \le k \le L-1$):

$$l_i^{(k)} = \sigma\left( \sum_{j=0}^{|l^{(k-1)}|} w_{ij}^{(k-1)} l_j^{(k-1)} \right)$$

The outputs are computed without applying an activation function:

$$l_i^{(L)} = \sum_{j=0}^{|l^{(L-1)}|} w_{ij}^{(L-1)} l_j^{(L-1)}$$
For certain training algorithms, however, other functions are applied to the outputs of the last layer. For example, the `softmax()` function is used to convert the outputs to a set of probabilities.
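A minimal `softmax` sketch (an assumed implementation for illustration, not necessarily the one used in this repo):

```julia
# One possible softmax implementation; the max-subtraction keeps exp() from
# overflowing without changing the result.
softmax(x) = exp.(x .- maximum(x)) ./ sum(exp.(x .- maximum(x)))

outputs = [2.0, 1.0, -1.0]   # made-up MLP outputs
probs = softmax(outputs)     # nonnegative values that sum to 1
```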
The forward propagation algorithm is implemented in `propagate.jl`:
```julia
layers[1].values .= inputs
for i in 2:lastindex(layers)-1
    layers[i].values .= weights[i-1].values_with_bias * layers[i-1].values_with_bias
    layers[i].values .= propagator.activation.(layers[i].values)
end
layers[end].values .= weights[end].values_with_bias * layers[end-1].values_with_bias
```
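The same loop can be sketched as a self-contained function on plain matrices (a hypothetical standalone version, not the repo's API; each weight matrix is assumed to already carry the bias column described earlier):

```julia
σ(x) = tanh(x)                 # assumed activation function
with_bias(v) = vcat(1.0, v)    # prepend the constant element l₀ = 1

# Hidden layers apply σ; the output layer is a plain matrix-vector product.
function forward(weights, inputs)
    l = inputs
    for k in 1:length(weights)-1
        l = σ.(weights[k] * with_bias(l))
    end
    return weights[end] * with_bias(l)
end

weights = [randn(3, 3), randn(2, 4)]   # 2 inputs -> 3 hidden -> 2 outputs
y = forward(weights, [0.5, -0.5])
```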
The expressions above define a recurrence relation that can be combined to express the values of the last layer as a function of the first layer (and all the weights):
Let $f^{(k)}$ denote the function that computes layer $k$ from layer $k-1$, i.e. $f^{(k)}(v) = \sigma\left(w^{(k-1)} v\right)$ for the hidden layers and $f^{(L)}(v) = w^{(L-1)} v$ for the output layer.

Let $F = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(2)}$.

If we first run forward propagation on the MLP then we have $l^{(k)} = f^{(k)}\left(l^{(k-1)}\right)$ for every layer.

And so we can write

$$l^{(L)} = F\left(l^{(1)}\right)$$
Given a set of inputs and outputs we improve the accuracy of an MLP by adjusting the weights (and biases). This means that we need to consider the loss function as a function of the weights:

$$E = E\left(w^{(1)}, \ldots, w^{(W)}\right)$$

We update the weights incrementally by applying the gradient descent algorithm to the loss function. In order to do that we need an expression for the gradient of the loss function with respect to the weights. In other words, for each $k$, $i$ and $j$ we need

$$\frac{\partial E}{\partial w_{ij}^{(k)}}$$
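A single gradient-descent step then looks like the following sketch (all values are made up; `η` is the learning rate):

```julia
# One gradient-descent step: move each weight against its gradient,
# scaled by a learning rate η. Illustrative values only.
η = 0.1
W = [1.0 2.0; 3.0 4.0]
gradW = [0.5 0.0; 0.0 -0.5]   # stands in for ∂E/∂w for this weight matrix
W .-= η .* gradW              # W is now [0.95 2.0; 3.0 4.05]
```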
First let's compute the gradients for the last set of weights:

$$\frac{\partial E}{\partial w_{ij}^{(W)}} = \sum_m \frac{\partial E}{\partial l_m^{(L)}} \frac{\partial l_m^{(L)}}{\partial w_{ij}^{(W)}} = \sum_m \frac{\partial E}{\partial l_m^{(L)}} \frac{\partial}{\partial w_{ij}^{(W)}} \sum_n w_{mn}^{(W)} l_n^{(L-1)} = \sum_m \frac{\partial E}{\partial l_m^{(L)}} \sum_n l_n^{(L-1)} \frac{\partial w_{mn}^{(W)}}{\partial w_{ij}^{(W)}}$$

In the last step we used the fact that, according to the forward propagation recurrence, the second to last layer $l^{(L-1)}$ does not depend on the last set of weights $w^{(W)}$.

The partial derivative at the end of the last expression is either 1 or 0. It's 1 if both $m = i$ and $n = j$, and 0 otherwise, so the sums collapse to a single term.

The gradients for the last set of weights are thus

$$\frac{\partial E}{\partial w_{ij}^{(W)}} = \frac{\partial E}{\partial l_i^{(L)}} \, l_j^{(L-1)}$$
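This formula can be spot-checked numerically, assuming a squared-error loss for concreteness (all names and values here are illustrative):

```julia
# Spot-check ∂E/∂w_ij = (∂E/∂l_i) * l_j numerically, assuming a squared-error
# loss E = ½‖W*h − y‖² on a linear output layer. All values are made up.
h = [1.0, -2.0, 0.5]               # second to last layer l^(L-1)
W = [0.2 0.1 -0.3; 0.4 -0.1 0.2]   # last set of weights
y = [1.0, 0.0]                     # targets
E(W) = 0.5 * sum((W * h .- y) .^ 2)

analytic = (W * h .- y) * h'       # outer product: ∂E/∂l_i^(L) ⋅ l_j^(L-1)
ϵ = 1e-6
Wp = copy(W); Wp[1, 2] += ϵ
numeric = (E(Wp) - E(W)) / ϵ       # ≈ analytic[1, 2]
```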
To compute the gradients for the other sets of weights we also need to handle derivatives of the activation function $\sigma$. Applying the chain rule for the second to last set of weights:

$$\frac{\partial E}{\partial w_{ij}^{(W-1)}} = \sum_m \frac{\partial E}{\partial l_m^{(L)}} \sum_n w_{mn}^{(W)} \frac{\partial l_n^{(L-1)}}{\partial w_{ij}^{(W-1)}}$$

Differentiating the hidden-layer recurrence $l_n^{(L-1)} = \sigma\left( \sum_p w_{np}^{(W-1)} l_p^{(L-2)} \right)$ gives

$$\frac{\partial l_n^{(L-1)}}{\partial w_{ij}^{(W-1)}} = \sigma'\left( \sum_p w_{np}^{(W-1)} l_p^{(L-2)} \right) \sum_p l_p^{(L-2)} \frac{\partial w_{np}^{(W-1)}}{\partial w_{ij}^{(W-1)}}$$

In that last step we were able to move $l_p^{(L-2)}$ out of the derivative because the third to last layer does not depend on the second to last set of weights. As before, the remaining partial derivative is 1 only when $n = i$ and $p = j$, so

$$\frac{\partial l_n^{(L-1)}}{\partial w_{ij}^{(W-1)}} = \begin{cases} \sigma'\left( \sum_p w_{ip}^{(W-1)} l_p^{(L-2)} \right) l_j^{(L-2)} & n = i \\ 0 & n \ne i \end{cases}$$
We now define another function that simplifies the notation; we'll explain later why this notation is useful:

$$\delta_i^{(L)} = \frac{\partial E}{\partial l_i^{(L)}} \qquad \delta_i^{(L-1)} = \sigma'\left( \sum_p w_{ip}^{(W-1)} l_p^{(L-2)} \right) \sum_m \delta_m^{(L)} w_{mi}^{(W)}$$

Putting everything together we now have the following expression for the gradients of the second to last set of weights:

$$\frac{\partial E}{\partial w_{ij}^{(W-1)}} = \delta_i^{(L-1)} l_j^{(L-2)}$$
Once we compute the gradients for the third to last set of weights the recurrence will become apparent. Repeating the chain rule one layer further:

$$\frac{\partial E}{\partial w_{ij}^{(W-2)}} = \sum_n \delta_n^{(L-1)} \sum_q w_{nq}^{(W-1)} \frac{\partial l_q^{(L-2)}}{\partial w_{ij}^{(W-2)}}$$

And then, exactly as before,

$$\frac{\partial l_q^{(L-2)}}{\partial w_{ij}^{(W-2)}} = \begin{cases} \sigma'\left( \sum_p w_{ip}^{(W-2)} l_p^{(L-3)} \right) l_j^{(L-3)} & q = i \\ 0 & q \ne i \end{cases}$$

Putting everything together we now have the following expression for the gradients of the third to last set of weights:

$$\frac{\partial E}{\partial w_{ij}^{(W-2)}} = \sigma'\left( \sum_p w_{ip}^{(W-2)} l_p^{(L-3)} \right) \left( \sum_n \delta_n^{(L-1)} w_{ni}^{(W-1)} \right) l_j^{(L-3)}$$
And now let's swap the indices: defining

$$\delta_i^{(L-2)} = \sigma'\left( \sum_p w_{ip}^{(W-2)} l_p^{(L-3)} \right) \sum_m \delta_m^{(L-1)} w_{mi}^{(W-1)}$$

makes the gradients $\frac{\partial E}{\partial w_{ij}^{(W-2)}} = \delta_i^{(L-2)} l_j^{(L-3)}$, mirroring the expression for the previous set of weights. Note that the weight indices appear as $w_{mi}$ rather than $w_{im}$: backward propagation multiplies by the transposed weight matrices.

We now have what looks like the recurrence relation that we need:

$$\delta_i^{(k)} = \sigma'\left( \sum_p w_{ip}^{(k-1)} l_p^{(k-1)} \right) \sum_m \delta_m^{(k+1)} w_{mi}^{(k)} \qquad \frac{\partial E}{\partial w_{ij}^{(k)}} = \delta_i^{(k+1)} l_j^{(k)}$$
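The recurrence can be sketched end to end for a 3-layer MLP, again assuming a squared-error loss and `tanh` activation (illustrative code, not the repo's API; biases are omitted for brevity):

```julia
# End-to-end sketch of the δ recurrence for a 3-layer MLP (one hidden layer),
# assuming tanh activation and a squared-error loss E = ½‖out − y‖².
σ(x) = tanh(x)
dσ(x) = 1 - tanh(x)^2

x = [0.3, -0.7]; y = [0.5]
W1 = randn(3, 2)              # first set of weights
W2 = randn(1, 3)              # last set of weights

z = W1 * x                    # hidden pre-activation
h = σ.(z)                     # hidden layer l^(2)
out = W2 * h                  # linear output layer l^(3)

δ2 = out .- y                 # δ^(L) = ∂E/∂l^(L) for squared error
δ1 = dσ.(z) .* (W2' * δ2)     # recurrence: transposed weights pull δ back
gradW2 = δ2 * h'              # ∂E/∂W2 = δ^(L) (l^(L-1))ᵀ
gradW1 = δ1 * x'              # ∂E/∂W1 = δ^(L-1) (l^(1))ᵀ
```

A finite-difference check of `gradW1` against the loss confirms the recurrence produces the true gradients.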