Now that we've explored some methods for tuning our networks, we'll look at further concepts and techniques for reducing training time. These will begin to form a more cohesive framework for the choices you make throughout the modeling process.
You will be able to:
- Explain what normalization does to training time with neural networks and why
- Explain what a vanishing or exploding gradient is, and how it is related to model convergence
- Compare the different optimizer strategies for neural networks
One way to speed up training of your neural networks is to normalize the input. In fact, even if training time were not a concern, normalization to a consistent scale (typically 0 to 1) across features should be used to ensure that the process converges to a stable solution. Similar to some of our previous work in training models, one general process for standardizing our data is subtracting the mean and dividing by the standard deviation.
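As a minimal sketch of this standardization step with NumPy (the feature matrix `X` here is made up):

```python
import numpy as np

# Hypothetical feature matrix: 5 samples, 3 features on very different scales
X = np.array([[50, 0.001, 30000],
              [20, 0.004, 80000],
              [30, 0.002, 50000],
              [60, 0.003, 40000],
              [10, 0.005, 90000]], dtype=float)

# Standardize each feature (column): subtract the mean, divide by the std
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # ~0 for each feature
print(X_norm.std(axis=0))   # ~1 for each feature
```

Note that the same $\mu$ and $\sigma$ computed on the training data should also be used to scale any validation or test data.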
Not only will normalizing your inputs speed up training, it can also mitigate other risks inherent in training neural networks. For example, input features with widely varying ranges can cause numerical problems when the algorithm computes gradients during forward and back propagation. This can produce unstable values and prevent the algorithm from converging to a solution. In short, make sure you normalize your data! Here's a little more mathematical background:
To demonstrate, imagine a very deep neural network with $L$ layers. Assume that every layer uses the identity activation $g(z) = z$ and that all biases $b^{[l]} = 0$. Recall that each layer computes $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, so under these assumptions the output reduces to:

$\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x$

Imagine two nodes in each layer, with every hidden layer sharing the weight matrix

$W^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$

Then $\hat{y} = W^{[L]} \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}^{L-1} x$, and the activations grow like $1.5^{L}$: they explode exponentially with the depth of the network. Similarly, if the diagonal entries were 0.5 instead of 1.5, the activations would shrink like $0.5^{L}$ and vanish. Even if the weights are only slightly larger or smaller than 1, in a sufficiently deep network the activations, and by the same argument the gradients, will explode or vanish, which makes gradient descent painfully slow or numerically unstable.
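You can verify this numerically. The following sketch pushes an input through 50 such layers (the depth and the specific values are just illustrative):

```python
import numpy as np

x = np.array([1.0, 1.0])
W_big = np.array([[1.5, 0.0], [0.0, 1.5]])    # entries slightly above 1
W_small = np.array([[0.5, 0.0], [0.0, 0.5]])  # entries below 1

a_big, a_small = x.copy(), x.copy()
for _ in range(50):  # 50 layers with linear activations and zero biases
    a_big = W_big @ a_big
    a_small = W_small @ a_small

print(a_big)    # ~1.5**50 ≈ 6.4e8   -> exploding activations
print(a_small)  # ~0.5**50 ≈ 8.9e-16 -> vanishing activations
```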
Aside from normalizing the data, you can also investigate the impact of how the parameters are initialized when you first launch the gradient descent algorithm.
For initialization, the more input features feeding into layer $l$, the smaller you want each of the weights $w^{[l]}_{ij}$ to be. A common rule of thumb is to set the variance of the weights to:

$\text{Var}(w_i) = \dfrac{1}{n^{[l-1]}}$

or

$\text{Var}(w_i) = \dfrac{2}{n^{[l-1]}}$

where $n^{[l-1]}$ is the number of units in the previous layer. One common initialization strategy for the ReLU activation function uses the second of these:

```python
w_l = np.random.randn(shape) * np.sqrt(2 / n_prev)  # n_prev = n^[l-1]
```
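As a concrete (hypothetical) example, initializing a single ReLU layer with 784 inputs and 64 units this way:

```python
import numpy as np

n_prev, n_l = 784, 64  # hypothetical layer sizes: n^[l-1] inputs, n^[l] units

# Scale the variance of the weights by 2 / n^[l-1], suited to ReLU layers
W = np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)
b = np.zeros((n_l, 1))

print(W.var())  # roughly 2 / 784 ≈ 0.00255
```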
Later, we'll discuss other initialization strategies pertinent to other activation functions.
In addition, you could even use an alternative convergence algorithm instead of plain gradient descent. One issue with gradient descent is that it can oscillate to a fairly big extent: when the derivative is much bigger in one direction (say, the vertical one) than in another, the steps bounce back and forth across the valley instead of heading straight toward the minimum, which forces you to use a small learning rate.
With that, here are some optimization algorithms that work faster than gradient descent:
Momentum:

Compute an exponentially weighted average of the gradients and use that average in place of the raw gradient. The intuitive interpretation is that this successively dampens oscillations, improving convergence. On each iteration:

- Compute $dW$ and $db$ on the current minibatch
- Compute $V_{dW} = \beta V_{dW} + (1-\beta)dW$
- Compute $V_{db} = \beta V_{db} + (1-\beta)db$

These are the moving averages for the derivatives of $W$ and $b$. The parameters are then updated using these averages rather than the raw gradients:

$W := W - \alpha V_{dW}$ and $b := b - \alpha V_{db}$

This averages out gradient descent and will "dampen" oscillations. Generally, $\beta = 0.9$ is a good hyperparameter value.
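As a minimal sketch, assuming `dW` and `db` have already been computed by backpropagation on the current minibatch, one momentum update might look like this (the function name and shapes are just for illustration):

```python
import numpy as np

def momentum_update(W, b, dW, db, V_dW, V_db, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum; V_dW and V_db start at zeros."""
    V_dW = beta * V_dW + (1 - beta) * dW   # moving average of dW
    V_db = beta * V_db + (1 - beta) * db   # moving average of db
    W = W - alpha * V_dW                   # step using the smoothed gradient
    b = b - alpha * V_db
    return W, b, V_dW, V_db

# Hypothetical usage with made-up shapes and gradients
W, b = np.ones((3, 2)), np.zeros((3, 1))
V_dW, V_db = np.zeros_like(W), np.zeros_like(b)
dW, db = 0.1 * np.ones_like(W), 0.1 * np.ones_like(b)
W, b, V_dW, V_db = momentum_update(W, b, dW, db, V_dW, V_db)
```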
RMSprop stands for "root mean square prop". It slows down learning in directions that oscillate and speeds it up in directions that make steady progress. On each iteration, it keeps an exponentially weighted average of the *squares* of the derivatives:

$S_{dW} = \beta S_{dW} + (1-\beta)dW^2$ and $S_{db} = \beta S_{db} + (1-\beta)db^2$

and then divides the gradient by the square root of this average when updating:

$W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}}}$ and $b := b - \alpha \dfrac{db}{\sqrt{S_{db}}}$

In a direction where we want to learn fast, the corresponding $S$ is small, so dividing by $\sqrt{S}$ produces a relatively larger step; in a direction that oscillates, $S$ is large and the step gets scaled down. Often, a small $\epsilon$ is added to the denominator (i.e., $\sqrt{S_{dW}} + \epsilon$) to avoid dividing by a number very close to zero.
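A minimal sketch of one RMSprop update, in the same style as the momentum sketch above (again, the function name and defaults are illustrative):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, S_dW, S_db, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop step; S_dW and S_db start at zeros."""
    S_dW = beta * S_dW + (1 - beta) * dW ** 2   # moving average of squared dW
    S_db = beta * S_db + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)  # large S -> smaller step
    b = b - alpha * db / (np.sqrt(S_db) + eps)
    return W, b, S_dW, S_db
```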
"Adaptive Moment Estimation", basically using the first and second moment estimations. Works very well in many situations! It takes momentum and RMSprop to put it together!
Initialize:
For each iteration:
Compute
It's like momentum and then RMSprop. We need to perform a correction! This is sometimes also done in RSMprop, but definitely here too.
Hyperparameters:
$\alpha$ $\beta_1 = 0.9$ $\beta_2 = 0.999$ $\epsilon = 10^{-8}$
Generally, only
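Putting the pieces together, here is a minimal sketch of one Adam update for a single parameter array (the function name and shapes are just for illustration):

```python
import numpy as np

def adam_update(param, grad, V, S, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter array; t counts iterations from 1."""
    V = beta1 * V + (1 - beta1) * grad        # momentum-style first moment
    S = beta2 * S + (1 - beta2) * grad ** 2   # RMSprop-style second moment
    V_corr = V / (1 - beta1 ** t)             # bias corrections
    S_corr = S / (1 - beta2 ** t)
    param = param - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return param, V, S
```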
With learning rate decay, the learning rate decreases across epochs. One common schedule is:

$\alpha = \dfrac{1}{1 + \text{decay rate} \times \text{epoch number}} \alpha_0$

Other methods:

- $\alpha = 0.97^{\text{epoch number}} \alpha_0$ (or exponential decay more generally)
- $\alpha = \dfrac{k}{\sqrt{\text{epoch number}}} \alpha_0$
- A discrete staircase, dropping $\alpha$ by a fixed factor every so many epochs
- Manual decay, watching the training run and lowering $\alpha$ by hand
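As a quick sketch of the first schedule above (the initial rate and decay rate are made-up values):

```python
alpha_0 = 0.1     # hypothetical initial learning rate
decay_rate = 1.0  # hypothetical decay rate

for epoch in range(5):
    alpha = alpha_0 / (1 + decay_rate * epoch)
    print(epoch, round(alpha, 4))
# 0 0.1, 1 0.05, 2 0.0333, 3 0.025, 4 0.02
```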
Now that you've seen some optimization algorithms, take another look at all the hyperparameters that need tuning:
Most important:

- $\alpha$ (learning rate)

Next:

- $\beta$ (momentum)
- Number of hidden units
- Mini-batch size

Finally:

- Number of layers
- Learning rate decay

Almost never tuned:

- $\beta_1$, $\beta_2$, $\epsilon$ (Adam)
Things to do:

- Don't use a grid search, because it's hard to say in advance which hyperparameters will turn out to be important; sampling random combinations tries many more distinct values of each hyperparameter, as sketched below
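As a sketch of what random sampling might look like (the ranges are illustrative, not recommendations; note that $\alpha$ is sampled on a log scale):

```python
import numpy as np

rng = np.random.default_rng(42)

for _ in range(5):
    alpha = 10 ** rng.uniform(-4, -1)     # learning rate between 1e-4 and 1e-1
    n_hidden = rng.integers(10, 100)      # number of hidden units
    batch_size = 2 ** rng.integers(4, 9)  # mini-batch size: 16 ... 256
    print(alpha, n_hidden, batch_size)
```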
Additional resources:

- https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs
- https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum
In this lesson, you began learning about issues regarding the convergence of neural network training. This included the need for normalization, the role of parameter initialization, and some optimization algorithms. In the upcoming lab, you'll further investigate these ideas in practice and observe their impacts from various perspectives.