While ReLU is non-differentiable at 0, in practice we default to the left-side derivative, i.e. 0. This is not strictly correct mathematically, but the input rarely lands exactly on 0 during training (unless working with very low precision), and even if it does, we can neglect it for that iteration and rely on the other iterations.
Its gradient is well-behaved: it either passes the gradient through unchanged or blocks it, without scaling it in the positive region. This mitigates the vanishing gradient problem.
There are also variations of ReLU, e.g. parametrized ReLU (pReLU), defined by $ \text{pReLU}(x) = \max(0, x) + \alpha \min(0, x) $, where $\alpha$ is a learnable slope for the negative part.
Other variations include GELU (Gaussian Error Linear Unit).
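A minimal NumPy sketch (my own, not from the notes) of ReLU with the conventional gradient of 0 at the origin, plus pReLU with a hypothetical slope `alpha`:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Convention: use the left-side derivative at 0, i.e. gradient 0 there.
    return (x > 0).astype(x.dtype)

def prelu(x, alpha=0.25):
    # Parametrized ReLU: alpha scales the negative part (learned in practice).
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_grad(x), prelu(x))
```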
Also called the squashing function, the sigmoid maps every real number into the interval (0, 1):
$ \text{sigmoid}(x) = \frac{1}{1+\exp(-x)}, \qquad \frac{d}{dx}\text{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \text{sigmoid}(x)\,(1-\text{sigmoid}(x)) $
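A quick sketch of my own (NumPy) to see why the sigmoid contributes to vanishing gradients: its derivative peaks at 0.25 and decays rapidly for large $|x|$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'(x)={sigmoid_grad(x):.6f}")
# The gradient is at most 0.25 and essentially zero for |x| >= 10,
# so products of many such factors shrink quickly in deep networks.
```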
Note: We usually select the width of layers as powers of 2.
This is computationally efficient due to the way memory is allocated and addressed in hardware implementations.
Consider a simple two-layer MLP (one hidden layer). Draw the computational graph, and do forward and backward propagation on paper. I'm not writing out the full derivation here because it is lengthy, but you can look it up here: https://d2l.ai/chapter_multilayer-perceptrons/backprop.html
When training with mini-batch SGD using the backpropagation algorithm, the forward pass uses the current parameter values to compute the loss and stores all intermediate variables along the way. Backpropagation then traverses the path from the loss back to each parameter in the reverse order of the forward pass, reusing those stored intermediate values to compute the gradients and update the parameters. Storing them avoids recomputing them during the backward pass. This is why training requires much more memory than inference: the memory footprint grows roughly linearly with the number of examples in the batch and the depth of the network.
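To make the bookkeeping concrete, here is a minimal NumPy sketch of my own for a one-hidden-layer MLP with squared loss (shapes and names are arbitrary). The intermediates `z`, `h`, and `o` cached in the forward pass are exactly what the backward pass reuses:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))        # mini-batch of 32 examples, 8 features
y = rng.normal(size=(32, 1))
W1, b1 = rng.normal(size=(8, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)

# Forward pass: store intermediate variables for reuse in the backward pass.
z = X @ W1 + b1                      # hidden pre-activation
h = np.maximum(z, 0.0)               # ReLU activation
o = h @ W2 + b2                      # output
loss = np.mean((o - y) ** 2)

# Backward pass: traverse from the loss back to each parameter.
do = 2.0 * (o - y) / len(X)          # dL/do
dW2 = h.T @ do                       # reuses cached h
db2 = do.sum(axis=0)
dh = do @ W2.T
dz = dh * (z > 0)                    # reuses cached z (ReLU gradient)
dW1 = X.T @ dz
db1 = dz.sum(axis=0)
```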
Consider a deep neural network. Each layer can be written as $ \mathbf{h}^{(l)} = f_l(\mathbf{h}^{(l-1)}) $, so by the chain rule the gradient of the loss with respect to an early layer's parameters is a product of many Jacobian matrices; depending on the magnitudes of the factors, this product can vanish or explode.
One reason for vanishing gradients is a poorly chosen activation function such as the sigmoid, whose gradient is nearly zero away from the origin; this makes training of the early layers very slow or effectively impossible. A remedy is e.g. using ReLU.
In the case of exploding gradients, initialization can play a big role. Large gradients make training unstable (think of jumping from one side of a valley to the other).
One way to mitigate these problems is proper initialization (among many others).
It builds on the following derivation. A crucial assumption is that each layer is connected to the next purely linearly, without any non-linear activation, but the result is still very useful in practice nevertheless. For each layer, writing each output neuron as $ o_i = \sum_{j=1}^{n_{in}} w_{ij} x_j $ with zero-mean, independent weights and inputs, we get
$ \mathbb{E}[o_i] = \sum_{j=1}^{n_{in}} \mathbb{E}[w_{ij} x_j] = \sum_{j=1}^{n_{in}} \mathbb{E}[w_{ij}]\, \mathbb{E}[x_j] = 0 $
$ \text{var}[o_i] = \mathbb{E}[o_i^2] - \mathbb{E}[o_i]^2 = \mathbb{E}\Big[\sum_{j=1}^{n_{in}} w_{ij}^2 x_j^2\Big] - 0 = n_{in}\, \sigma_{w}^2 \sigma_{x}^2 $
We can apply the same argument to backward propagation, where the gradients are computed; this time the variance scales with $n_{out}$ instead of $n_{in}$.
Xavier initialization recommends choosing the weight variance as a compromise that accounts for both directions:
$ \tfrac{1}{2}(n_{in} + n_{out})\,\sigma_w^2 = 1 \rightarrow \sigma_w = \sqrt{\frac{2}{n_{in} + n_{out}}} $
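A quick empirical check of the derivation (my own NumPy sketch): with zero-mean inputs and weights, the output variance is roughly $n_{in}\sigma_w^2\sigma_x^2$, and Xavier's choice keeps it close to the input variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256

# Forward variance without Xavier: Var[o] ~= n_in * sigma_w^2 * sigma_x^2
sigma_w, sigma_x = 0.05, 1.0
W = rng.normal(0.0, sigma_w, size=(n_in, n_out))
x = rng.normal(0.0, sigma_x, size=(2_000, n_in))
print((x @ W).var(), n_in * sigma_w**2 * sigma_x**2)   # both ~1.28

# Xavier (Glorot) initialization: sigma_w = sqrt(2 / (n_in + n_out))
sigma_xavier = np.sqrt(2.0 / (n_in + n_out))
W_x = rng.normal(0.0, sigma_xavier, size=(n_in, n_out))
print((x @ W_x).var())  # close to sigma_x^2, up to the n_in/n_out compromise
```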
There is research showing that although deep networks can fit arbitrary random labelings or noise in the dataset, they tend to first learn the genuine patterns and only later overfit to the noise. This motivates early stopping as a generalization technique for DNNs: instead of constraining the values of the parameters, we put a constraint on the number of training epochs. A common way to determine the stopping criterion is to monitor the validation error throughout training, typically by checking once after each epoch, and to cut off training when the validation error has not decreased by more than some small amount for a number of epochs.
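A sketch of a typical early-stopping loop (my own; `train_one_epoch` and `validation_error` are hypothetical callables standing in for whatever training/evaluation code is used):

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=200, patience=5, min_delta=1e-4):
    """Stop once the validation error has not improved by more than
    min_delta for `patience` consecutive epochs.
    train_one_epoch and validation_error are caller-supplied callables."""
    best_err, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_err - min_delta:
            best_err, bad_epochs = err, 0   # meaningful improvement
        else:
            bad_epochs += 1                 # no real improvement this epoch
            if bad_epochs >= patience:
                break                       # cut off training
    return best_err
```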
In classical generalization theory, one way to reduce the generalization gap is to use simpler functions. One notion of simplicity is constraining the parameter space of the model, for example with weight decay: a smaller parameter norm can be thought of as simpler. Another notion of simplicity is smoothness, i.e. small perturbations of the input lead to small perturbations of the output.
Dropout has some analogy to this idea. Standard dropout drops some of the neurons at each layer during the forward propagation, before computing the subsequent layer. One thing to take care of is keeping the expectation of a layer the same before and after this perturbation. In practice this means (for standard dropout regularization):
$ h' = \begin{cases} 0 & \text{with probability } p \\ \frac{h}{1-p} & \text{otherwise} \end{cases} $
where $h$ is a neuron's value before dropout and $h'$ afterwards. This way, $ \mathbb{E}[h'] = p \cdot 0 + (1-p)\cdot\frac{h}{1-p} = h $.
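A minimal sketch of this (inverted) dropout in NumPy (my own illustration), applied only at training time:

```python
import numpy as np

def dropout_layer(h, p, rng=np.random.default_rng()):
    # Zero each neuron with probability p and rescale survivors by 1/(1-p),
    # so E[h'] = h and no rescaling is needed at inference time.
    assert 0.0 <= p < 1.0
    if p == 0.0:
        return h
    mask = (rng.random(h.shape) > p).astype(h.dtype)
    return mask * h / (1.0 - p)

h = np.ones((2, 8))
print(dropout_layer(h, p=0.5))
```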
For example, in price prediction, since the scale of the error depends on the item, we use the difference between the log of the target and the log of the estimate. This gives a relative difference measure: we can simply do MSE on the log values.
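A small sketch of this relative-error measure (my own; it is the usual log-RMSE used e.g. for house-price prediction):

```python
import numpy as np

def log_rmse(y_true, y_pred):
    # MSE on log values penalizes relative rather than absolute differences,
    # so a 10% error counts the same for cheap and expensive items.
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

print(log_rmse(np.array([100.0, 100000.0]), np.array([110.0, 110000.0])))
# Both predictions are off by 10%, so each contributes equally.
```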
With cross-validation you can tune hyperparameters for model selection. But remember that although k-fold cross-validation can be good, running it over an absurdly large set of hyperparameter configurations can backfire and select a configuration that overfits the k-fold training procedure itself but does poorly on the test set.
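A minimal k-fold split sketch (my own NumPy illustration; `evaluate` is a hypothetical callable that trains on the train split and scores the held-out fold):

```python
import numpy as np

def k_fold_scores(X, y, k, evaluate, rng=np.random.default_rng(0)):
    # Shuffle once, split the indices into k folds, average the validation score.
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```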