Ways to show convexity of a function:
- f is convex iff $\forall x, y \in X, \lambda \in [0,1]: f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$
- The Hessian of f is positive semi-definite: f is convex iff $\nabla^2 f \succeq 0$, or in other terms $x^T H x \geq 0$ for all $x$. The Hessian is the matrix of second derivatives.

Ways to handle constraints:
- Lagrangian: add each $c_i(x) \leq 0$ constraint to the objective function with a Lagrange multiplier. The saddle point of the resulting problem is the optimal point of the original constrained problem.
- Projection: project the gradients or the destination point at each step onto the convex set (see the sketch below).
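A minimal sketch of the projection idea, projected gradient descent onto the Euclidean unit ball; the objective, step count, and radius are illustrative choices, not from the notes:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Project x onto the Euclidean ball of the given radius (a convex set)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(grad, x0, lr=0.1, steps=100, radius=1.0):
    """Take a gradient step, then project back onto the feasible set."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = project_onto_ball(x - lr * grad(x), radius)
    return x

# Example: minimize f(x) = ||x - c||^2 subject to ||x|| <= 1, with c
# outside the ball; the constrained optimum is c / ||c||.
c = np.array([2.0, 2.0])
print(projected_gradient_descent(lambda x: 2 * (x - c), np.zeros(2)))
# approx [0.707, 0.707]
```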
In ML, computing the gradient over all training examples is very expensive, especially for large datasets. Therefore, i.i.d. samples (minibatches) of the data are used to keep the computation cost of each gradient step reasonable.
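A minimal sketch of minibatch SGD for linear least squares; the model, loss, batch size, and learning rate are illustrative assumptions, not from the notes:

```python
import numpy as np

def minibatch_sgd(X, y, w, lr=0.01, batch_size=32, epochs=10):
    """Each step estimates the full gradient from a small i.i.d.
    sample of the training data instead of the whole dataset."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            idx = rng.choice(n, size=batch_size, replace=False)  # i.i.d. sample
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # squared-loss gradient
            w = w - lr * grad
    return w
```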
There are several choices for scheduling the learning rate $\eta$; it is usually updated at fixed intervals or when progress in optimization stalls or gets too slow.
- piecewise constant: $\eta(t) = \eta_i$ if $t_i \leq t \leq t_{i+1}$
- exponential decay: $\eta(t) = \eta_0 e^{-\lambda t}$
- polynomial decay: $\eta(t) = \eta_0(\beta t + 1)^{-\alpha}$ where usually $\alpha = 0.5$
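A minimal sketch of these three schedules; the boundary points, rates, and decay constants are illustrative:

```python
import numpy as np

def piecewise_constant(t, boundaries=(10, 20), rates=(0.1, 0.01, 0.001)):
    """eta(t) = eta_i on the interval between consecutive boundaries."""
    return rates[int(np.searchsorted(boundaries, t, side='right'))]

def exponential_decay(t, eta0=0.1, lam=0.01):
    """eta(t) = eta_0 * exp(-lambda * t)."""
    return eta0 * np.exp(-lam * t)

def polynomial_decay(t, eta0=0.1, beta=0.1, alpha=0.5):
    """eta(t) = eta_0 * (beta * t + 1)^(-alpha)."""
    return eta0 * (beta * t + 1) ** (-alpha)

for t in (0, 10, 100):
    print(piecewise_constant(t), exponential_decay(t), polynomial_decay(t))
```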
Here a velocity term accumulates a leaky (exponentially weighted) sum of past gradients:
$v_t = \sum_{\tau=0}^{t-1} \beta^\tau g_{t-\tau, t-\tau-1} = g_{t,t-1} + \beta v_{t-1}$
where $g_{t,t-1}$ is the gradient at step $t$ evaluated at the parameters from step $t-1$. The update then uses the velocity in place of the raw gradient: $x_t = x_{t-1} - \eta v_t$.
Accelerated (momentum) optimization is especially effective for ill-conditioned problems (here meaning the objective function is much more curved in one direction than in others). Consider, for example, an objective like $f(x) = 0.1 x_1^2 + 2 x_2^2$: gradient descent makes very slow progress along $x_1$, and if we choose a higher learning rate to make training faster, $x_2$ overshoots and oscillates as a result.
With momentum (if $\beta > 0$), consecutive gradients along the oscillating $x_2$ direction partially cancel, while progress along the well-aligned $x_1$ direction accumulates, so a larger learning rate can be used safely.
Typically $\beta = 0.9$ is a common choice; smaller values such as $0.5$ also work.
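A minimal sketch of the momentum update on such a skewed quadratic; the objective and hyperparameters are illustrative:

```python
import numpy as np

def momentum_gd(grad, x0, lr=0.6, beta=0.5, steps=50):
    """Gradient descent with momentum: v accumulates a leaky sum
    of past gradients, and the step uses v instead of g."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)   # v_t = g_{t,t-1} + beta * v_{t-1}
        x = x - lr * v           # x_t = x_{t-1} - eta * v_t
    return x

# Skewed quadratic f(x) = 0.1*x1^2 + 2*x2^2; minimum at the origin.
grad = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
print(momentum_gd(grad, [-5.0, -2.0]))
```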
In some cases, such as in NLP, recommender systems, etc., we deal with sparse features. For example, in NLP the word 'heterogeneous' appears only rarely in the training data. Also note that we generally reduce the learning rate as training goes on (usually at a rate of $\mathcal{O}(t^{-1/2})$ or slower). As a result, parameters tied to frequent features converge quickly, while infrequent features receive too few updates before the learning rate has already decayed.
A hack to address this issue is, instead of scheduling the learning rate globally as $\eta = \frac{\eta_0}{\sqrt{t + c}}$, to keep a counter $s(i, t)$ of how many times feature $i$ has been seen up to time $t$ and use the per-coordinate rate $\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$.
This is nice, but it is tricky to judge when a feature counts as having been seen, with which magnitude, etc. Adagrad uses the same idea differently: it adjusts the learning rate for each coordinate based on the gradients it has seen so far in that coordinate, accumulating $s_t = s_{t-1} + g_t^2$ and updating $x_t = x_{t-1} - \frac{\eta}{\sqrt{s_t + \epsilon}} \odot g_t$.
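A minimal Adagrad sketch under this standard update; the hyperparameters are illustrative:

```python
import numpy as np

def adagrad(grad, x0, lr=0.1, eps=1e-6, steps=100):
    """Adagrad: per-coordinate accumulation of squared gradients;
    coordinates with large past gradients get smaller steps."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        s += g ** 2                      # s_t = s_{t-1} + g_t^2 (unnormalized)
        x -= lr * g / np.sqrt(s + eps)   # per-coordinate rescaled step
    return x
```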
One issue with Adagrad is that the state variable accumulates squared gradients per coordinate without any normalization, which can grow without bound and drive the effective learning rate toward zero. The idea in RMSProp is to use a leaky average to address this:
$s_t = \lambda s_{t-1} + (1-\lambda) g_t^2$
$x_t = x_{t-1} - \frac{\eta}{\sqrt{s_t + \epsilon}} \odot g_t$
With this normalization, we can schedule the learning rate $\eta$ independently of the per-coordinate scaling.
So RMSProp is similar to Adagrad in that it preconditions coordinate-wise, but it uses a leaky average for the accumulation.
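A minimal RMSProp sketch; the hyperparameters are illustrative:

```python
import numpy as np

def rmsprop(grad, x0, lr=0.01, lam=0.9, eps=1e-6, steps=100):
    """RMSProp: like Adagrad, but the squared-gradient state is a
    leaky average, so it stays bounded and lr can be scheduled freely."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        s = lam * s + (1 - lam) * g ** 2   # leaky average of g^2
        x -= lr * g / np.sqrt(s + eps)     # preconditioned step
    return x
```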
Adadelta is another variation of Adagrad. It uses a leaky average of the second moment and also a change of variables for the step updates. It is known for not having a learning rate.
This is like RMSProp; the difference is that the update step uses the rescaled gradient $g'_t$:
$x_t = x_{t-1} - g'_t$
$g'_t = \frac{\sqrt{\Delta x_{t-1} + \epsilon}}{\sqrt{s_t + \epsilon}} \odot g_t$
where $s_t$ is the leaky average of squared gradients as in RMSProp, and $\Delta x_t = \rho \Delta x_{t-1} + (1-\rho) g_t'^2$ is a leaky average of the squared rescaled gradients.
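A minimal Adadelta sketch following these updates; $\rho$ and $\epsilon$ values are illustrative:

```python
import numpy as np

def adadelta(grad, x0, rho=0.9, eps=1e-5, steps=100):
    """Adadelta: no learning rate; the step size is calibrated by a
    leaky average (delta) of previous squared rescaled updates."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)       # leaky average of g^2 (as in RMSProp)
    delta = np.zeros_like(x)   # leaky average of g'^2
    for _ in range(steps):
        g = grad(x)
        s = rho * s + (1 - rho) * g ** 2
        g_prime = np.sqrt((delta + eps) / (s + eps)) * g   # rescaled gradient
        x -= g_prime
        delta = rho * delta + (1 - rho) * g_prime ** 2
    return x
```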
Adam uses all the tricks before for a robust optimization algorithm. It keeps leaky averages of both the gradient (first moment) and its square (second moment):
$v_t = \beta_1 v_{t-1} + (1-\beta_1) g_t$
$s_t = \beta_2 s_{t-1} + (1-\beta_2) g_t^2$
Common choices are $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
Normalized (bias-corrected) state variables are used for updates, since initializing $v_0 = s_0 = 0$ biases the averages toward zero early in training:
$\hat{v}_t = \frac{v_t}{1-\beta_1^t}, \quad \hat{s}_t = \frac{s_t}{1-\beta_2^t}$
The rescaled gradient is computed with
$g'_t = \frac{\eta \hat{v}_t}{\sqrt{\hat{s}_t} + \epsilon}$
The update step now is:
$x_t = x_{t-1} - g'_t$
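A minimal Adam sketch combining these pieces; the defaults shown are the common choices above:

```python
import numpy as np

def adam(grad, x0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam: leaky averages of the gradient and its square,
    bias-corrected, then a per-coordinate rescaled step."""
    x = np.asarray(x0, dtype=float)
    v, s = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        v = beta1 * v + (1 - beta1) * g            # first moment
        s = beta2 * s + (1 - beta2) * g ** 2       # second moment
        v_hat = v / (1 - beta1 ** t)               # bias correction
        s_hat = s / (1 - beta2 ** t)
        x -= lr * v_hat / (np.sqrt(s_hat) + eps)   # rescaled update
    return x
```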
There are many ways to schedule the learning rate, and the theory of why scheduling helps is not well understood. But there are cases where:
- an initial warm-up can be useful: the idea is that parameter initialization might not be good enough, so for the first few epochs one starts from a small learning rate, increases it linearly to the value originally wanted, and then lets the scheduler kick in.
- the scheduling itself can be polynomial, factor-based, or piecewise constant, among many others (a combined warm-up sketch follows below).
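A minimal sketch of linear warm-up followed by one possible scheduler (exponential decay here); the constants and the choice of decay are illustrative, not prescriptive:

```python
import numpy as np

def warmup_then_decay(t, eta_max=0.1, warmup_steps=5, lam=0.05):
    """Linearly increase the learning rate to eta_max over the first
    warmup_steps, then let an exponential-decay scheduler kick in."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps          # linear warm-up
    return eta_max * np.exp(-lam * (t - warmup_steps))   # scheduler kicks in

print([round(warmup_then_decay(t), 4) for t in range(10)])
```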