This work explores improvements to the hypergradient-based optimizers proposed in the paper "Online Learning Rate Adaptation with Hypergradient Descent". The accompanying report summarises the work and can be found here.
The method proposed in "Online Learning Rate Adaptation with Hypergradient Descent" automatically adjusts the learning rate to minimize an estimate of the expected loss by introducing the "hypergradient": the gradient of the loss function with respect to the hyperparameter eta, the optimizer's learning rate. At each training iteration the step size is updated by a gradient-descent step on this hypergradient, and the method is applied alongside the model optimizers SGD, SGD with Nesterov momentum (SGDN) and Adam, yielding their hypergradient counterparts SGD-HD, SGDN-HD and Adam-HD, which converge faster and generalize better than the plain optimizers.
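For reference, a minimal sketch of the SGD-HD rule is given below, using the hypergradient from the paper (the negative dot product of the current and previous parameter gradients). The function and variable names are illustrative only and do not correspond to the classes in hypergrad/.

```python
import torch

def sgd_hd_step(params, grads, prev_grads, lr, beta):
    """Sketch of one SGD-HD step: adapt the learning rate with the
    hypergradient, then take a plain SGD step on the parameters."""
    # Hypergradient of the loss w.r.t. the learning rate:
    #   d f(theta_t) / d alpha = -grad_t . grad_{t-1}
    h = -sum(torch.dot(g.flatten(), pg.flatten())
             for g, pg in zip(grads, prev_grads)).item()
    # Gradient-descent step on the learning rate itself
    lr = lr - beta * h
    # SGD step on the model parameters with the adapted learning rate
    for p, g in zip(params, grads):
        p.data.add_(g, alpha=-lr)
    return lr
```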
We expect, however, that the hypergradient-based learning rate update can itself be made more accurate, and we aim to exploit the gains further by boosting the learning rate updates with momentum and adaptive gradients. We experiment with two learning rate optimizers (sketched below):
- Hypergradient descent with momentum, and
- Adam with hypergradient,
each used alongside the model optimizers SGD, SGD with Nesterov (SGDN) and Adam.
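A rough sketch of the two learning rate update rules is shown here, reusing the same hypergradient h as above. The function names, state handling and placement of the bias correction are assumptions for illustration, not the repository's implementation.

```python
import math

def lr_step_sgdn(lr, h, beta, mu, buf):
    """Hypergradient descent with Nesterov-style momentum on the learning rate.
    buf holds the running momentum of the hypergradient (assumed state)."""
    buf = mu * buf + h
    return lr - beta * (h + mu * buf), buf

def lr_step_adam(lr, h, beta, m, v, t, b1=0.9, b2=0.999, eps=1e-8):
    """Adam applied to the learning rate, treating the hypergradient h
    as the gradient of the loss w.r.t. the learning rate."""
    m = b1 * m + (1 - b1) * h
    v = b2 * v + (1 - b2) * h * h
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    return lr - beta * m_hat / (math.sqrt(v_hat) + eps), m, v
```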
The naming convention used is {model optimizer}op-{learning rate optimizer}lop. Following this, we have {model optimizer}op-SGDNlop when the learning rate optimizer is hypergradient descent with momentum, and {model optimizer}op-Adamlop when the learning rate optimizer is Adam with hypergradient.
The new optimizers and the respective hypergradient-descent baselines against which their performance is compared are:
- SGDop-SGDNlop, with baseline SGD-HD (i.e. SGDop-SGDlop)
- SGDNop-SGDNlop, with baseline SGDN-HD (i.e. SGDNop-SGDlop)
- Adamop-Adamlop, with baseline Adam-HD (i.e. Adamop-SGDlop)
Evaluated against their hypergradient-descent baselines, the new optimizers provide:
- Better generalization
- Faster convergence
- Better training stability (less sensitivity to the initially chosen learning rate)
The alpha_0 (initial learning rate) and beta (hypergradient learning rate) configurations for the new optimizers are kept the same as those of the respective baselines from the paper (see run.sh for details). The results show that the new optimizers perform better for all three models (VGGNet, LogReg, MLP). A more detailed description of the optimizers can be found in the project report here.
Behavior of the optimizers compared with their hypergradient-descent baselines. Columns, left to right: logistic regression on MNIST, multi-layer neural network on MNIST, VGG Net on CIFAR-10.
The project is organised as follows:
.
├── hypergrad/
│ ├── __init__.py
│ ├── sgd_Hd.py # model op. sgd, l.r. optimizer Hypergradient-descent (original)
│ └── adam_Hd.py # model op. adam, l.r. optimizer Hypergradient-descent (original)
├── op_sgd_lop_sgdn.py # model op. sgd, l.r. optimizer Hypergradient-descent with momentum
├── op_sgd_lop_adam.py # model op. sgd, l.r. optimizer Adam with hypergradient
├── op_adam_lop_sgdn.py # model op. adam, l.r. optimizer Hypergradient-descent with momentum
├── op_adam_lop_adam.py # model op. adam, l.r. optimizer Adam with hypergradient
├── vgg.py
├── train.py
├── test/ # results of the experiments
├── plot_src/
├── plots/ # Experiment plots
├── run.sh # to run the experiments
.
The folders and files below are generated after running the experiments:
.
├── {model}_{optimizer}_{beta}_epochs{X}.pth # Model checkpoint
└── test/{model}/{alpha}_{beta}/{optimizer}.csv # Experiment results
The experiment configurations (hyperparameters alpha_0 and beta) are defined in run.sh for the optimizers and the three model classes. The experiments for the new optimizers use the same settings as their hypergradient-descent counterparts: LogReg (20 epochs on MNIST), MLP (100 epochs on MNIST) and VGGNet (200 epochs on CIFAR-10).
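A results file can then be inspected with a few lines of Python, for example as below; the concrete alpha/beta values, file name and column names are hypothetical and should be adjusted to the actual CSV layout produced by train.py.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path following the test/{model}/{alpha}_{beta}/{optimizer}.csv layout;
# the values and column names below are assumptions, not the repo's actual settings.
df = pd.read_csv("test/logreg/0.001_0.001/sgd_op_sgdn_lop.csv")
df.plot(x="epoch", y="train_loss")  # assumed column names
plt.show()
```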
- harshalmittal4
- yashkant
- Ankit-Dhankhar