A model-agnostic meta-learning algorithm (it only assumes the model is trained with gradient descent) that aims to find a good initialization point for the model such that it can be fine-tuned quickly on new tasks.
Shows SOTA results on few-shot image classification and regression, and fast fine-tuning for policy-gradient RL.
How does it work?
Sample a batch of tasks (each with a few training examples and "virtual" test data; the "virtual" test data is constructed from the training data).
For each task_{i}:
Compute the gradient of L(θ) w.r.t. θ on the training data and update the model θ -> θ'_{i}.
Compute L(θ'_{i}) on the virtual test data.
Compute the gradient of Σ_{i} L(θ'_{i}) w.r.t. θ (differentiating through the inner updates, which brings in second derivatives) and update the model θ.
The loss can be cross-entropy for classification, MSE for regression, or the (negative expected) reward for RL. A minimal code sketch of the inner/outer loop follows.
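Here is a minimal sketch of that loop in JAX, assuming a toy linear-regression loss and a single inner step; the function names, the `tasks` structure (tuples of `(x_tr, y_tr, x_te, y_te)` arrays), and the hyperparameter values are illustrative, not from the paper:

```python
import jax
import jax.numpy as jnp

# Toy per-task loss L(θ): mean-squared error of a linear model (illustrative only).
def loss(theta, x, y):
    return jnp.mean((x @ theta - y) ** 2)

def inner_update(theta, x_tr, y_tr, alpha=0.01):
    # One inner gradient step on the task's training data: θ -> θ'_i.
    return theta - alpha * jax.grad(loss)(theta, x_tr, y_tr)

def maml_objective(theta, tasks, alpha=0.01):
    # Sum of post-update losses on each task's "virtual" test data: Σ_i L(θ'_i).
    total = 0.0
    for x_tr, y_tr, x_te, y_te in tasks:
        theta_i = inner_update(theta, x_tr, y_tr, alpha)
        total = total + loss(theta_i, x_te, y_te)
    return total

def meta_step(theta, tasks, beta=0.001):
    # Outer update: autodiff differentiates through the inner step,
    # so the second-derivative terms are included automatically.
    return theta - beta * jax.grad(maml_objective)(theta, tasks)
```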
FOMAML
They also tried ignoring the second derivatives, i.e., treating each θ'_{i} as a constant w.r.t. θ and updating θ directly with the gradient Σ_{i} ∇L(θ'_{i}) (this also removes the need for train/test splits in the original task setup).
This is denoted First-Order MAML (FOMAML).
It performs comparably to second-order MAML and saves computation.
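One way to express the first-order approximation, reusing `loss` and the task structure from the sketch above, is to stop gradients through the inner gradient so that dθ'_i/dθ = I and the meta-gradient reduces to the post-update gradient. This stop-gradient formulation is my own illustration, not the paper's code:

```python
def fomaml_inner_update(theta, x_tr, y_tr, alpha=0.01):
    # First-order trick: treat the inner gradient as a constant w.r.t. θ,
    # so d(θ'_i)/d(θ) = I and no second derivatives appear in the meta-gradient.
    g = jax.lax.stop_gradient(jax.grad(loss)(theta, x_tr, y_tr))
    return theta - alpha * g

def fomaml_objective(theta, tasks, alpha=0.01):
    # Same outer objective as MAML, but with the first-order inner update;
    # jax.grad of this returns Σ_i ∇L(θ'_i) evaluated at θ'_i.
    total = 0.0
    for x_tr, y_tr, x_te, y_te in tasks:
        total = total + loss(fomaml_inner_update(theta, x_tr, y_tr, alpha), x_te, y_te)
    return total
```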
Introduces Reptile, a new first-order meta-learning algorithm, which works by repeatedly sampling a task, training on it, and moving the initialization towards the weights trained on that task.
Really worth reading, especially its analysis of SGD and MAML!
How does it work?
U^{k}_{T}(θ) means the parameters after taking k gradient updates on the sampled task T; ε is the outer step size. The serial update is θ ← θ + ε (U^{k}_{T}(θ) − θ) (the paper also notes that (θ − U^{k}_{T}(θ))/α can be treated as a gradient and plugged into an adaptive optimizer such as Adam). See the code sketch after this list.
We can also update in a batched version (n = number of tasks): θ ← θ + ε · (1/n) Σ_{i=1}^{n} (U^{k}_{T_i}(θ) − θ).
If we take only k = 1 update, this is essentially SGD on the expected loss.
If we take k > 1 updates, it is not: the update then depends on second- and higher-order derivatives of the loss, so it converges to a different solution.
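A minimal sketch of the serial version, again reusing the toy `loss` from the MAML snippet; for simplicity the k inner steps reuse the same data batch, whereas the paper uses fresh minibatches, and all hyperparameter values are placeholders:

```python
def sgd_k_steps(theta, x, y, k=5, alpha=0.01):
    # U^k_T(θ): k plain SGD steps on the sampled task's data.
    for _ in range(k):
        theta = theta - alpha * jax.grad(loss)(theta, x, y)
    return theta

def reptile_step(theta, task, k=5, alpha=0.01, epsilon=0.1):
    # Serial Reptile: train on one task, then move the initialization
    # a fraction ε of the way towards the trained weights.
    x, y = task
    theta_task = sgd_k_steps(theta, x, y, k, alpha)
    return theta + epsilon * (theta_task - theta)
```

The batched version would simply average (U^{k}_{T_i}(θ) − θ) over the n sampled tasks before applying the ε step.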
An experiment on 5-shot 5-way Omniglot compares outer updates built from different combinations of the inner-loop gradients.
Why does it work?
Through a Taylor-expansion analysis, both the MAML and Reptile gradients contain the same leading-order terms:
First: minimizing the expected loss (joint training on different tasks).
Second: maximizing within-task generalization, i.e., maximizing the inner products between the k gradients from the same task. If gradients from different batches have a positive inner product, then taking a gradient step on one batch improves performance on the other batch.
The result of the Taylor expansion for the SGD/Reptile and MAML meta-gradients (i ∈ [1, k]):
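For reference, my reconstruction of the k = 2 case from the paper (ḡ_i and H̄_i are the gradient and Hessian of minibatch i evaluated at the initial θ; the coefficients are from my reading of the paper, so double-check against the original):

```latex
% Leading-order expansions for k = 2, up to O(\alpha^2) terms:
\begin{aligned}
g_{\text{MAML}}    &= \bar{g}_2 - \alpha\bar{H}_2\bar{g}_1 - \alpha\bar{H}_1\bar{g}_2 + O(\alpha^2) \\
g_{\text{FOMAML}}  &= \bar{g}_2 - \alpha\bar{H}_2\bar{g}_1 + O(\alpha^2) \\
g_{\text{Reptile}} &= \bar{g}_1 + \bar{g}_2 - \alpha\bar{H}_2\bar{g}_1 + O(\alpha^2) \\[4pt]
% In expectation over tasks and minibatches:
\mathbb{E}[g_{\text{MAML}}]    &= \text{AvgGrad} - 2\alpha\,\text{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}[g_{\text{FOMAML}}]  &= \text{AvgGrad} - \alpha\,\text{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}[g_{\text{Reptile}}] &= 2\,\text{AvgGrad} - \alpha\,\text{AvgGradInner} + O(\alpha^2)
\end{aligned}
```

If I read the paper correctly, for Reptile with general k the weight of AvgGradInner relative to AvgGrad scales roughly like α(k−1)/2, i.e., it grows with the number of inner steps.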
This explains why k = 2 in the above experiment is still insufficient: relative to larger k, it puts less weight on the second (inner-product) term compared to the first term.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)
On First-Order Meta-Learning Algorithms (Reptile)
Probabilistic Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-Learning
Meta-Learning with Latent Embedding Optimization (LEO)
How to Train Your MAML (MAML++)