A comprehensive collection of optimization algorithms for deep learning, designed for efficiency, a minimal memory footprint, and strong performance across diverse model architectures and training scenarios.
Install with `pip install adv_optm`.

This library integrates multiple state-of-the-art optimization techniques, validated through extensive research and practical training, with 1-bit compression for optimizer states:
- Paper: SMMF: Square-Matricized Momentum Factorization
- Approach: Uses rank-1 non-negative matrix factorization with a reconstruction cycle (factor → reconstruct → update → factor)
- Innovation:
  - First moment split into a 1-bit sign + absolute value
  - Final storage: four factored vectors + one 1-bit sign state
  - Preserves Adam-like update quality with drastically reduced memory
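To make the storage layout concrete, here is a minimal, library-independent sketch of one factor → reconstruct → update → factor cycle for the first moment. It is an illustration of the idea only: the rank-1 factorization below uses Adafactor-style row/column sums, which is an assumption about the exact scheme, and in the real optimizer the sign tensor would be bit-packed rather than stored as booleans.

```python
import torch

def factor_rank1(nonneg: torch.Tensor):
    """Compress a non-negative (n, m) matrix into two vectors whose outer
    product approximates it (Adafactor-style rank-1 factorization; assumed here)."""
    row = nonneg.sum(dim=1)                          # (n,)
    col = nonneg.sum(dim=0)                          # (m,)
    total = nonneg.sum().clamp_min(1e-30)
    return row, col / total

def reconstruct(row: torch.Tensor, col: torch.Tensor):
    return torch.outer(row, col)

torch.manual_seed(0)
grad = torch.randn(4, 3)
beta1 = 0.9

# Persistent state: two factor vectors for |m| plus a 1-bit sign per element.
m_row, m_col = factor_rank1(torch.zeros(4, 3))
m_sign = torch.zeros(4, 3, dtype=torch.bool)

# One optimizer step:
m = (m_sign.to(grad.dtype) * 2 - 1) * reconstruct(m_row, m_col)  # reconstruct signed moment
m = beta1 * m + (1 - beta1) * grad                               # Adam-style EMA update
m_sign = m > 0                                                   # re-factor: store the sign ...
m_row, m_col = factor_rank1(m.abs())                             # ... and the factored magnitude
```

The "four factored vectors + one 1-bit sign state" storage above would then correspond to two such vector pairs (first-moment magnitude and second moment) plus the single sign tensor.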
| Optimizer | Memory Usage | Description |
|---|---|---|
| Adopt_Factored | 328 MB | 4 small vectors + 1-bit state |
| Adopt_Factored + AdEMAMix | 625 MB | 6 small vectors + two 1-bit states |
| Simplified_AdEMAMix | 328 MB | Same as standard factored (no extra state) |
| Optimizer | Speed | Notes |
|---|---|---|
| Adafactor | ~8.5 s/it | Baseline |
| Adopt_Factored | ~10 s/it | +18% overhead from compression |
| Adopt_Factored + AdEMAMix | ~12 s/it | +41% overhead (3 factored states) |
| Optimizer | Description | Best For |
|---|---|---|
| Adam_Adv | Advanced Adam implementation | General purpose |
| Adopt_Adv | Adam variant with independent beta2 | Stable training in small-batch regimes |
| Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| Simplified_AdEMAMix | Adam variant with accumulator momentum | Small- or large-batch training when tuned correctly |
| Lion_Adv | Advanced Lion implementation | Memory-constrained environments |
| Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning |
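A minimal usage sketch, assuming the optimizers follow the standard `torch.optim` interface. The class name `Prodigy_Adv` comes from the table above, but the import path and default behavior are assumptions, so check the library's own examples for the exact API.

```python
import torch
import torch.nn as nn
from adv_optm import Prodigy_Adv  # import path assumed

model = nn.Linear(128, 10)
# Prodigy-style optimizers typically tune the step size themselves, so lr is
# left at 1.0 here (an assumption; consult the library defaults).
optimizer = Prodigy_Adv(model.parameters(), lr=1.0)

loss_fn = nn.CrossEntropyLoss()
data, target = torch.randn(32, 128), torch.randint(0, 10, (32,))

for step in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    optimizer.step()
```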
| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
|---|---|---|---|---|---|
| Factored | ✅ | ✅ | ✅ | ✅ | ✅ |
| AdEMAMix | ✅ | ✅ | ✅ | ❌ | ❌ |
| Simplified_AdEMAMix | ✅ | ✅ | ✅ | ✅ | ❌ |
| OrthoGrad | ✅ | ✅ | ✅ | ✅ | ✅ |
| Grams | ✅ | ✅ | ✅ | ❌ | ❌ |
| Cautious | ✅ | ✅ | ✅ | ❌ | ✅ |
| atan2 | ✅ | ✅ | ✅ | ❌ | ❌ |
| Stochastic Rounding | ✅ | ✅ | ✅ | ✅ | ✅ |
| Fused Backward Pass | ✅ | ✅ | ✅ | ✅ | ✅ |
| Kourkoutas-β | ✅ | ✅ | ✅ | ✅ | ❌ |
These features work with all optimizers and are generally safe to enable.
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Fused Backward Pass | Fuses the optimizer step into the backward pass; gradients are used immediately and their memory is freed on the fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers |
| Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal overhead (<5%) | Revisiting BFloat16 Training | All optimizers |
| OrthoGrad | Removes the gradient component parallel to the weights to reduce overfitting | Full fine-tuning without weight decay | +33% time overhead (BS=4); less at larger BS | Grokking at the Edge of Numerical Stability | All optimizers |
| Factored | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers |
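The two rows that are easiest to misread are OrthoGrad and Stochastic Rounding, so here is a conceptual sketch of both operations in plain PyTorch. This is not the library's code: the bit-twiddling route to stochastic BF16 rounding is one common implementation, and OrthoGrad variants may additionally rescale the projected gradient to preserve its norm.

```python
import torch

def orthograd(param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Remove the component of grad that is parallel to the weights."""
    w, g = param.flatten(), grad.flatten()
    proj = torch.dot(w, g) / (torch.dot(w, w) + 1e-30)
    return (g - proj * w).view_as(grad)

def stochastic_round_bf16(x: torch.Tensor) -> torch.Tensor:
    """Round FP32 -> BF16 stochastically: add random bits to the 16 bits BF16
    discards, then truncate, so small updates survive on average."""
    bits = x.float().view(torch.int32)
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    bits = (bits + noise) & -65536      # -65536 == 0xFFFF0000: clear the low 16 bits
    return bits.view(torch.float32).to(torch.bfloat16)

w, g = torch.randn(8), torch.randn(8)
g_orth = orthograd(w, g)
print(torch.dot(g_orth, w))                                  # ~0: no component along the weights
print(stochastic_round_bf16(torch.full((4,), 1.0 + 1e-4)))   # rounds up only part of the time
```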
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Cautious | Only applies update if gradient direction aligns with momentum direction | Accelerating convergence | No overhead | C-Optim | Adam/Adopt/Prodigy/Lion |
| Grams | Update direction derived purely from current gradient | When Cautious is insufficient | No overhead | Grams | Adam/Adopt/Prodigy |
| AdEMAMix | Dual EMA system that retains relevance of gradients over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state memory | AdEMAMix | Adam/Adopt/Prodigy |
| Simplified_AdEMAMix | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | Connections | Adam/Adopt/Prodigy |
| atan2 | Robust `eps` replacement with built-in gradient clipping | Stable, bounded updates (effectively required for Adopt) | No overhead | Adam-atan2 | Adam/Adopt/Prodigy |
| Kourkoutas-β | Layer-wise adaptive β₂ based on the gradient "sunspike" ratio | Noisy / small-batch / large-batch / high-LR training | No overhead | Kourkoutas-β | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
Note: If both Cautious and Grams are enabled, Grams takes precedence and Cautious is disabled.
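Both Cautious and Grams are cheap element-wise modifications of the final update. The sketch below shows the logic as described in the table above, not the library's actual code; the rescaling of the Cautious mask follows the C-Optim paper and is an assumption about this implementation.

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Keep only update components whose direction agrees with the current gradient."""
    mask = ((update * grad) > 0).to(update.dtype)
    mask = mask * (mask.numel() / mask.sum().clamp_min(1.0))  # keep the overall scale comparable
    return update * mask

def grams(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Keep the adaptive magnitude, but take the direction purely from the gradient."""
    return update.abs() * grad.sign()

u = torch.tensor([0.10, -0.20, 0.05])
g = torch.tensor([1.00,  0.50, -0.30])
print(cautious(u, g))   # second and third entries are zeroed (signs disagree)
print(grams(u, g))      # magnitudes of u, signs of g
```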
- Adds a slow-decaying second EMA (`beta3`) that retains gradient memory over tens of thousands of steps.
- Particularly effective for small batch sizes, where Adam's standard first moment is nearly useless.
| Parameter | Default | Tuning Guide |
|---|---|---|
| `beta3` | 0.9999 | • Runs >120k steps: 0.9999 • Runs ≤120k steps: 0.999 |
| `alpha` | 5 | • Reduce to 2–3 if diverging • Increase to strengthen long-term memory |
✅ Pro Tip: Set `beta1=0` in Adam/Adopt/Prodigy to skip the standard EMA entirely and rely solely on AdEMAMix's slow EMA; ideal for small-batch regimes.
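In update terms, AdEMAMix keeps Adam's fast first moment and adds the slow EMA to the numerator scaled by `alpha`. A simplified single step is sketched below (bias correction and the schedulers the original paper applies to `alpha`/`beta3` are omitted); with `beta1=0`, as in the pro tip above, `m1` reduces to the raw gradient.

```python
import torch

beta1, beta2, beta3 = 0.9, 0.999, 0.9999
alpha, lr, eps = 5.0, 1e-4, 1e-8

param = torch.randn(10)
m1 = torch.zeros(10)   # fast EMA (standard Adam first moment)
m2 = torch.zeros(10)   # slow EMA added by AdEMAMix
v = torch.zeros(10)    # second moment

grad = torch.randn(10)

m1 = beta1 * m1 + (1 - beta1) * grad
m2 = beta3 * m2 + (1 - beta3) * grad        # remembers gradients for roughly 1/(1-beta3) steps
v = beta2 * v + (1 - beta2) * grad * grad

# the slow EMA enters the numerator scaled by alpha
param = param - lr * (m1 + alpha * m2) / (v.sqrt() + eps)
```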
- Introduced in Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431).
- Replaces Adam's first moment with a theory-backed accumulator momentum that emphasizes the raw gradient, combining the stability of long memory with responsiveness to recent gradients.
- Key insight: classical momentum does not accelerate in noisy (small-batch) regimes; this accumulator does.
| Parameter | Default | Tuning Guide |
|---|---|---|
| `beta1` | 0.99 | Controls accumulator memory length: • Small BS: 0.99–0.9999 • Large BS: 0.9 |
| `Grad α` | 100 | Most critical parameter: • Inversely scales with batch size • 100–10 for small BS (≤32) • 1–0.1 for large BS (≥512) |
⚠️ Critical: Requires a ~100x smaller learning rate than AdamW (e.g., 1e-6 vs 1e-4).

For `Prodigy_Adv`, set `initial_d` to:
- LoRA: `1e-8`
- Full FT: `1e-10`
- Embedding: `1e-7`

⚠️ Incompatible with: Cautious, Grams, atan2, and standard update clipping.
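Putting the warnings together, a configuration sketch for small-batch training might look like the following. The class names come from this README, but the import path and keyword names (`grad_alpha`, `simplified_ademamix`, `initial_d`) are assumptions about the API; the values are taken from the tuning tables above.

```python
import torch.nn as nn
from adv_optm import Simplified_AdEMAMix, Prodigy_Adv  # import path assumed

model = nn.Linear(128, 10)

# Standalone use: note the ~100x smaller learning rate than a typical AdamW setup.
opt = Simplified_AdEMAMix(
    model.parameters(),
    lr=1e-6,           # vs. ~1e-4 for AdamW
    beta1=0.999,       # small batch: longer accumulator memory
    grad_alpha=100,    # assumed name for "Grad α"; scale it down as batch size grows
)

# With Prodigy_Adv handling LR tuning, initial_d must be lowered as listed above.
opt_prodigy = Prodigy_Adv(
    model.parameters(),
    simplified_ademamix=True,  # assumed flag name for enabling this feature
    initial_d=1e-10,           # full fine-tuning
)
```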
- Replaces `eps` in Adam-family optimizers with a scale-invariant, bounded update rule.
- Automatically clips updates to [-2, 2], preventing destabilizing jumps.
- Highly recommended for `Adopt_Adv`, which is prone to instability without clipping.
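The sketch below contrasts the standard `eps`-based step with an atan2-style step. Exact scaling constants are omitted (the Adam-atan2 paper multiplies by a fixed factor, which is how the bounded [-2, 2] range reported above arises); the point is that `atan2` stays bounded even when the second moment underflows.

```python
import torch

m = torch.tensor([1e-3,  5e-2, -2e-2])   # bias-corrected first moment
v = torch.tensor([1e-18, 1e-3,  4e-4])   # bias-corrected second moment
lr, eps = 1e-4, 1e-8

standard = lr * m / (v.sqrt() + eps)      # first entry blows up as v -> 0
atan2_up = lr * torch.atan2(m, v.sqrt())  # bounded ratio, no eps needed

print(standard)   # first component is enormous relative to the others
print(atan2_up)   # every component stays within lr * pi/2 (before any scaling constant)
```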
Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv, and Simplified_AdEMAMix.
Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded "sunspike" ratio:
- During gradient bursts → β₂ decreases toward a lower β₂ → faster reaction
- During calm phases → β₂ increases toward the selected β₂ → stronger smoothing
This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
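A rough sketch of the per-layer modulation described above. The exact formula and names are assumptions (in particular `beta2_min`, the norm-EMA decay, and the precise definition of the sunspike ratio); it is only meant to show how bursts pull β₂ down and calm phases let it return toward the configured value.

```python
import torch

def kourkoutas_beta2(grad: torch.Tensor, norm_ema: float,
                     beta2_max: float = 0.999,   # the configured beta2
                     beta2_min: float = 0.88,    # assumed floor during bursts
                     norm_decay: float = 0.99,
                     eps: float = 1e-8):
    """Return (layer-wise beta2, updated gradient-norm EMA) for one layer."""
    g_norm = grad.norm().item()
    r = g_norm / (norm_ema + eps)        # >> 1 during a gradient burst
    sunspike = r / (1.0 + r)             # bounded to [0, 1)
    beta2 = beta2_max - sunspike * (beta2_max - beta2_min)
    norm_ema = norm_decay * norm_ema + (1 - norm_decay) * g_norm
    return beta2, norm_ema

norm_ema = 10.0                          # rough initial gradient-norm estimate for this layer
for scale in (1.0, 1.0, 10.0, 1.0):      # third step simulates a gradient burst
    beta2, norm_ema = kourkoutas_beta2(scale * torch.randn(100), norm_ema)
    print(round(beta2, 4))               # dips toward beta2_min on the burst step
```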
| Category | Details |
|---|---|
| ✅ Pros | • Layer-wise adaptation blends the benefits of high β₂ (strong smoothing) and low β₂ (fast reaction). • Robust to sudden loss-landscape shifts: reacts quickly during gradient bursts, smooths during calm phases. • High tolerance to aggressive learning rates. |
| ❌ Cons | • Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using K-β warmup steps (`K_warmup_steps`). |
💡 Best Practice: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.