@@ -26,7 +26,7 @@
class DecoupledWeightDecayExtension:
"""This class allows to extend optimizers with decoupled weight decay.

- It implements the decoupled weight decay described by Loshchilov & Hutter
+ It implements the decoupled weight decay described by [Loshchilov & Hutter]
(https://arxiv.org/pdf/1711.05101.pdf), in which the weight decay is
decoupled from the optimization steps w.r.t. to the loss function.
For SGD variants, this simplifies hyperparameter search since it decouples
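The decay this extension adds is applied to the variables directly, outside the loss, rather than as an L2 penalty whose gradient flows through the optimizer. A minimal sketch of one such step (plain-SGD case; the names w, grad, lr and wd are illustrative, not taken from the module):

    import numpy as np

    w = np.array([1.0, -2.0, 3.0])      # parameters
    grad = np.array([0.1, 0.2, -0.1])   # gradient of the loss alone, no L2 term
    lr, wd = 0.01, 1e-4

    w = w - lr * grad   # optimizer step w.r.t. the loss
    w = w - wd * w      # decoupled weight decay, applied as a separate step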
@@ -343,7 +343,7 @@ class OptimizerWithDecoupledWeightDecay(
This class computes the update step of `base_optimizer` and
additionally decays the variable with the weight decay being
decoupled from the optimization steps w.r.t. to the loss
- function, as described by Loshchilov & Hutter
+ function, as described by [Loshchilov & Hutter]
(https://arxiv.org/pdf/1711.05101.pdf). For SGD variants, this
simplifies hyperparameter search since it decouples the settings
of weight decay and learning rate. For adaptive gradient
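The class in this hunk is the result of wrapping a base Keras optimizer. A hedged usage sketch, assuming the module's `extend_with_decoupled_weight_decay` factory is used to build such a class (hyperparameter values are illustrative):

    import tensorflow as tf
    import tensorflow_addons as tfa

    # Build an SGD variant with decoupled weight decay, then instantiate it.
    SGDWLike = tfa.optimizers.extend_with_decoupled_weight_decay(
        tf.keras.optimizers.SGD
    )
    opt = SGDWLike(weight_decay=1e-4, learning_rate=0.01, momentum=0.9)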
@@ -367,9 +367,8 @@ class SGDW(DecoupledWeightDecayExtension, tf.keras.optimizers.SGD):
"""Optimizer that implements the Momentum algorithm with weight_decay.

This is an implementation of the SGDW optimizer described in "Decoupled
- Weight Decay Regularization" by Loshchilov & Hutter
- (https://arxiv.org/abs/1711.05101)
- ([pdf])(https://arxiv.org/pdf/1711.05101.pdf).
+ Weight Decay Regularization" by [Loshchilov & Hutter]
+ (https://arxiv.org/pdf/1711.05101.pdf).
It computes the update step of `tf.keras.optimizers.SGD` and additionally
decays the variable. Note that this is different from adding
L2 regularization on the variables to the loss. Decoupling the weight decay
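For reference, a short usage sketch for the `SGDW` class documented above (values are illustrative, not recommendations):

    import tensorflow_addons as tfa

    # Momentum SGD whose weights are additionally decayed each step.
    opt = tfa.optimizers.SGDW(
        weight_decay=1e-4, learning_rate=0.01, momentum=0.9, nesterov=True
    )
    # model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")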
@@ -447,9 +446,8 @@ class AdamW(DecoupledWeightDecayExtension, tf.keras.optimizers.Adam):
"""Optimizer that implements the Adam algorithm with weight decay.

This is an implementation of the AdamW optimizer described in "Decoupled
- Weight Decay Regularization" by Loshchilov & Hutter
- (https://arxiv.org/abs/1711.05101)
- ([pdf])(https://arxiv.org/pdf/1711.05101.pdf).
+ Weight Decay Regularization" by [Loshchilov & Hutter]
+ (https://arxiv.org/pdf/1711.05101.pdf).

It computes the update step of `tf.keras.optimizers.Adam` and additionally
decays the variable. Note that this is different from adding L2
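And a matching sketch for the `AdamW` class documented above (again with illustrative values). Unlike adding `tf.nn.l2_loss` penalty terms to the loss, the decay here acts on the variables directly, so it is not rescaled by Adam's adaptive gradient normalization:

    import tensorflow_addons as tfa

    # Adam step plus a decoupled multiplicative decay of the variables.
    opt = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)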