Closed
Description
Describe the feature and the current behavior/state.
The LAMB optimizer has this option (`exclude_from_weight_decay`), but AdamW does not. It is necessary for training transformer models with AdamW, where weight decay is typically not applied to LayerNorm and bias parameters.
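As a rough illustration of what the requested option does, here is a minimal sketch of name-based exclusion, assuming the same pattern-matching convention LAMB uses: a variable receives decoupled weight decay only if its name matches none of the exclusion patterns. The `should_decay` helper and the parameter names are hypothetical, for illustration only.

```python
import re

def should_decay(var_name, exclude_from_weight_decay=("layer_norm", "bias")):
    """Return True if weight decay should be applied to this variable.

    A variable is excluded when any pattern matches its name, mirroring
    the name-based filtering used by exclude_from_weight_decay in LAMB.
    """
    for pattern in exclude_from_weight_decay:
        if re.search(pattern, var_name):
            return False
    return True

# Toy parameters keyed by name; only the kernel should be decayed.
params = {"dense/kernel": 1.0, "dense/bias": 0.5, "layer_norm/gamma": 1.0}
lr, weight_decay = 0.01, 0.1
for name in params:
    if should_decay(name):
        # Decoupled weight decay: shrink the weight directly,
        # independent of the gradient-based Adam update.
        params[name] -= lr * weight_decay * params[name]
```

With such an option on AdamW, users would pass the exclusion patterns at construction time instead of filtering variables manually.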
Relevant information
- Are you willing to contribute it (yes/no): no
- Are you willing to maintain it going forward? (yes/no): no
- Is there a relevant academic paper? (if so, where):
- Is there already an implementation in another framework? (if so, where): Yes, in TensorFlow 1.x.
- Was it part of tf.contrib? (if so, where): no
Which API type would this fall under (layer, metric, optimizer, etc.)?
Optimizer
Who will benefit with this feature?
Users training NLP models that use LayerNorm.