Add the Adam optimizer from [Kingma et al., 2014](http://arxiv.org/abs/1412.6980). #264
Add the Adam optimizer from Kingma et al., 2014.
Some specific design decisions were made that differ from Keras/Optax.
We follow `optax` and `tensorflow`'s Adam optimizer's setting (see google-deepmind/optax#571), which differs from the original paper. We do correct for the bias, consistent with optax/pytorch.
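For reference, the bias-corrected update has the textbook form sketched below. This is a minimal, illustrative sketch in plain JAX; the function name and signature are invented for this example, and the exact epsilon placement in the PR follows optax/tensorflow rather than necessarily matching this sketch.

```python
import jax.numpy as jnp

def adam_update(grad, m, v, count, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Hypothetical reference step, not the kernel added in this PR.
    count = count + 1
    m = b1 * m + (1.0 - b1) * grad                 # first-moment EMA
    v = b2 * v + (1.0 - b2) * jnp.square(grad)     # second-moment EMA
    m_hat = m / (1.0 - b1 ** count)                # bias correction
    v_hat = v / (1.0 - b2 ** count)
    update = lr * m_hat / (jnp.sqrt(v_hat) + eps)
    return update, m, v, count
```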
Keras's Adam offers `amsgrad: bool` as an option, which changes how the variable is updated, keeping track of the maximum velocity encountered. However, this would lead to an additional state parameter (`v_max`) and conditionally changes the number of slot variables. Slot variables are particularly expensive in large embedding lookups (each is the size of the entire sharded table), and supporting this would require a different underlying primitive anyways. If we need the option, we can create a new optimizer. This is consistent with optax, which has a separate `optax.amsgrad` optimizer.
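To illustrate why `amsgrad` would add a slot variable, a rough AMSGrad-style step is sketched below. It is illustrative only (names are invented for this example, and real implementations differ in how they combine the running max with bias correction); the point is the extra `v_max` buffer, which is the same shape as the table itself.

```python
import jax.numpy as jnp

def amsgrad_update(grad, m, v, v_max, count,
                   lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Note the extra `v_max` state, sized like the (sharded) table itself.
    count = count + 1
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * jnp.square(grad)
    v_max = jnp.maximum(v_max, v)                   # running max of the velocity
    m_hat = m / (1.0 - b1 ** count)
    update = lr * m_hat / (jnp.sqrt(v_max) + eps)   # denominator uses the max
    return update, m, v, v_max, count
```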
We also do not add a `nesterov: bool` option. Similar to `amsgrad`, this modifies the update rule. Technically the Nesterov modification also adds a step-dependent `beta_1` parameter and requires an additional state variable to keep track of the accumulated product, something Optax currently ignores. Keras handles this with a different optimizer, `keras.optimizers.Nadam`, which does add the additional state variable. PyTorch also has a separate `torch.optim.NAdam` specifically for this.
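For comparison, a simplified Nesterov-style Adam step is sketched below. It is illustrative only: the name is invented for this example, it uses the usual bias-corrected moments plus a one-step look-ahead on the momentum, and it deliberately omits the step-dependent `beta_1` schedule (and its accumulated product) that a full Nadam keeps as extra state.

```python
import jax.numpy as jnp

def nesterov_adam_update(grad, m, v, count,
                         lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Hypothetical sketch; omits the step-dependent beta_1 schedule of Nadam.
    count = count + 1
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * jnp.square(grad)
    m_hat = m / (1.0 - b1 ** count)
    v_hat = v / (1.0 - b2 ** count)
    # Look ahead by blending the corrected momentum with the current gradient.
    m_nesterov = b1 * m_hat + (1.0 - b1) * grad / (1.0 - b1 ** count)
    update = lr * m_nesterov / (jnp.sqrt(v_hat) + eps)
    return update, m, v, count
```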