I saw that you are using AdoptAtan2 in your recent RL codebase, so I assume it is working well for you there. I ran into a failure case, though, and want to report it so you are aware of it.
I tried both AdamAtan2 and AdoptAtan2 on boolean function in-context learning (repo: https://github.com/satwik77/incontext-bool). AdamAtan2 works very well, but AdoptAtan2 does not converge, despite trying several different configurations.
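To make the comparison concrete, here is roughly the kind of side-by-side check I mean (a toy sketch, not the actual incontext-bool setup; I am assuming both classes are importable from `adam_atan2_pytorch` and accept the keyword arguments I use in the modification below):

```python
# Toy sketch only -- not the actual repro (see the commands further down).
# Assumes AdamAtan2 / AdoptAtan2 are exported by the adam_atan2_pytorch package.
import torch
import torch.nn.functional as F
from adam_atan2_pytorch import AdamAtan2, AdoptAtan2

def run(opt_name, steps=2000, lr=1e-3):
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(28, 256), torch.nn.GELU(), torch.nn.Linear(256, 1)
    )
    if opt_name == 'adam_atan2':
        opt = AdamAtan2(model.parameters(), lr=lr, betas=(0.9, 0.95))
    else:
        opt = AdoptAtan2(model.parameters(), lr=lr, betas=(0.9, 0.999))
    for _ in range(steps):
        x = torch.randint(0, 2, (64, 28)).float()
        y = x[:, :3].prod(dim=-1, keepdim=True)  # label = conjunction of the first 3 bits
        loss = F.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

for name in ('adam_atan2', 'adopt_atan2'):
    print(name, run(name))
```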
Modification to `src/train.py`, inside `configure_optimizers`:
```python
# filter out those that do not require grad
param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
# create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
# i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
optim_groups = [
    {'params': decay_params, 'weight_decay': weight_decay},
    {'params': nodecay_params, 'weight_decay': 0.0}
]
num_decay_params = sum(p.numel() for p in decay_params)
num_nodecay_params = sum(p.numel() for p in nodecay_params)
print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")

if optimizer == 'adopt_atan2':
    # adopt is insensitive to beta2
    # optimizer = AdoptAtan2(optim_groups, lr=args.learning_rate, betas=(betas[0], 0.999))
    optimizer = AdoptAtan2(model.parameters(), weight_decay=weight_decay, lr=args.learning_rate,
                           betas=(betas[0], 0.999), cautious_factor=0.)
elif optimizer == 'adam_atan2':
    # decoupled_wd does not work
    optimizer = AdamAtan2(optim_groups, lr=args.learning_rate, betas=betas)
elif optimizer == 'adamw':
    # create AdamW optimizer and use the fused version if it is available
    fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
    use_fused = fused_available and device_type == 'cuda'
    extra_args = dict(fused=True) if use_fused else dict()
    print(f"using fused AdamW: {use_fused}")
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
else:
    raise ValueError(f"Unknown optimizer {optimizer}")
return optimizer
```
and in `train()`:

```python
def train(model, args):
    optimizer = configure_optimizers(model, optimizer=args.optimizer,
                                     learning_rate=args.learning_rate,
                                     weight_decay=args.weight_decay)
```
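One asymmetry in my modification worth noting: the `adopt_atan2` branch passes raw `model.parameters()` with a single `weight_decay`, while the `adam_atan2` and `adamw` branches use the decay/no-decay `optim_groups`. The commented-out line is the per-group form; written out it would look like the sketch below (assuming `AdoptAtan2` accepts standard param-group dicts the way the torch optimizers do). Since the reproduce command below passes `-weight_decay 0`, the grouping should not matter for the failure itself.

```python
# Per-group variant of the adopt_atan2 branch (sketch; assumes AdoptAtan2
# accepts the same param-group dicts as torch.optim optimizers).
optimizer = AdoptAtan2(
    optim_groups,
    lr=args.learning_rate,
    betas=(betas[0], 0.999),  # ADOPT is reported to be insensitive to beta2
    cautious_factor=0.,
)
```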
Reproduce by running (it does not converge):

```bash
python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.001 -optimizer adopt_atan2 -weight_decay 0 -gpu 2
```
Comparing to AdamAtan2:

```bash
python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.0001 -optimizer=adam_atan2 -gpu 3
```
and AdamW:

```bash
python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.0001 -optimizer=adamw -gpu 2
```
PS: I have tried the following additional parameter settings and none of them converges:
- `cautious_factor`: 0.
- `weight_decay`: 0.1, 0
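Roughly, the AdoptAtan2 settings I swept over are the following (illustrative sketch using the same names as the snippet above):

```python
# Hypothetical sketch of the settings swept; none of these runs converged.
for wd in (0.1, 0.0):
    optimizer = AdoptAtan2(
        model.parameters(),
        lr=args.learning_rate,
        betas=(betas[0], 0.999),
        weight_decay=wd,
        cautious_factor=0.,
    )
    # ... rerun the adopt_atan2 command above with this optimizer ...
```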