I saw that you are using AdoptAtan2 in your recent RL codebase, so I assume it is working well for you there. I ran into a failure case, though, and want to report it so you are aware of it.
I tried both AdamAtan2 and AdoptAtan2 on boolean function in-context learning (repo: https://github.com/satwik77/incontext-bool). AdamAtan2 works very well, but AdoptAtan2 does not converge, despite trying several different configurations.
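To make the comparison concrete, here is roughly the kind of side-by-side check I mean (a toy sketch, not the actual incontext-bool setup; I am assuming both classes are importable from `adam_atan2_pytorch` and accept the keyword arguments I use in the modification below):

```python
# Toy sketch only -- not the actual repro (see the commands further down).
# Assumes AdamAtan2 / AdoptAtan2 are exported by the adam_atan2_pytorch package.
import torch
import torch.nn.functional as F
from adam_atan2_pytorch import AdamAtan2, AdoptAtan2

def run(opt_name, steps=2000, lr=1e-3):
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(28, 256), torch.nn.GELU(), torch.nn.Linear(256, 1)
    )
    if opt_name == 'adam_atan2':
        opt = AdamAtan2(model.parameters(), lr=lr, betas=(0.9, 0.95))
    else:
        opt = AdoptAtan2(model.parameters(), lr=lr, betas=(0.9, 0.999))
    for _ in range(steps):
        x = torch.randint(0, 2, (64, 28)).float()
        y = x[:, :3].prod(dim=-1, keepdim=True)  # label = conjunction of the first 3 bits
        loss = F.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

for name in ('adam_atan2', 'adopt_atan2'):
    print(name, run(name))
```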
Modification to `src/train.py`, inside `configure_optimizers`:
```python
# filter out those that do not require grad
param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
# create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
# i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
optim_groups = [
    {'params': decay_params, 'weight_decay': weight_decay},
    {'params': nodecay_params, 'weight_decay': 0.0}
]
num_decay_params = sum(p.numel() for p in decay_params)
num_nodecay_params = sum(p.numel() for p in nodecay_params)
print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")

if optimizer == 'adopt_atan2':
    # adopt is insensitive to beta2
    # optimizer = AdoptAtan2(optim_groups, lr=args.learning_rate, betas=(betas[0], 0.999))
    optimizer = AdoptAtan2(model.parameters(), weight_decay=weight_decay, lr=args.learning_rate,
                           betas=(betas[0], 0.999), cautious_factor=0.)
elif optimizer == 'adam_atan2':
    # decoupled_wd does not work
    optimizer = AdamAtan2(optim_groups, lr=args.learning_rate, betas=betas)
elif optimizer == 'adamw':
    # create AdamW optimizer and use the fused version if it is available
    fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
    use_fused = fused_available and device_type == 'cuda'
    extra_args = dict(fused=True) if use_fused else dict()
    print(f"using fused AdamW: {use_fused}")
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
else:
    raise ValueError(f"Unknown optimizer {optimizer}")
return optimizer
```
and in `train()`:

```python
def train(model, args):
    optimizer = configure_optimizers(model, optimizer=args.optimizer,
                                     learning_rate=args.learning_rate,
                                     weight_decay=args.weight_decay)
```
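One asymmetry in my modification worth noting: the `adopt_atan2` branch passes raw `model.parameters()` with a single `weight_decay`, while the `adam_atan2` and `adamw` branches use the decay/no-decay `optim_groups`. The commented-out line is the per-group form; written out it would look like the sketch below (assuming `AdoptAtan2` accepts standard param-group dicts the way the torch optimizers do). Since the reproduce command below passes `-weight_decay 0`, the grouping should not matter for the failure itself.

```python
# Per-group variant of the adopt_atan2 branch (sketch; assumes AdoptAtan2
# accepts the same param-group dicts as torch.optim optimizers).
optimizer = AdoptAtan2(
    optim_groups,
    lr=args.learning_rate,
    betas=(betas[0], 0.999),  # ADOPT is reported to be insensitive to beta2
    cautious_factor=0.,
)
```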
Reproduce by running (it does not converge):

```bash
python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.001 -optimizer adopt_atan2 -weight_decay 0 -gpu 2
```
Comparing to AdamAtan2:

```bash
python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.0001 -optimizer=adam_atan2 -gpu 3
```
and AdamW:

```bash
python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.0001 -optimizer=adamw -gpu 2
```
PS: I have tried the following additional parameter settings and none of them converges:
- `cautious_factor`: 0.
- `weight_decay`: 0.1, 0
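Roughly, the AdoptAtan2 settings I swept over are the following (illustrative sketch using the same names as the snippet above):

```python
# Hypothetical sketch of the settings swept; none of these runs converged.
for wd in (0.1, 0.0):
    optimizer = AdoptAtan2(
        model.parameters(),
        lr=args.learning_rate,
        betas=(betas[0], 0.999),
        weight_decay=wd,
        cautious_factor=0.,
    )
    # ... rerun the adopt_atan2 command above with this optimizer ...
```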