
adopt_atan2 does not work for this optimization problem #4

Open
@tesla3


I saw you are using AdoptAtan2 in your recent RL codebase, so I assume it is working well for you there. But I encountered a failure case and want to report it so you are aware of it.

I tried both AdamAtan2 and AdoptAtan2 on boolean function in-context learning (repo: https://github.com/satwik77/incontext-bool). AdamAtan2 works very well, but AdoptAtan2 does not, despite trying various configurations.

Modification to src/train.py:

```python
    # filter out those that do not require grad
    param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
    # create optim groups. Any parameter that is 2D will be weight decayed, otherwise not.
    # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
    decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
    nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
    optim_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0}
    ]
    num_decay_params = sum(p.numel() for p in decay_params)
    num_nodecay_params = sum(p.numel() for p in nodecay_params)
    print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
    print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")

    if optimizer == 'adopt_atan2':
        # adopt is insensitive to beta2
        #optimizer = AdoptAtan2(optim_groups, lr=args.learning_rate, betas=(betas[0], 0.999))
        optimizer = AdoptAtan2(model.parameters(), weight_decay=weight_decay, lr=args.learning_rate, betas=(betas[0], 0.999), cautious_factor=0.)
    elif optimizer == 'adam_atan2':
        # decoupled_wd does not work
        optimizer = AdamAtan2(optim_groups, lr=args.learning_rate, betas=betas)
    elif optimizer == 'adamw':
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        extra_args = dict(fused=True) if use_fused else dict()
        print(f"using fused AdamW: {use_fused}")
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
    else:
        raise ValueError(f"Unknown optimizer {optimizer}")

    return optimizer

def train(model, args):
    optimizer = configure_optimizers(model, optimizer=args.optimizer, learning_rate=args.learning_rate, weight_decay=args.weight_decay)
```
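In case it helps to narrow this down outside the repo, below is a minimal standalone sketch of how I would compare the two optimizers in isolation, using the same constructor arguments and learning rates as above. The import path (adam_atan2_pytorch), the toy model, and the stand-in target are assumptions on my part, not code from incontext-bool, and I have not checked whether this tiny setup reproduces the gap.

```python
# Minimal standalone sketch (not from incontext-bool): build both optimizers
# with the same arguments as above and train a tiny model on a stand-in
# conjunction-like target. The import path below is my assumption.
import torch
from torch import nn
from adam_atan2_pytorch import AdamAtan2, AdoptAtan2

def run(make_opt, steps=2000, seed=0):
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(28, 256), nn.ReLU(), nn.Linear(256, 1))
    opt = make_opt(model.parameters())
    x = torch.randn(4096, 28)
    # label = 1 iff the first three inputs are all positive (toy conjunction)
    y = (x[:, :3] > 0).all(dim=-1, keepdim=True).float()
    for _ in range(steps):
        loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# learning rates mirror the commands below: 1e-4 for adam_atan2, 1e-3 for adopt_atan2
print('adam_atan2 :', run(lambda p: AdamAtan2(p, lr=1e-4, betas=(0.9, 0.99))))
print('adopt_atan2:', run(lambda p: AdoptAtan2(p, lr=1e-3, betas=(0.9, 0.999),
                                               weight_decay=0., cautious_factor=0.)))
```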

Reproduce by running (it does not converge):

python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.001 -optimizer adopt_atan2 -weight_decay 0 -gpu 2

Comparing to AdamAtan2:

python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.0001 -optimizer=adam_atan2 -gpu 3

and AdamW:

python -m src.train -project in-context-learning -name test -family san -model_name gpt2 -task conjunction -data boolean -train_steps 15000 -n_dims 28 -n_embd 256 -n_layer 12 -n_head 8 -batch_size 64 -learning_rate 0.0001 -optimizer=adamw -gpu 2

PS: I have also tried the following additional parameter settings, and none of the runs converges (a sketch of the combinations is after this list):
cautious_factor: 0.
weight_decay: 0.1, 0
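
To be explicit about what that covers, the AdoptAtan2 variants I tried are roughly the combinations sketched below (optimizer construction only; model, args, and betas refer to the same objects as in the snippet above).

```python
# Sketch of the AdoptAtan2 settings tried; none of these runs converged.
# model, args and betas refer to the same objects as in the snippet above.
for wd in (0.1, 0.):
    optimizer = AdoptAtan2(
        model.parameters(),
        lr=args.learning_rate,
        betas=(betas[0], 0.999),
        weight_decay=wd,
        cautious_factor=0.,   # same value as in the main run
    )
    # ...then run the usual training loop with this optimizer...
```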
