Our low-bit optimizers were merged into HF in huggingface/transformers#31865, but we have a known limitation: the 4-bit optimizer trains more slowly when the learning rate is not constant.
This is mentioned in the README https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim:

> Known issue: When learning rate is updated every step (e.g. using cosine learning rate scheduler), training speed is slower. This is because we have to convert learning rate to a CUDA tensor (which incurs expensive memory transfer cost), since torch.compile() will treat a Python float as a constant and trigger recompile whenever the value is changed.
However, this is preventing @winglian from adopting this work.
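
For context, here is a minimal sketch of the mechanism described in the README note. The `apply_update` function is hypothetical (not the torchao optimizer step): it just illustrates that torch.compile specializes on a Python-float learning rate and recompiles when the value changes, whereas keeping the learning rate in a tensor avoids recompiles at the cost of a memory transfer per update.

```python
import math

import torch


@torch.compile
def apply_update(param, grad, lr):
    # `lr` may be a Python float (torch.compile specializes on the value and
    # recompiles when it changes) or a 0-dim tensor (compiled once).
    param.sub_(lr * grad)


device = "cuda" if torch.cuda.is_available() else "cpu"
param = torch.randn(1024, device=device)
grad = torch.randn_like(param)

# Slow pattern: a Python-float lr that changes every step (e.g. a cosine
# schedule) triggers a recompile for each new value.
for step in range(5):
    lr = 0.01 * (1 + math.cos(math.pi * step / 5)) / 2
    apply_update(param, grad, lr)

# Pattern that avoids recompiles: keep lr in a tensor on the same device and
# write each new value into it in-place. No recompile, but every update incurs
# a host-to-device memory transfer when running on CUDA.
lr_tensor = torch.tensor(0.01, device=device)
for step in range(5):
    lr_tensor.fill_(0.01 * (1 + math.cos(math.pi * step / 5)) / 2)
    apply_update(param, grad, lr_tensor)
```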