v0.44.0: New AdEMAMix optimizer, Embeddings quantization, and more! #1375
Titus-von-Koeller announced in Announcements
New optimizer: AdEMAMix
The AdEMAMix optimizer is a modification of AdamW that tracks two EMAs (exponential moving averages) to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.
We've implemented 8-bit and paged variants: AdEMAMix, AdEMAMix8bit, PagedAdEMAMix, and PagedAdEMAMix8bit. These can be used with a similar API to the existing optimizers.
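As a rough sketch of what this looks like in practice (assuming the new classes are exposed under bnb.optim like the existing bitsandbytes optimizers; AdEMAMix-specific hyperparameters are left at their defaults here, since their exact names may differ):

```python
# Minimal sketch: dropping the paged 8-bit AdEMAMix in where AdamW8bit would go.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(512, 512).cuda()
optimizer = bnb.optim.PagedAdEMAMix8bit(model.parameters(), lr=1e-4)

x = torch.randn(8, 512, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```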
8-bit Optimizers Update
The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This departs from the original implementation proposed in the paper and improves accuracy.
CUDA Graphs support
A fix to enable CUDA Graphs capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee!
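The sketch below illustrates the general CUDA Graphs capture pattern around a quantized forward pass; it assumes a bitsandbytes 4-bit linear layer is now capturable after #1330, and the shapes are purely illustrative.

```python
# Minimal sketch: capturing a quantized forward pass in a CUDA graph.
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(128, 128, compute_dtype=torch.float16).cuda()
x = torch.randn(16, 128, device="cuda", dtype=torch.float16)

# Warm up on a side stream, as required before CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        _ = layer(x)
torch.cuda.current_stream().wait_stream(s)

# Capture once, then replay the recorded kernels with updated inputs.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    y = layer(x)

x.copy_(torch.randn_like(x))
g.replay()  # y now holds the output for the new contents of x
```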
Quantization for Embeddings
The trend of LLMs using ever larger vocabularies continues, and the embeddings can take up a significant portion of a quantized model's footprint. We now have implementations of Embedding4bit and Embedding8bit, thanks to @galqiwi! Example usage:
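A minimal sketch, assuming Embedding4bit mirrors torch.nn.Embedding's constructor and can load its state_dict (exact signatures may differ):

```python
# Minimal sketch: swapping an fp16 embedding for a 4-bit quantized one.
import torch
import torch.nn as nn
import bitsandbytes as bnb

fp16_embedding = nn.Embedding(num_embeddings=32000, embedding_dim=4096)
quantized_embedding = bnb.nn.Embedding4bit(num_embeddings=32000, embedding_dim=4096)

# Reuse the fp16 weights; they are quantized when the module moves to CUDA.
quantized_embedding.load_state_dict(fp16_embedding.state_dict())
quantized_embedding = quantized_embedding.cuda()

input_ids = torch.randint(0, 32000, (2, 16), device="cuda")
hidden_states = quantized_embedding(input_ids)  # dequantized lookup on the fly
```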
Continuous Builds
We are now building binary wheels for each change on main. These builds can be used to preview upcoming changes (see 🚤 Continuous Build).
What's Changed
- Add move_to_device kwarg to the optimizer's load_state_dict by @koute in #1344
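As a hedged sketch of how the new kwarg might be used, assuming move_to_device controls whether the loaded optimizer state tensors are moved to the parameters' device (its exact default and semantics are documented on the optimizer):

```python
# Minimal sketch: restoring optimizer state without forcing it onto the GPU.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(128, 10).cuda()
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

# ... train for a while, then checkpoint ...
checkpoint = optimizer.state_dict()

# Restore later, leaving the loaded tensors on whatever device they were saved from.
optimizer.load_state_dict(checkpoint, move_to_device=False)
```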
New Contributors
- @koute made their first contribution in #1344
Full Changelog: 0.43.3...v0.44.0