ValueError: Attempting to unscale FP16 gradients. #1031
Comments
However, I am able to run LoRA with fp16 in my other experiments: https://github.com/hengjiUSTC/learn-llm/blob/main/trl_finetune.py#L316. So I am not sure what the expected behavior is.
I found the bug happens when I set:
I'm wondering if we are even supposed to be recasting to fp16. The original qlora only recasts when bf16 is used: https://github.com/artidoro/qlora/blame/main/qlora.py#L396-L405
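For reference, the casting logic at that link follows roughly this pattern (paraphrased, not a verbatim copy; the function wrapper and its name are mine):

```python
import torch
from peft.tuners.lora import LoraLayer

def cast_layers_qlora_style(model, bf16: bool):
    # LoRA layers are downcast only when bf16 training is enabled; norms are
    # kept in fp32 for stability; lm_head / embed_tokens are downcast from
    # fp32 only under bf16. Note there is no fp16 branch at all.
    for name, module in model.named_modules():
        if isinstance(module, LoraLayer) and bf16:
            module.to(torch.bfloat16)
        if "norm" in name:
            module.to(torch.float32)
        if ("lm_head" in name or "embed_tokens" in name) and hasattr(module, "weight"):
            if bf16 and module.weight.dtype == torch.float32:
                module.to(torch.bfloat16)
    return model
```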
@hengjiUSTC if you comment out those lines in axolotl for your configuration above, does that fix the issue?
I am using LoRA instead of QLoRA, so these lines won't be triggered: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L554-L561.
Both load_in_8bit and load_in_4bit are false.
See the relevant discussion in:
Here are some experiments. No error for the two configs below:
I am a bit new to these settings; does anyone know the reason? (I am using a T4 GPU, so I'm not able to use bf16.)
I got confirmation that we should not load the model in float16 when fp16 is enabled in the PEFT config: huggingface/peft#341 (comment). But I do see a lot of code (in other finetune repos) doing this, and it's the reason the error is raised in Axolotl: when fp16 is true in config.yml, the model is loaded in float16 and fp16 is enabled in PEFT.
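To make the failure mode concrete, here is a minimal sketch of that combination (the model name and arguments are placeholders, not the actual config from this issue):

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Base model loaded in float16, so the LoRA adapter weights end up in fp16 too.
model = AutoModelForCausalLM.from_pretrained(
    "some-base-model",              # placeholder
    torch_dtype=torch.float16,
)
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))

# fp16=True turns on AMP with a GradScaler; unscaling the fp16 adapter
# gradients is what raises "ValueError: Attempting to unscale FP16 gradients."
training_args = TrainingArguments(output_dir="out", fp16=True)
```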
I also have these lines because I am using ChatML and adding new tokens to the base model:
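(The exact lines did not come through above; a typical version of them, using standard transformers calls, looks like this:)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-base-model")   # placeholder
model = AutoModelForCausalLM.from_pretrained("some-base-model")

# Register the ChatML markers as special tokens and grow the embedding matrix
# so the new token ids have rows to train.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
model.resize_token_embeddings(len(tokenizer))
```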
Based on what @hengjiUSTC linked, if I understand it correctly, fp16 adapter training must use fp32 for trainable parameters and fp16 for non-trainable ones. They provided a utility function for this.
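A minimal sketch of that idea (the helper name here is mine, not the utility they linked): upcast only the trainable adapter parameters to fp32 and leave the frozen base weights in fp16.

```python
import torch

def cast_trainable_params_to_fp32(model):
    # Trainable (adapter) params go to fp32 so the AMP GradScaler can unscale
    # their gradients; frozen base-model params stay in fp16 to save memory.
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)
    return model
```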
It worked for me by setting --mixed_precision="bf16".
Setting bf16 to True can work.
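For anyone driving the Hugging Face Trainer directly rather than through a config file, the equivalent is a sketch like the one below (bf16 needs hardware support, which the T4 mentioned above lacks):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    bf16=True,   # bf16 mixed precision: no GradScaler, so no fp16 unscale error
    fp16=False,
)
```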
I am using Kaggle's notebook environment with the following specifications:
I attempted full fine-tuning, not LoRA.
Please check that this issue hasn't been reported before.
Expected Behavior
Should run correctly.
Current behaviour
Training crashes with ValueError: Attempting to unscale FP16 gradients.
Steps to reproduce
I use the following config:
and run with
python3 -m axolotl.cli.train mix_tangshi/config.yml
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main commit 3678a6c
Acknowledgements