finetune: -ngl option to offload to gpu? #3458
Comments
Same issue here.
If you build with cuBLAS, finetuning might be slightly faster. But I don't think the CUDA optimization has been written yet.
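In case it helps anyone trying this, a minimal sketch of the cuBLAS build as it was documented around this time (the flag name may have changed in newer trees, so check the README on your checkout):

```bash
# Build llama.cpp with cuBLAS support (flag name as used around this time;
# check the current README if it has since been renamed).
make clean
make LLAMA_CUBLAS=1

# Or via CMake:
# cmake -B build -DLLAMA_CUBLAS=ON && cmake --build build --config Release
```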
I'm not sure if I should be posting it to this issue specifically, but since it might be useful: the behavior differs between commits. In commit db3abcc, and up to the latest master at the time I wrote this comment, it is different from commit eee42c6 and earlier, where I can do CPU finetuning.
I get some utilization in nvidia-smi after building with cuBLAS (not insignificant, but not a lot).
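(For anyone reproducing this, I just watched utilization in a second terminal while finetune was running, e.g.:)

```bash
# Refresh GPU utilization / VRAM usage once per second while finetune runs.
watch -n 1 nvidia-smi
```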
If anyone has any idea of the best way to start thinking about fixing this, let me know. I want to get this to work. I know C++ and llama well enough, but not CUDA so much; willing to learn and work on this though.
This is now eligible for a bounty: OpenAgentsInc/workerbee#15. You can DM me for a negotiated amount.
Which codepath is most interesting here? f16 is slower, but running inference on q8_0 gives this warning:
I think f16 seems quite reasonable. Most people who are doing fine-tuning are using larger GPUs, so it really just comes down to leveraging as much of the GPU as you can. We're doing f16 and f32 now, and we let the user choose. Also, we produce two outputs: I think our users like the merged GGUF as opposed to the adapter, although there are advantages to both, so we just generate both and let them download what they want.
Well, FWIW, I have something working locally that does f32. So I suppose this is of use for someone who actually wants an f32 finetuned model. P.S. I found that CPU finetuning was a bit faster than GPU finetuning and used a lot less memory, presumably because CPU finetuning uses a smaller intermediate format than f32. This suggests that enabling GPU finetuning right now is worthless, but perhaps not: maybe it is worthwhile to someone with a lot more VRAM. The openllama v2 3B model is a bit big for my machine when running at f32. OTOH, my CPU doesn't have AVX.
I suppose you could convert that f32 to an f16 afterward, and/or quantize as needed?
That makes it kinda not helpful. Maybe a conversion step fixes this?
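If it helps, the conversion step being discussed would presumably just be the stock quantize tool, something like this (file names are made up, and I haven't verified the f32 path end to end):

```bash
# Convert an f32 finetuned model down to f16, or straight to a quantized type.
# quantize ships with llama.cpp; run it with no arguments to list the type
# names supported by your build before trusting the ones used here.
./quantize my-finetuned-f32.gguf my-finetuned-f16.gguf f16
./quantize my-finetuned-f32.gguf my-finetuned-q8_0.gguf q8_0
```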
Hmm, it should be fine. The real issue is space: fine-tuning is probably best if it fits in the GPU and only "spills over" a few layers to the CPU. When I fine-tune in PyTorch, I typically use load_in_8bit=True, and when I'm testing stuff, I might run 4-bit (just to see how it goes, and whether it's learning what I want it to learn). I rarely have room for 16-bit.
It's definitely a good step, but most people seem to use models in f16, q8, or lower quants.
Yes, if you're running f32, it's going to be slower. I still think a small step is worth merging (I'm a firm believer in baby steps when it comes to code), especially if there are good tests, but I doubt it will be used much until f16 and even quantized fine-tunes are supported.
Looking at the code some more, I see that intermediates are f32 in all implementations (e.g. ggml_mul_mat has a hardcoded GGML_TYPE_F32 for its result). So I was wrong earlier in thinking the intermediate format is different between CPU and GPU. This is really just about downconverting to LoRA deltas, which I've not looked at yet. I'm not quite sure what Johannes meant by "Currently the CUDA code runs everything as f32 by default".
Thanks, then that was probably a bug / user error.
Indeed, even torch doesn't fine-tune well across GPU and CPU!
We may be close to having it now! @AndrewGodfrey thanks!
It would be amazing, but also I feel like somehow predictable, that llama.cpp beats torch at getting quantized fine-tuning to work with a GPU across multiple operating systems.
Note: As detailed here, finetune now has an "-ngl" option and it does offload some of the work to the GPU. But a lot of the training work is still done on the CPU, so it barely helps, and in some cases runs slower.
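For anyone landing here later, the invocation looks roughly like this. The flags other than -ngl are from memory of the finetune example of this era, so treat it as a sketch and confirm the names on your build:

```bash
# Finetune a LoRA adapter, offloading some layers to the GPU via -ngl.
# Flag names are from the finetune example around this time; run ./finetune --help
# to confirm them on your build.
./finetune \
  --model-base open-llama-3b-v2-q8_0.gguf \
  --train-data shakespeare.txt \
  --lora-out lora-shakespeare.gguf \
  --threads 8 --adam-iter 30 --batch 4 --ctx 64 \
  -ngl 10   # number of layers to offload; as noted above, gains are modest
```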
How do you offload layers to the GPU with finetune? There is no -ngl option.