
finetune: Add -ngl parameter #3762

Merged: 16 commits into ggerganov:master from AndrewGodfrey:finetune_enableGpu on Nov 1, 2023

Conversation

@AndrewGodfrey (Contributor)

Fix #3458

When I tried CUDA offloading during finetuning, following the readme, I got an assert in ggml_cuda_op_add.
This probably isn't an important case, because inference later gives a warning saying you should use an f16 or f32 base model instead when using a LoRA:
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
@ggerganov (Owner)

The train and finetune examples do not support GPU offloading. There is no point in adding a -ngl parameter.

@AndrewGodfrey (Contributor, Author)

> The train and finetune examples do not support GPU offloading. There is no point in adding a -ngl parameter.

I believe it did work for me (with these changes) for f32, with an f16 base model. The result was something I couldn't use directly with --lora and GPU inference, but I could use it with --lora and CPU inference, and it did work (I verified that it had finetuned). The only way your statement could be true, afaik, is if 'finetune' didn't actually offload to the GPU when it told me it did (I offloaded 25 out of 29 layers).

@AndrewGodfrey (Contributor, Author)

@ggerganov, what I can say is that it schedules work on the GPU, uses a bunch of VRAM, and, while I was developing, it hit not-implemented code paths (hence the code I added). The console output says it is using the GPU, and it runs faster (though I don't know if it's doing the same amount of work; it produces a LoRA that works, but is somehow different in format from the one produced by the CPU run).

Are you sure GPU offloading isn't supported? It seems hard to believe that it can do all this AND do the 'real' work on the CPU AND produce a different result AND run faster.

@kotee4ko left a comment

Nice work.

(Review comment on examples/finetune/finetune.sh: outdated, resolved.)
@ggerganov (Owner)

Ah, interesting. I hadn't realized that we load the base model with the standard llama.h API, which can offload some or all of the tensors to the GPU. Later, when doing the finetuning, some of the ops can benefit from the base tensors already being on the GPU, although the results from these ops are likely copied back to the CPU. So this is far from full offloading, but it can actually provide an improvement in speed.
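For reference, here is a minimal sketch of that load path, assuming the llama.h API of this period (llama_model_default_params and its n_gpu_layers field); the model path and layer count are placeholders rather than values from this PR:

```c
// Sketch: load the base model with partial GPU offload via the standard llama.h API.
// The finetune example reuses this loading path, so some base tensors may already be
// in VRAM when the training graphs run. Assumes the late-2023 llama.h API; the path
// is a placeholder.
#include "llama.h"

int main(void) {
    llama_backend_init(false);                      // false = no NUMA optimizations

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 25;                      // what `-ngl 25` maps to

    struct llama_model * model = llama_load_model_from_file("base-model-f16.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    // ... the finetune example builds its training graphs on top of these tensors;
    // results of ops that run on the GPU are copied back to the CPU, so the
    // offloading is only partial ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```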

@ggerganov (Owner) left a comment

Use spaces instead of tabs and we can merge

@AndrewGodfrey (Contributor, Author)

I don't know why it still thinks there are conflicts in ggml.c - I merged from master. And when I click 'Resolve conflicts' I get a weird information-devoid view of ggml.c (which looks the same as before I merged from master).

@cebtenzzre (Collaborator) commented Oct 31, 2023

> I don't know why it still thinks there are conflicts in ggml.c - I merged from master.

It looks like you merged master from your repo and not from this one.

edit: merged.

@ggerganov merged commit 73bdcb3 into ggerganov:master Nov 1, 2023
23 of 32 checks passed
ggerganov added a commit that referenced this pull request Nov 1, 2023
@AndrewGodfrey deleted the finetune_enableGpu branch November 1, 2023 17:19
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
* Add '-ngl' support to finetune.cpp

* Add fprintf in ggml_cuda_op_add

When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora

* Add 'finetune.sh', which currently fails when using GPU

"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"

* tweak finetune.sh

* Suppress some warnings in ggml.c

* Add f16 implementation to ggml_compute_forward_add_f16_f32

* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs

* finetune.sh: Edit comments

* Add "add_f16_f32_f32_cuda"

* Tweak an error message

* finetune.sh: Add an optional LLAMA_MODEL_DIR variable

* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable

* train : minor

* tabs to spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
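
The "Add f16 implementation to ggml_compute_forward_add_f16_f32" and "add_f16_f32_f32_cuda" commits above address the f16 base-model case. As a rough illustration of the idea, not the actual diff, an element-wise add over an f16 base tensor and an f32 operand can convert through float using ggml's public fp16 helpers (the function name below is made up):

```c
// Illustration only (not this PR's code): add an f32 tensor's data into an f16
// tensor's data by round-tripping through float with ggml's fp16 helpers.
#include "ggml.h"

static void add_f16_f32_example(const ggml_fp16_t * src0, const float * src1,
                                ggml_fp16_t * dst, int64_t n) {
    for (int64_t i = 0; i < n; i++) {
        dst[i] = ggml_fp32_to_fp16(ggml_fp16_to_fp32(src0[i]) + src1[i]);
    }
}
```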