
falcon arch fix for tied output embeddings #4978

Merged

Conversation

cmp-nct (Contributor) commented Jan 16, 2024

The latest Falcon finetune from OpenBuddy merges the input/output embedding tensors; previously the output projection was stored separately as lm_head.
Here it is: https://huggingface.co/OpenBuddy/openbuddy-falcon-40b-v16.1-4k

This PR falls back to the input embeddings when no output embeddings are available.
Tested on WizardLM 40B (separate) and OpenBuddy 40B (tied); both work.
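For readers unfamiliar with tied embeddings, here is a minimal, self-contained sketch of the idea in plain C++ (illustrative only, no ggml types, not code from this PR): the same [n_vocab x n_embd] matrix serves both as the input token lookup and, via a matrix-vector product, as the output projection when no separate lm_head tensor exists.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n_vocab = 4, n_embd = 3;

    // one shared weight matrix; row i is the embedding of token i
    std::vector<float> tok_embd = {
        0.1f, 0.2f, 0.3f,
        0.4f, 0.5f, 0.6f,
        0.7f, 0.8f, 0.9f,
        1.0f, 1.1f, 1.2f,
    };

    // "input" use: look up the embedding of token 2 (stands in for the hidden state)
    const float * h = &tok_embd[2 * n_embd];

    // "output" use: logits = tok_embd * h -- no separate lm_head weights needed
    for (int v = 0; v < n_vocab; ++v) {
        float logit = 0.0f;
        for (int e = 0; e < n_embd; ++e) {
            logit += tok_embd[v * n_embd + e] * h[e];
        }
        std::printf("logit[%d] = %.3f\n", v, logit);
    }
    return 0;
}
```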

cmp-nct (Contributor, Author) commented Jan 16, 2024

One thing is strange though, and not super simple to debug:
ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q2_K.gguf -> inference is 35 tk/sec
openbuddy-falcon-40b-v16.1-4k\ggml-model-Q2_K.gguf -> inference is 25 tk/sec

That's despite the OpenBuddy variant being a bit smaller in total size. Maybe the tensor sizes are simply no longer optimal for the llama.cpp kernels; something is slowing it down even though the quantization and architecture are the same (except for the output tensor).

slaren (Collaborator) commented Jan 16, 2024

tok_embd is always allocated on the CPU, so doing it this way means the output layer cannot be offloaded. To avoid the performance degradation, the tensor would need to be copied to the GPU backend instead, which can be done with ggml_backend_tensor_copy.
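A minimal sketch of that suggestion (illustrative only; tok_embd_cpu and output_gpu are assumed names, not variables from this PR):

```cpp
// Assumes output_gpu has the same shape and type as tok_embd_cpu and has
// already been allocated in a GPU backend buffer; ggml_backend_tensor_copy
// then copies the data across backends so the output layer can stay offloaded.
ggml_backend_tensor_copy(tok_embd_cpu, output_gpu);
// model.output can then point at output_gpu instead of the CPU-resident tok_embd.
```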

cmp-nct (Contributor, Author) commented Jan 16, 2024

> tok_embd is always allocated on the CPU, so doing it this way means the output layer cannot be offloaded. To avoid the performance degradation, the tensor would need to be copied to the GPU backend instead, which can be done with ggml_backend_tensor_copy.

Thanks, I didn't think about that!
I now create the tensor in split mode. I had to adapt the n_tensors counter, as this now results in one additional tensor.

Performance is now at 35 tokens/sec!

ggerganov (Owner)

Hm cool! Will wait for slaren to confirm this is fine before merging

cebtenzzre (Collaborator)

Does this mean that we can revive #3626?

llama.cpp (outdated), comment on lines 3440 to 3443:

    } else {
        model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}); // needs to be on GPU
        ml.n_tensors++; // artificial tensor
    }
Collaborator:
This should also work, but I would prefer decreasing ml.n_created instead of increasing ml.n_tensors.

cmp-nct (Contributor, Author), Jan 17, 2024:

I was thinking that, since we actually have one more tensor in use than in the model file, it's better to increase that variable. But we can change it to ml.n_created--;
Should I change it? You can do it as well, of course :)

ggerganov (Owner):

Let's change it as suggested and merge then

cmp-nct (Contributor, Author):

changed and tested it
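
For reference, a sketch of what the resolved block looks like with that change applied (based on the snippet quoted earlier in the thread; the exact merged diff may differ slightly):

```cpp
} else {
    // no separate output (lm_head) tensor in the GGUF: reuse the token
    // embeddings as the output weights, loaded into the split/GPU context
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}); // needs to be on GPU
    ml.n_created--; // artificial tensor: the same GGUF entry is loaded twice
}
```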

slaren (Collaborator) commented Jan 17, 2024

@cebtenzzre yep, this should be doable now.

ggerganov merged commit 57e2a7a into ggerganov:master on Jan 18, 2024 (30 of 43 checks passed).
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* falcon arch fix for tied output embeddings

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024 (same commit message as above).