llama : add Falcon LLM support #1602
From what I read (I've not tested it), the model seems significantly better than llama. While it has a kind of shitty license for commercial use (free until $1M/year revenue, then 10% royalties), that's better than illegal. It's using flash attention and multi-query attention. gg already has branches with flash attention.
I've just invested almost an hour of prompting into Instruct Falcon 40B and it's significantly smarter than OpenAssistant 30B, despite being less well tuned.
There's a guy who provided a q4b version of Falcon 7B - would it be of some use for llama.cpp?
Falcon has the full precision binaries available on Huggingface. From there it should start; the pre-quantized versions are not useful imho. I'm not 100% sure yet, but from my tests I believe that we have a superior successor to llama on our hands that covers all our use cases (from small to large). It solved riddles that Turbo, Alpaca and OpenAssistant 30B cannot solve. Carefully said: it looks like the 40B Falcon might outperform the largest 65B llama (it does so in the benchmarks).
I don't know why I'm not able to convert it to .ggml, like other models.
Because it is a different type of model. LLaMA-based models have a certain structure. Falcon is not based on LLaMA: there's a different set of tensors, the tensors have different names, etc. The conversion app can't handle Falcon models yet.
@KerfuffleV2 can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say, GPT-3? Will be super grateful!
How much of the work done in this repo could easily be transferred to future models and architectures? It looks like the happy days of the original LLaMA models may soon be over, as they start to get beaten by models with different architectures and more attractive licensing (see the Open LLM Leaderboard). As the flora of LLM architectures continues to grow and new ones replace the old, I think this repo and the LLM examples in the ggml repo should be merged into something like ggml_llm. The ggml_llm would contain all the common LLM code (main inference / perplexity / file handling / quantization / sampling ...) and the code for each architecture could be added like plugins at compile time. The gpt4all-backend may be a good starting point for how such a structure could be built.
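To make the idea concrete, here is a rough sketch of what a compile-time architecture registry for such a ggml_llm core could look like. All of these names are hypothetical - none of them exist in llama.cpp or ggml today:

```cpp
// Hypothetical sketch of an "architecture plugin" registry for a shared ggml_llm
// core. All names are made up for illustration; this is not existing code.
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct llm_arch {
    virtual ~llm_arch() = default;
    virtual bool load(const std::string & model_path) = 0;              // read architecture-specific tensors
    virtual std::vector<float> eval(const std::vector<int> & toks) = 0; // run one forward pass, return logits
};

using llm_factory = std::function<std::unique_ptr<llm_arch>()>;

// Registry shared by the common code (sampling, perplexity, quantization, ...).
static std::map<std::string, llm_factory> & llm_registry() {
    static std::map<std::string, llm_factory> reg;
    return reg;
}

// Each architecture's translation unit registers itself when it is compiled in, e.g.:
//   static llm_register falcon_reg("falcon", [] { return std::make_unique<falcon_arch>(); });
struct llm_register {
    llm_register(const std::string & name, llm_factory f) { llm_registry()[name] = std::move(f); }
};
```

The common inference / perplexity / quantization code would then only ever talk to the `llm_arch` interface and pick the concrete implementation by name.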
I don't want to get too off-topic here, so if you want detailed information you'd probably be better off creating a discussion. I also don't really know the specific architecture of GPT-3, etc., so I can't tell you the exact way two specific types of model differ, just provide some general information. This is a bit simplified, but a model consists of a bunch of tensors (just big arrays of numbers in various dimensions). The tensors generally have names that identify their role. Anyway, to actually run a model one performs a bunch of math operations on those tensors. Some of the operations are simple, like addition and multiplication; some are more complex and can have complicated logic internally, like rope, alibi, matrix multiplication, etc. Which tensors exist in a model and what sequence of math operations is used to evaluate it depend on the model architecture. A LLaMA-based model has one set of tensor names, while a Falcon model has a different set. The code in something like this project which evaluates a type of model it supports (say LLaMA, for example) is set up to look for tensors with specific names, grab that data, perform the various operations in the correct order, and it also expects the result from those operations to be in a specific format. Hopefully this makes it clearer why specific support needs to be added to ML tools for models that actually have a different architecture.
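As a small illustration of the point above (a simplified sketch, not the actual llama.cpp loader code): support for an architecture largely boils down to looking up tensors by the exact names that architecture uses, so a checkpoint with differently named tensors simply fails the lookups.

```cpp
// Simplified illustration (not the real llama.cpp loader): the loader expects
// the tensor names of the architecture it was written for; a model produced by
// a different architecture fails these lookups.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct tensor_data {
    std::vector<float>   values;
    std::vector<int64_t> shape;
};

// Tensors read from a model file, keyed by name.
using tensor_map = std::map<std::string, tensor_data>;

static const tensor_data * find_tensor(const tensor_map & model, const std::string & name) {
    auto it = model.find(name);
    if (it == model.end()) {
        std::fprintf(stderr, "missing tensor '%s' - different architecture?\n", name.c_str());
        return nullptr;
    }
    return &it->second;
}

static bool looks_like_llama(const tensor_map & model) {
    // A few of the names a LLaMA-style checkpoint contains (illustrative, not exhaustive).
    // A Falcon checkpoint uses a different set of names, so this returns false for it.
    for (const char * name : { "tok_embeddings.weight", "norm.weight", "output.weight" }) {
        if (!find_tensor(model, name)) return false;
    }
    return true;
}
```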
Thanks @KerfuffleV2, this is exactly what I was looking for!
I took a look and Falcon is Bloom-based, uses GPT-NeoX rotary embeddings and GELU activation. Though it looks like a bit of a nightmare to adapt everything :(
Can bloomz.cpp run this model?
Not without adaptation; I've not looked into the differences (aside from the parameter and layer counts) but there certainly are some.
As of 3 hours ago, they tweeted that they will forgo any royalties for commercial and research uses. I don't know what this means in practice, but Falcon might become the first capable genuinely open-source model we get.
They've just updated their Huggingface to confirm that the models are now available under Apache 2.0: https://huggingface.co/tiiuae
According to their announcement on the official site, it's the Falcon 40B that is now under Apache 2.0. Not sure if they intend to do the same for the smaller models, or if they plan an even larger, license-restricted one.
They updated the main page, not the model pages yet. They are just a bit slow to follow up, but it looks like we get a fully open-source model. Best thing ever exported from Abu Dhabi?
All models and datasets from them are now confirmed to be Apache 2.0. The model repositories still contain the old license.txt, but the models themselves are tagged Apache.
I was actually able to convert, quantize and load the model, but there is some tensor math to debug and modify, and I have no 40GB GPU to debug the tensor values at each layer, so it produces garbage for now. I can give you the quantized model if you want to continue my work.
Great work!
@klosax it is still too big! To debug the weights, the model needs to be loaded in fp16 on the GPU. This means that a 24GB GPU is needed in the case of the 7B model, and I do not possess one.
Truthfully though, the initial Falcon work should be done on 7B to ease development; I think the architecture is the same regardless of model size. If it gets traction, I'm sure someone with a big GPU will hop in and help with the 40B 🤗 Like it or not, Llama is limited by its legality, and truly open models like Falcon are the way forward for llama.cpp.
@nikisalli: On the model card it says "head_dim 64 Reduced to optimise for FlashAttention" but in the config.json the number is 128. Maybe try reducing it to 64?
@nikisalli what do you need the GPU for? Why not CPU? ggml/llama.cpp is known for its ability to run on CPU, after all...
I find it useful to run the PyTorch model with many print statements here and there to check that ggml is giving me the same numbers, so that I know which operations to touch.
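For the ggml side of that comparison, here is a minimal sketch (assuming the tensor holds F32 data and that you keep a pointer to the graph node of interest) of dumping the first few values so they can be checked against the PyTorch prints:

```cpp
// Minimal debugging helper sketch: print the first few values of an F32 ggml
// tensor after graph evaluation, for comparison against the PyTorch reference.
#include <cstdio>
#include "ggml.h"

static void dump_tensor_head(const struct ggml_tensor * t, int n) {
    const float * data  = ggml_get_data_f32(t);   // assumes the tensor holds F32 data
    const int64_t count = ggml_nelements(t);
    std::printf("%s: ne = [%lld, %lld, %lld, %lld]\n", t->name,
                (long long) t->ne[0], (long long) t->ne[1],
                (long long) t->ne[2], (long long) t->ne[3]);
    for (int64_t i = 0; i < n && i < count; ++i) {
        std::printf("  [%lld] = %f\n", (long long) i, data[i]);
    }
}
```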
Awesome to see! I was halfway there, glad I stopped given you were successful! Just to be sure I get it all: you modified the broadcast PR to source for the kv head, right? Tensor math is not my strong point, no idea if those words are fitting.
I guess so. Consider a case of n_head=128 and n_head_kv=8. There are 8 kv head groups, each with 16 queries and 1 kv pair reused by all the 16 queries within the same kv head group. So what this modification achieves is that for any fixed head group index, the same key row index is picked from the src0 matrix to multiply with any queries that belong to the same kv head group.
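To illustrate the grouping (a hypothetical helper, not the actual broadcast kernel): with n_head query heads and n_head_kv kv heads, each block of n_head/n_head_kv consecutive query heads shares one key/value head.

```cpp
// Illustration of the multi-query head grouping described above (not the actual
// ggml kernel). With n_head = 128 and n_head_kv = 8 there are 8 kv groups of
// 16 query heads each; every head in a group reuses the same key/value pair.
#include <cassert>

static int kv_group_for_head(int head, int n_head, int n_head_kv) {
    assert(n_head % n_head_kv == 0);
    const int group_size = n_head / n_head_kv;  // query heads per kv group (16 in the example)
    return head / group_size;                   // index of the kv head this query head reads from
}
// e.g. heads 0..15 -> kv head 0, heads 16..31 -> kv head 1, ..., heads 112..127 -> kv head 7
```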
I'm not sure at this point whether two different modes are really needed or if we could accept the "falcon hack" (maybe generalizing it further) as the "right" way of broadcasting. I think we should aim to reproduce the behavior of torch.broadcast_to, which my repeat2 function was intended to imitate (but we do have to maintain backward compatibility with repeat as well, so maybe you're right about the modes). But I can't claim I am very familiar with the exact semantics of torch's implementation of broadcasting either. Perhaps @ggerganov can comment as well - the fix-mul-mat branch was originally intended for another use case, and it would be good to find out whether the "falcon hack" would affect it negatively, be neutral, or maybe even be positive.
@jploski I'll test if the modified broadcast works for SAM - I think it might.
@ggerganov Just some thoughts... Working with @cmp-nct's branch, I got Falcon-40B running on local hardware, and I'm following this branch as well, but it is getting confusing to keep up with the latest. Is it not maybe time to create a single ggmlLLM.cpp (vs llama.cpp) that all the contributors can make pull requests against? Time to start centralizing all this work? I'm running converts from 3 repos every other day to keep up ;) I want to move on to the 40B-instruct and start fine-tuning the ggml versions, but was hoping to see the consolidation first. Great work from all parties. It might help to bring more good people into the effort if the llama.cpp project were a little more generic across models?
@linuxmagic-mp that's the plan already, check #1991 and ggerganov/ggml#220.
Yes, as @slaren mentioned. Anyway, I am thinking about this and will eventually figure out how to improve it.
@ggerganov Understood, of course. I guess the actual suggestion was that your work, and the contributors' work, has now gone far beyond "Llama", so maybe a name change is in order to bring even more contributors to the main branch.. ;) It'll get confusing to the masses when llama.cpp can/will be used for so many other models. ;) Hey, it's a work in progress, and I think everyone is already amazed at the leaps and bounds almost daily.
Also impossible for downstream projects. Imagine doing llava.cpp: now you need both clip.cpp and llama.cpp, neither of which has ggml as a submodule.
Yeah, the more we talk about it, the more sense it makes.
It sounds like a monorepo would make everything easier if we can create the right folder structure, even integrating SAM and Whisper. We can have multiple header files and multiple libs for downstream applications to pick from.
Not sure if this is the right issue for it or if it should be a separate issue, but I'd also like to +1 a monorepo. Subtrees are very troublesome and prone to breaking, and the semantics are hard to understand, while submodules are annoying for developers who are working on both repositories at the same time.
Just curious: I see this is on the roadmap, and @ggerganov has flagged it for more help. It might be nice to get a head start by creating a list of what's needed for the 'more help' part.. @cmp-nct what do you think?
We need to finish that first. In the meantime, we should simplify the convert / load implementation.
Following up on this - I tested with current SAM inference and it still works, so I think the change is good and I will upstream it.
@jploski As far as my understanding goes, the Falcon model utilizes the FlashAttention technique, as mentioned in the FlashAttention paper. I was wondering why your code in this context does not incorporate ggml_flash_attn() when performing the QKV calculation?
Simply, I was unaware of FlashAttention. The Python version appears to only utilize it through the scaled_dot_product_attention function, which can be backed by FlashAttention depending on the PyTorch version. Anyhow, I suspect that in order to use FlashAttention with Multi-Query Attention (for n_head_kv > 1, i.e. in the 40B model) the key vector would need to be explicitly ggml_repeat2-ed again, which is something we managed to get rid of through a fused broadcast-matrix-multiplication kernel. Note that the most current version of the Falcon implementation is here: https://github.com/cmp-nct/ggllm.cpp/blob/master/libfalcon.cpp#L2153 It would be interesting to see if ggml_repeat2 + ggml_flash_attn would work and make it perform better or worse, but I cannot examine it myself at present.
I'll need to dig into the original modelling.py again to verify, but I believe you did not miss any flash attention code. It wasn't in there afaik.
What I mean is that modelling_RW.py uses torch.nn.functional.scaled_dot_product_attention, and this in turn uses the FlashAttention algorithm by default in newer torch versions (which can be disabled - https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.enable_flash_sdp)
Darn. I didn't know that.
I implemented it here: jploski/ggml@fac72a28 (new branch falcon40b-flash based off falcon40b in my original ggml fork) - the ggml_repeat2 + ggml_flash_attn version turns out to be 16% slower than the falcon40b-norepeat branch on CPU for 2048-token generation using a falcon40b mini-model. I can't easily test on GPU right now, but I suspect it won't be much better due to the ggml_repeat2 overhead.
Hi, is it possible to run Falcon models with llama.cpp?
Llama.cpp only supports Llama-based models - hence the fancy name ;)
Oh, thanks :) I hope they support Metal soon!
Closed via #2717
Falcon LLM 40B and 7B were just open-sourced under a license which allows commercial use (with royalties for over $1 million revenue per year) and are topping the Huggingface Open LLM leaderboard. It seems to be based on a modified GPT-3 architecture. I'm wondering if support in llama.cpp would be considered. https://huggingface.co/tiiuae/falcon-40b