Add support for Mixtral 8x7B, or more generic support for MoE models #359

I don't know if this is possible, but I'll just create this for the future :)

Comments
There's a discussion in #357. It depends on the llama.cpp implementation, and we'll support it once llama.cpp supports it.
Relevant upstream issues:
I saw that the llama libraries were updated in the LLamaSharp repo and tried it. Loading the weights took over a minute and 42 GB out of 128 GB of memory, at 80% CPU and 28% GPU, and then threw a native load failed exception.
Could you link the GGUF file you were trying to use? I'll see if I can reproduce the problem.
Thanks @martindevans. These are my parameters:

CPU: i9 13th Gen, GPU: RTX 4090, RAM: 128 GB
I'm downloading it now, but it's going to take a while! However, I've actually been testing with the Q5_K_M model from that same repo, so I'm expecting it to work. I'd suggest getting rid of all your params options there; most of them are set automatically, and you shouldn't need to change them unless you have a good reason to override the defaults. The only one you do actually need to set is the
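For illustration, a stripped-back setup might look something like this (a sketch, assuming the LLamaSharp `ModelParams`/`LLamaWeights` API of roughly this era; the model path is a placeholder):

```csharp
using LLama;
using LLama.Common;

// Most settings (context size, vocab, rope parameters, ...) are read from
// the GGUF metadata, so only override what you have a reason to change.
var parameters = new ModelParams(@"models\mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf")
{
    GpuLayerCount = 18 // 0 = pure CPU; raise until you run out of VRAM
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
```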
Using this, I get the same error: `model_ptr` returns `IntPtr.Zero`.
I tested out that model, but it seems to work perfectly for me on both CPU and GPU.
Thanks @martindevans for your help in debugging this issue. It works! The problem was that it was picking up the llama dll from the cuda11 folder, while I had assumed it was using the one in the cuda11.7.1 folder. I could offload 18 layers to the GPU, and token generation was around 7.5 tokens/sec. Model output was better than Mistral Instruct v0.2 for some of the prompts I tried.
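For anyone else hitting this: the backend packages ship multiple native builds (CPU, different CUDA versions), and the first one the loader finds wins. Something along these lines can pin the exact binary; treat the `NativeLibraryConfig` calls as an assumption about the API around this version, and the path as hypothetical:

```csharp
using LLama.Native;

// Must run before anything else touches LLamaSharp: the native library is
// loaded once on first use and then cached for the lifetime of the process.
NativeLibraryConfig.Instance
    .WithCuda()  // prefer a CUDA build over the CPU fallback
    .WithLibrary(@"runtimes\win-x64\native\cuda11.7.1\llama.dll") // hypothetical path: pin the exact dll
    .WithLogs(); // log which binary the loader actually picked
```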
I'm using CPU inference, so it's slower for me. But as a rough guide it should run at around the same speed as a 13B model, since Mixtral only activates two of its eight experts per token.

Almost all of the parameters should be set automatically (they're baked into the GGUF file). The GPU layer count I don't know much about; as I understand it, you just have to experiment to see how many layers you can fit and what speedup it gets you.
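A sketch of that experiment, assuming the executor API from LLamaSharp (the prompt, token budget, and layer counts are arbitrary):

```csharp
using System.Diagnostics;
using LLama;
using LLama.Common;

// Hypothetical harness: step GpuLayerCount up until the model no longer
// fits in VRAM, timing a short generation at each step.
foreach (var layers in new[] { 0, 8, 16, 24 })
{
    var parameters = new ModelParams(@"models\mixtral.gguf") { GpuLayerCount = layers };
    using var model = LLamaWeights.LoadFromFile(parameters);
    using var context = model.CreateContext(parameters);
    var executor = new InteractiveExecutor(context);

    var timer = Stopwatch.StartNew();
    var tokens = 0;
    await foreach (var _ in executor.InferAsync("Hello!", new InferenceParams { MaxTokens = 32 }))
        tokens++;

    Console.WriteLine($"{layers} GPU layers -> {tokens / timer.Elapsed.TotalSeconds:F1} tokens/sec");
}
```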
Thanks @martindevans. As you said, the GPU layer count setting is more of a "try it and see how many you can fit in your GPU" thing :-)
v0.9.1 added support for Mixtral, so I'll close this issue now.