
Add support for mixtral 8x7b or more generic support for MoE models. #359

Closed
darcome opened this issue Dec 12, 2023 · 12 comments
Labels
enhancement (New feature or request)

Comments

@darcome

darcome commented Dec 12, 2023

I don't know if this is possible, but I'll just create this for the future :)

@AsakusaRinne
Collaborator

There's a discussion in #357. It depends on the llama.cpp implementation, and we'll support it once llama.cpp supports it.

@martindevans added the "Upstream" (Tracking an issue in llama.cpp) label Dec 12, 2023

@AshD

AshD commented Dec 16, 2023

I saw that the llama libraries were updated in the LLamaSharp repo and tried it.

Loading the weights took over a minute and 42 GB out of my 128 GB of memory (80% CPU, 28% GPU), and then it threw a native load failed exception.

Regards,
Ash

@martindevans
Member

Could you link the GGUF file you were trying to use? I'll see if I can reproduce the problem.

@AshD

AshD commented Dec 17, 2023

Thanks @martindevans

GGUF file - https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf

These are my parameters

var p = new ModelParams(modelPath)
{
    ContextSize = 4096,
    GpuLayerCount = 12,
    UseMemoryLock = true,
    UseMemorymap = true,
    Threads = 12,
    BatchSize = 128,
    EmbeddingMode = true
};
var w = LLamaWeights.LoadFromFile(p);

CPU - i9 13th Gen, GPU RTX 4090, 128GB

Thanks,
Ash

@martindevans
Member

I'm downloading it now, but it's going to take a while!

However I've actually been testing with the Q5_K_M model from that same repo, so I'm expecting it to work.

I'd suggest getting rid of all of those parameter options; most of them are set automatically, and you shouldn't need to change them unless you have a good reason to override the defaults. The only one you actually need to set is GpuLayerCount, and I'd suggest setting that to zero as a first test.
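For illustration, a stripped-down load along those lines might look like the sketch below (the model path is a placeholder; ModelParams and LLamaWeights are the same LLamaSharp types used elsewhere in this thread):

using System;
using LLama;
using LLama.Common;

// Minimal sketch of the suggestion above: leave everything at its defaults
// (most values come from the GGUF metadata) and only set GpuLayerCount,
// starting at 0 (pure CPU) as a first test.
var modelPath = @"C:\models\mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"; // placeholder path

var parameters = new ModelParams(modelPath) { GpuLayerCount = 0 };
using var weights = LLamaWeights.LoadFromFile(parameters);

Console.WriteLine("Model loaded.");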

@AshD

AshD commented Dec 17, 2023

Using this:
var p = new ModelParams(modelPath) { GpuLayerCount = 0 };

Same error: model_ptr returns IntPtr.Zero from
var model_ptr = NativeApi.llama_load_model_from_file(modelPath, lparams);

Thanks,
Ash

@martindevans
Member

I tested out that model, and it seems to work perfectly for me on both CPU and GPU.

If NativeApi.llama_load_model_from_file is failing, that would normally indicate a problem with the model file itself or something more fundamental. Have you tried this file with one of the llama.cpp demos?

@AshD

AshD commented Dec 17, 2023

Thanks @martindevans for your help in debugging this issue.

It works! The issue was that it was picking up the llama DLL from the cuda11 folder, while I had assumed it was picking it up from the cuda11.7.1 folder.
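(A quick, hypothetical way to check which native llama library the process can actually see is a directory scan like the sketch below; the cuda11 and cuda11.7.1 folder names are taken from this comment and may differ between backend package versions.)

using System;
using System.IO;

// Hypothetical diagnostic (not from this thread): list every llama native
// library under the application base directory, to spot whether a cuda11 or
// cuda11.7.1 runtime folder would be picked up.
var baseDir = AppContext.BaseDirectory;
foreach (var path in Directory.EnumerateFiles(baseDir, "*llama*.dll", SearchOption.AllDirectories))
{
    Console.WriteLine(path);
}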

I could offload 18 layers to the GPU. Token generation was around 7.5 tokens/sec.
Are you seeing similar numbers? Is there a webpage that you are aware of that has the best parameters to set based on the model?

Model output was better than the Mistral Instruct v0.2 for some of the prompts I tried.

Thanks,
Ash

@martindevans
Member

I'm using CPU inference, so it's slower for me. But as a rough guide it should be around the same speed as a 13B model.

"Is there a webpage that you are aware of that has the best parameters to set based on the model?"

Almost all of the parameters should be automatically set (they're baked into the GGUF file).

I don't know much about the GPU layer count. As I understand it, you just have to experiment to see how many layers you can fit and what speedup you get.
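A rough way to run that experiment, assuming the same ModelParams / LLamaWeights API shown earlier in this thread (layer counts and model path are only illustrative):

using System;
using System.Diagnostics;
using LLama;
using LLama.Common;

// Sketch of the "experiment and see" approach: try a few GpuLayerCount values
// and time how long each load takes. Generation speed is what matters in the
// end, but this at least shows which layer counts load successfully on a given GPU.
var modelPath = @"C:\models\mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"; // placeholder path

foreach (var layers in new[] { 0, 8, 16, 18 })
{
    var stopwatch = Stopwatch.StartNew();
    var p = new ModelParams(modelPath) { GpuLayerCount = layers };
    using var weights = LLamaWeights.LoadFromFile(p);
    Console.WriteLine($"GpuLayerCount={layers}: loaded in {stopwatch.Elapsed.TotalSeconds:F1}s");
}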

@AshD

AshD commented Dec 18, 2023

Thanks @martindevans

As you said, the GPU layer count setting is more of a "try it and see how many layers you can fit in your GPU" kind of thing :-)

@martindevans added the "enhancement" (New feature or request) label and removed the "Upstream" (Tracking an issue in llama.cpp) label Dec 21, 2023
@martindevans
Member

v0.9.1 added support for Mixtral, so I'll close this issue now.
