
Mixtral nf4 performance 2x slower than expected #211

Open
2 of 4 tasks
timohear opened this issue Jan 29, 2024 · 2 comments
Labels
question Further information is requested

Comments

@timohear

System Info

Latest Lorax version

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Compare mistral-7b nf4 perf to mixtral nf4 perf
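The comparison above can be sketched as a small timing harness. This is a minimal sketch: `generate_fn` is a hypothetical stand-in for the actual model call (e.g. a transformers model loaded with bitsandbytes nf4), so the same helper can be pointed at Mistral-7b and Mixtral to compare ms/token.

```python
import time

def ms_per_token(generate_fn, prompt, max_new_tokens=128):
    """Time a generation call and return average milliseconds per token.

    generate_fn is a stand-in for the model's generate call; it must
    return the number of new tokens actually produced.
    """
    start = time.perf_counter()
    n_tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return (elapsed / n_tokens) * 1000.0

# Dummy generator so the harness is runnable stand-alone:
# pretend each token takes roughly 1 ms to decode.
def dummy_generate(prompt, max_new_tokens):
    for _ in range(max_new_tokens):
        time.sleep(0.001)
    return max_new_tokens

latency = ms_per_token(dummy_generate, "Hello", max_new_tokens=50)
print(f"{latency:.1f} ms/token")
```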

Expected behavior

On a single A6000, using bitsandbytes nf4 quantization, I'm seeing 17ms per token with Mistral-7b and 80ms per token with Mixtral.

My expectation is that Mixtral should cost roughly 2x the 7b model (around 40ms per token), since each token only activates 2 of its 8 experts. Current perf levels seem to negate the advantage over a large non-MoE model.
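The 2x expectation follows from a back-of-envelope estimate: memory-bound decode latency scales roughly with the parameters read per token, and Mixtral activates only 2 of its 8 experts. A sketch of that arithmetic (parameter counts are approximate, not from the issue):

```python
# Rough back-of-envelope: decode latency is roughly proportional to the
# number of parameters read per token when memory-bandwidth-bound.
mistral_7b_params = 7.2e9        # approximate
mixtral_active_params = 12.9e9   # ~2 of 8 experts active per token (approximate)

mistral_ms = 17.0                # measured per-token latency above
observed_mixtral_ms = 80.0       # measured per-token latency above

expected_mixtral_ms = mistral_ms * (mixtral_active_params / mistral_7b_params)
slowdown = observed_mixtral_ms / expected_mixtral_ms

print(f"expected ~{expected_mixtral_ms:.0f} ms/token, observed {observed_mixtral_ms:.0f}")
print(f"slowdown vs expectation: {slowdown:.1f}x")
```

With these numbers the expected latency comes out around 30-40ms per token, so the observed 80ms is more than 2x off the estimate.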

@timohear
Author

Note that I'm also seeing this on TGI so it's not a Lorax-specific issue

@tgaddair
Contributor

Hey @timohear, thanks for reporting. I can definitely take a look some time this week. There are some differences in how Mixtral is implemented compared to Mistral that might be contributing to the perf gap, but more thorough benchmarking on our side to see where the bottlenecks are coming from would be a good next step.

@tgaddair tgaddair added the question Further information is requested label Jan 31, 2024