On a single A6000 using bitsandbytes nf4 quantization I'm seeing 17 ms per token with Mistral-7B and 80 ms per token with Mixtral.
Since Mixtral only activates 2 of its 8 experts per token (roughly 13B active parameters), my expectation is that it should run at roughly 2x the latency of the 7B model (around 40 ms per token). The current performance seems to negate its advantage over a large non-MoE model.
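For reference, a minimal sketch of how the per-token numbers can be estimated. This uses plain transformers + bitsandbytes rather than the LoRAX server, and the exact checkpoints and prompt are assumptions, but it should show whether the gap comes from the nf4 kernels themselves or from the serving path:

```python
# Rough per-token decode latency check for nf4-quantized models.
# Assumes transformers, accelerate and bitsandbytes are installed and the
# GPU has enough memory for the Mixtral weights in nf4.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def ms_per_token(model_id: str, new_tokens: int = 64) -> float:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
    # Warm-up pass so CUDA init / kernel compilation is not timed.
    model.generate(**inputs, max_new_tokens=8, do_sample=False)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / new_tokens * 1000

for mid in ("mistralai/Mistral-7B-v0.1", "mistralai/Mixtral-8x7B-v0.1"):
    print(mid, f"{ms_per_token(mid):.1f} ms/token")
```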
Hey @timohear, thanks for reporting. I can definitely take a look some time this week. There are some differences in how Mixtral is implemented compared to Mistral that might be contributing to the perf gap, but more thorough benchmarking on our side to see where the bottlenecks are coming from would be a good next step.
System Info
Latest LoRAX version
Information
Tasks
Reproduction
Compare Mistral-7B nf4 per-token latency to Mixtral nf4 per-token latency (see the sketch below).
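To time the LoRAX server itself, something like the following could be used. This is a sketch: it assumes a LoRAX instance reachable at localhost:8080 with the model already loaded, and the TGI-style /generate payload; adjust the URL and parameters for your deployment.

```python
# Measure end-to-end ms/token against a running LoRAX server (sketch).
import time
import requests

LORAX_URL = "http://127.0.0.1:8080/generate"  # assumed local deployment
NEW_TOKENS = 64

payload = {
    "inputs": "The quick brown fox",
    "parameters": {"max_new_tokens": NEW_TOKENS, "do_sample": False},
}

# Warm-up request so model/kernel initialization is not timed.
requests.post(LORAX_URL, json=payload, timeout=600).raise_for_status()

start = time.perf_counter()
resp = requests.post(LORAX_URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{elapsed_ms / NEW_TOKENS:.1f} ms/token (includes prefill + HTTP overhead)")
```

Running this once per model (restarting the server with Mistral-7B and then Mixtral) gives the two numbers to compare.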
Expected behavior
Mixtral per-token latency roughly 2x that of Mistral-7B under nf4 (around 40 ms per token on a single A6000), rather than the ~80 ms per token currently observed.