🚀 The feature, motivation and pitch
A Triton kernel supporting MoE layers quantized with GPTQ or AWQ was added in #12185.
It is more performant than the current Marlin MoE kernel when there are many small experts, which is why I made it the default for AWQ and GPTQMarlin configs when num_experts > 32 (#13236).
We should also propagate the usage of this kernel to compressed-tensors configs that use mixed precision.
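The dispatch heuristic described above can be sketched as follows. This is a minimal illustration, not vLLM's actual code: `select_moe_kernel`, the returned kernel names, and the threshold constant are all hypothetical; only the "Triton for num_experts > 32 on AWQ/GPTQMarlin" rule comes from the issue text.

```python
# Hypothetical sketch of the kernel-selection rule described above.
# The function name and return values are illustrative, not vLLM's API.
NUM_EXPERTS_THRESHOLD = 32

def select_moe_kernel(quant_method: str, num_experts: int) -> str:
    """Pick a fused-MoE kernel for a quantized MoE layer.

    The Triton GPTQ/AWQ kernel outperforms Marlin when there are many
    small experts, so it is preferred above the threshold; otherwise
    fall back to the Marlin MoE kernel.
    """
    if quant_method in ("awq", "gptq_marlin") and num_experts > NUM_EXPERTS_THRESHOLD:
        return "triton_moe"
    return "marlin_moe"

# Many small experts -> Triton kernel; few experts -> Marlin kernel.
print(select_moe_kernel("awq", 64))  # triton_moe
print(select_moe_kernel("awq", 8))   # marlin_moe
```

Extending the rule to compressed-tensors mixed-precision configs would mean adding that quantization method to the same branch.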
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you have already searched for relevant issues and asked the chatbot in the bottom-right corner of the documentation page, which can answer many frequently asked questions.