[Feature]: Add moe_wna16 kernel as a backend for CompressedTensorsWNA16MoEMethod #13575

Closed · opened by @mgoin

Description

🚀 The feature, motivation and pitch

A Triton implementation supporting MoE layers quantized with GPTQ or AWQ was added in #12185.

It is more performant than the current Marlin MoE kernel when there are many small experts, which is why I made it the default for num_experts > 32 in the AWQ and GPTQMarlin configs in #13236.

We should also propagate the usage of this kernel to compressed-tensors models with mixed precision, i.e. add it as a backend for CompressedTensorsWNA16MoEMethod.
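For illustration, here is a minimal Python sketch of what that dispatch could look like. Every name in it (MoEBackend, USE_MOE_WNA16_THRESHOLD, select_wna16_moe_backend) is hypothetical rather than vLLM's actual API; only the num_experts > 32 threshold comes from #13236.

```python
from enum import Enum


class MoEBackend(Enum):
    # Hypothetical enum, for illustration only.
    MARLIN_MOE = "marlin_moe"
    MOE_WNA16 = "moe_wna16"  # Triton kernel from #12185


# Heuristic from #13236: the Triton kernel wins when there are
# many small experts, so prefer it above this expert count.
USE_MOE_WNA16_THRESHOLD = 32


def select_wna16_moe_backend(num_experts: int) -> MoEBackend:
    """Pick a kernel backend for a WNA16-quantized MoE layer.

    Hypothetical helper: CompressedTensorsWNA16MoEMethod could apply
    the same heuristic the AWQ and GPTQMarlin configs already use.
    """
    if num_experts > USE_MOE_WNA16_THRESHOLD:
        return MoEBackend.MOE_WNA16
    return MoEBackend.MARLIN_MOE


print(select_wna16_moe_backend(num_experts=64))  # MoEBackend.MOE_WNA16
```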

Alternatives

No response

Additional context

No response

Labels

feature request (New feature or request), unstale (Received activity after being labelled stale)