Open
Description
Proposal to improve performance
By reading the relevant parts of the source code and running some tests, we found that when launching a MoE model like Qwen3, vLLM seems to use the Triton-based fused MoE kernel, while other implementations such as CUTLASS or DeepGEMM are only supported on specific GPU architectures (e.g. Hopper) or with specific quantization methods (e.g. compressed-tensors).
Is there a way to specify which fused MoE kernel implementation to use? For example, I might want to compare the performance of the Triton-based and CUTLASS-based implementations on my A100 GPUs.
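For context, here is a minimal, library-agnostic timing harness of the kind such a comparison would need. The `kernel_a` / `kernel_b` callables are hypothetical stand-ins for the real fused MoE kernels, and the harness itself is only a sketch, not vLLM's benchmarking API:

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=20):
    """Time a callable: run a few warmup passes, then report the median latency in seconds."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical stand-ins for two kernel implementations; both must
# produce the same result so the comparison is apples-to-apples.
def kernel_a(x):
    return [v * 2 for v in x]


def kernel_b(x):
    return [v + v for v in x]


data = list(range(1024))
t_a = benchmark(kernel_a, data)
t_b = benchmark(kernel_b, data)
print(f"kernel_a: {t_a * 1e6:.1f} us, kernel_b: {t_b * 1e6:.1f} us")
```

Note that for actual GPU kernels a harness like this would also need to synchronize the device (e.g. `torch.cuda.synchronize()`) around each timed call, since CUDA launches are asynchronous.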
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.