Describe the bug
Hi Triton team,

I have been trying to integrate the matmul_ogs kernel into vLLM. But I noticed that in the reference implementation matmul_ogs_torch (triton/bench/triton_bench/matmul_ogs.py, lines 615 to 618 at e341f7f), after the matmul between the activation and the expert weight, all expert outputs for a given token are simply added together instead of being weighted by rdata.gate_scal. Since matmul_ogs and matmul_ogs_torch are supposed to be equivalent, I think the same applies to matmul_ogs as well? Is that expected behavior, or am I missing something?

Thanks in advance.
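For concreteness, here is a minimal numpy sketch of the two behaviors I mean. The names and shapes are hypothetical (not taken from matmul_ogs itself): I assume gate_scal holds per-expert routing weights for a token, and compare a plain sum of expert outputs against the gate-weighted combine I would expect from an MoE layer.

```python
import numpy as np

# Hypothetical setup: one token routed to 2 experts, hidden dim 2.
x = np.array([1.0, 2.0])                   # token activation
w = np.stack([np.eye(2), 2.0 * np.eye(2)]) # two expert weight matrices
gate_scal = np.array([0.25, 0.75])         # assumed per-expert routing weights

# Per-expert matmul outputs for this token.
per_expert = np.stack([x @ w[e] for e in range(2)])

# What the reference code appears to do: sum the expert outputs.
unscaled = per_expert.sum(axis=0)                       # -> [3.0, 6.0]

# What I would expect: weight each expert's output by its gate scale.
scaled = (gate_scal[:, None] * per_expert).sum(axis=0)  # -> [1.75, 3.5]

print(unscaled, scaled)
```

The two results differ whenever the gate scales are not all 1, which is why I am unsure whether the scaling is meant to happen elsewhere (e.g. fused into the activation or applied by the caller).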
Environment details
Triton 3.3
CPU: Intel Xeon Gold 6126 @ 2.60GHz
GPU: 8x RTX A6000, driver version 560.35.05, CUDA 12.6