@ElizaWszola commented Oct 31, 2025

CUDA kernel and fusion code for fused groupwise FP8-quantized RMS Norm. When `enable_fusion==True`, this fuses RMS Norm with FP8 quantization of the RMS Norm's output into a single kernel.

Testing:

Test the fused op:

```
pytest tests/kernels/core/test_fused_quant_layernorm.py
```

Test the fusion pass:

```
pytest tests/compile/test_fusion.py
```

(tested with both `VLLM_USE_DEEP_GEMM=1` and `VLLM_USE_DEEP_GEMM=0`)

Offline inference

Run with:

```python
from vllm import LLM
from vllm.config import CompilationConfig, PassConfig

llm = LLM(model="Qwen/Qwen3-30B-A3B-FP8",
          compilation_config=CompilationConfig(
              pass_config=PassConfig(
                  enable_fusion=True,
                  enable_noop=True)))
```

(tested with both `VLLM_USE_DEEP_GEMM=1` and `VLLM_USE_DEEP_GEMM=0`; verified that the fused kernel is being produced)

Benchmarking:

Microbenchmark the isolated op:

```
python benchmarks/fused_kernels/layernorm_rms_benchmarks.py
```

Results on H100 (N = num tokens, D = hidden dim, R = with residual, DT = dtype, GS = quantization group shape):
[-------------------------------------------- rms-norm-dynamic-per-token-quant --------------------------------------------]
                                                                  |  unfused_groupwise_fp8_impl  |  fused_groupwise_fp8_impl
1 threads: -----------------------------------------------------------------------------------------------------------------
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             36.2             |            24.7          
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             36.1             |            25.1          
      N 1 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             35.9             |            24.7          
      N 1 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             34.9             |            23.7          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             39.2             |            24.9          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             39.5             |            24.5          
      N 1 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             38.2             |            24.2          
      N 1 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             38.7             |            24.9          
      N 1 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             36.0             |            25.4          
      N 1 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             35.4             |            25.3          
      N 1 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             36.3             |            25.6          
      N 1 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             36.4             |            26.1          
      N 1 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             39.1             |            25.2          
      N 1 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             39.6             |            24.8          
      N 1 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             39.6             |            24.4          
      N 1 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             38.2             |            23.9          
      N 1 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             34.6             |            23.8          
      N 1 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             33.4             |            23.0          
      N 1 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             34.0             |            23.9          
      N 1 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             34.2             |            23.5          
      N 1 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             36.7             |            23.2          
      N 1 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             38.2             |            23.2          
      N 1 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             36.1             |            22.7          
      N 1 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.6             |            23.4          
      N 1 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             34.8             |            23.6          
      N 1 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             33.5             |            23.3          
      N 1 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             33.7             |            23.2          
      N 1 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             33.6             |            23.2          
      N 1 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             37.3             |            23.0          
      N 1 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             36.9             |            22.6          
      N 1 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             36.4             |            22.8          
      N 1 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             36.7             |            22.9          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             33.8             |            23.4          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             34.1             |            23.7          
      N 1 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             34.0             |            24.2          
      N 1 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             35.4             |            24.5          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             38.6             |            24.3          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             39.1             |            23.5          
      N 1 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             37.6             |            23.0          
      N 1 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             37.8             |            23.4          
      N 1 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             34.0             |            23.5          
      N 1 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             34.4             |            23.8          
      N 1 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             34.5             |            23.9          
      N 1 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             34.7             |            23.9          
      N 1 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             38.3             |            24.6          
      N 1 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             38.4             |            24.5          
      N 1 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             38.8             |            25.4          
      N 1 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             39.4             |            24.6          
      N 1 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             35.5             |            25.0          
      N 1 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             36.2             |            26.6          
      N 1 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             36.8             |            25.7          
      N 1 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             36.2             |            25.1          
      N 1 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             38.2             |            24.4          
      N 1 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             39.7             |            24.8          
      N 1 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             38.4             |            23.3          
      N 1 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             37.8             |            24.0          
      N 2 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             35.4             |            24.6          
      N 2 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             35.6             |            24.2          
      N 2 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             35.2             |            24.8          
      N 2 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             36.3             |            24.2          
      N 2 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             38.5             |            24.2          
      N 2 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             38.3             |            24.0          
      N 2 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             38.2             |            23.2          
      N 2 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             37.8             |            23.9          
      N 2 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             34.9             |            23.9          
      N 2 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             35.0             |            24.3          
      N 2 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             35.0             |            24.1          
      N 2 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             35.3             |            23.9          
      N 2 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             38.5             |            23.3          
      N 2 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             38.4             |            23.6          
      N 2 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             37.8             |            23.4          
      N 2 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             38.4             |            24.0          
      N 2 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             35.0             |            24.2          
      N 2 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             34.3             |            23.8          
      N 2 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             34.5             |            24.6          
      N 2 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             35.2             |            24.4          
      N 2 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             37.9             |            23.7          
      N 2 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             37.2             |            23.4          
      N 2 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             36.8             |            22.7          
      N 2 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.9             |            22.6          
      N 2 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.7          
      N 2 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             33.2             |            23.2          
      N 2 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             33.3             |            22.9          
      N 2 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             33.4             |            23.2          
      N 2 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             36.3             |            22.7          
      N 2 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             36.5             |            22.5          
      N 2 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             36.7             |            22.9          
      N 2 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             36.1             |            22.5          
      N 2 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             32.4             |            22.9          
      N 2 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             33.5             |            22.9          
      N 2 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             33.1             |            22.9          
      N 2 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             33.1             |            22.6          
      N 2 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             36.3             |            22.8          
      N 2 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.9             |            22.9          
      N 2 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             36.5             |            22.6          
      N 2 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             36.4             |            22.9          
      N 2 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             33.4             |            23.2          
      N 2 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             33.4             |            23.0          
      N 2 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             33.7             |            22.7          
      N 2 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             33.0             |            22.8          
      N 2 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             37.2             |            23.6          
      N 2 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             37.1             |            23.2          
      N 2 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             37.1             |            23.3          
      N 2 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             36.4             |            23.1          
      N 2 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             33.1             |            23.0          
      N 2 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             33.3             |            23.4          
      N 2 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             33.8             |            23.1          
      N 2 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             33.2             |            23.0          
      N 2 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             36.4             |            22.7          
      N 2 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             36.1             |            22.8          
      N 2 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             36.5             |            23.0          
      N 2 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             36.6             |            22.4          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             32.6             |            22.6          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             32.7             |            22.7          
      N 4 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             32.6             |            22.5          
      N 4 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             33.2             |            22.6          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             35.9             |            22.5          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             36.0             |            22.2          
      N 4 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             35.8             |            22.3          
      N 4 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             35.4             |            22.3          
      N 4 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             32.3             |            22.7          
      N 4 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             32.4             |            22.7          
      N 4 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             33.1             |            22.9          
      N 4 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             33.1             |            22.6          
      N 4 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             36.1             |            22.5          
      N 4 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             35.2             |            22.3          
      N 4 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             35.6             |            22.3          
      N 4 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             36.4             |            22.2          
      N 4 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             32.2             |            22.7          
      N 4 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.8          
      N 4 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             32.6             |            22.9          
      N 4 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             32.7             |            22.6          
      N 4 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             36.0             |            22.6          
      N 4 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             35.9             |            22.7          
      N 4 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             35.8             |            22.3          
      N 4 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.1             |            22.4          
      N 4 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.8          
      N 4 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             32.8             |            22.7          
      N 4 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             32.9             |            22.0          
      N 4 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             32.2             |            22.6          
      N 4 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             35.7             |            22.7          
      N 4 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.5          
      N 4 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             35.8             |            22.2          
      N 4 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             35.4             |            22.5          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             32.5             |            23.0          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             32.5             |            22.8          
      N 4 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             32.7             |            22.6          
      N 4 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             32.8             |            22.5          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             36.0             |            22.4          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.3          
      N 4 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             35.5             |            22.4          
      N 4 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             35.6             |            22.7          
      N 4 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             32.2             |            22.6          
      N 4 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             32.7             |            22.6          
      N 4 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             32.6             |            22.8          
      N 4 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             32.6             |            22.7          
      N 4 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             35.5             |            22.5          
      N 4 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.4          
      N 4 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             35.4             |            22.5          
      N 4 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             35.8             |            22.5          
      N 4 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             32.3             |            22.4          
      N 4 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             32.4             |            22.6          
      N 4 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             32.5             |            22.5          
      N 4 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             32.6             |            22.9          
      N 4 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             35.6             |            22.6          
      N 4 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             35.9             |            22.4          
      N 4 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             35.9             |            22.5          
      N 4 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             35.8             |            22.6          
      N 8 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.5          
      N 8 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.7          
      N 8 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             32.8             |            22.9          
      N 8 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             32.3             |            22.7          
      N 8 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             36.1             |            22.8          
      N 8 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             36.3             |            22.5          
      N 8 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             35.7             |            22.7          
      N 8 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             35.3             |            22.3          
      N 8 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             32.1             |            22.5          
      N 8 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             32.2             |            22.6          
      N 8 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             32.9             |            22.5          
      N 8 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             32.6             |            22.6          
      N 8 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             35.4             |            22.5          
      N 8 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             35.6             |            22.3          
      N 8 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             35.5             |            22.5          
      N 8 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             35.6             |            22.1          
      N 8 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             32.5             |            22.8          
      N 8 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.7          
      N 8 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             33.3             |            22.7          
      N 8 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             33.0             |            23.0          
      N 8 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             35.5             |            22.5          
      N 8 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             36.0             |            22.4          
      N 8 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             36.4             |            22.8          
      N 8 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.5             |            22.6          
      N 8 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             32.6             |            22.2          
      N 8 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             32.7             |            22.7          
      N 8 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             32.8             |            22.6          
      N 8 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             32.8             |            22.5          
      N 8 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             36.5             |            21.8          
      N 8 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.5          
      N 8 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             36.2             |            22.7          
      N 8 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             36.3             |            23.0          
      N 8 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             32.6             |            22.9          
      N 8 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             33.2             |            22.4          
      N 8 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             32.5             |            22.8          
      N 8 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             32.7             |            22.7          
      N 8 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             36.3             |            22.4          
      N 8 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.3             |            22.6          
      N 8 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             36.3             |            22.3          
      N 8 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             35.9             |            22.5          
      N 8 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             32.3             |            22.4          
      N 8 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.5          
      N 8 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             32.4             |            22.3          
      N 8 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             32.2             |            22.3          
      N 8 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             35.7             |            22.5          
      N 8 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             35.9             |            22.8          
      N 8 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             36.5             |            23.1          
      N 8 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             36.5             |            22.8          
      N 8 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.8          
      N 8 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             33.1             |            22.4          
      N 8 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             32.5             |            22.3          
      N 8 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             33.1             |            22.8          
      N 8 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             36.1             |            23.0          
      N 8 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             35.6             |            22.8          
      N 8 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             36.2             |            22.6          
      N 8 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             36.3             |            22.1          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]      |             32.8             |            23.0          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            22.7          
      N 16 x D 1024 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.3          
      N 16 x D 1024 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            23.0          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]     |             36.1             |            22.5          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.8             |            22.5          
      N 16 x D 1024 x R False x DT torch.float32x GS [1, 64]      |             35.6             |            22.2          
      N 16 x D 1024 x R False x DT torch.float32x GS [1, 128]     |             35.6             |            21.8          
      N 16 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]      |             32.3             |            22.3          
      N 16 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            23.0          
      N 16 x D 2048 x R True x DT torch.float32x GS [1, 64]       |             33.0             |            22.7          
      N 16 x D 2048 x R True x DT torch.float32x GS [1, 128]      |             32.9             |            22.7          
      N 16 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]     |             36.0             |            22.8          
      N 16 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]    |             36.3             |            23.0          
      N 16 x D 2048 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            22.6          
      N 16 x D 2048 x R False x DT torch.float32x GS [1, 128]     |             35.9             |            22.5          
      N 16 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]      |             32.6             |            22.7          
      N 16 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]     |             32.6             |            22.6          
      N 16 x D 3072 x R True x DT torch.float32x GS [1, 64]       |             32.5             |            22.6          
      N 16 x D 3072 x R True x DT torch.float32x GS [1, 128]      |             32.2             |            22.5          
      N 16 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]     |             36.6             |            23.1          
      N 16 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]    |             36.2             |            22.9          
      N 16 x D 3072 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            22.6          
      N 16 x D 3072 x R False x DT torch.float32x GS [1, 128]     |             36.2             |            22.8          
      N 16 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]      |             32.5             |            22.8          
      N 16 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]     |             32.6             |            22.4          
      N 16 x D 4096 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.8          
      N 16 x D 4096 x R True x DT torch.float32x GS [1, 128]      |             32.6             |            22.8          
      N 16 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]     |             35.4             |            22.5          
      N 16 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.4          
      N 16 x D 4096 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            23.0          
      N 16 x D 4096 x R False x DT torch.float32x GS [1, 128]     |             36.3             |            22.6          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]      |             32.3             |            22.7          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.3             |            22.5          
      N 16 x D 5120 x R True x DT torch.float32x GS [1, 64]       |             32.2             |            22.9          
      N 16 x D 5120 x R True x DT torch.float32x GS [1, 128]      |             32.9             |            22.6          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]     |             35.6             |            22.6          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.4          
      N 16 x D 5120 x R False x DT torch.float32x GS [1, 64]      |             35.6             |            22.1          
      N 16 x D 5120 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.2          
      N 16 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]      |             32.5             |            22.8          
      N 16 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]     |             32.2             |            22.6          
      N 16 x D 6144 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.5          
      N 16 x D 6144 x R True x DT torch.float32x GS [1, 128]      |             32.6             |            22.8          
      N 16 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]     |             36.3             |            22.9          
      N 16 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]    |             36.1             |            22.6          
      N 16 x D 6144 x R False x DT torch.float32x GS [1, 64]      |             36.2             |            22.7          
      N 16 x D 6144 x R False x DT torch.float32x GS [1, 128]     |             35.4             |            22.2          
      N 16 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]      |             32.7             |            22.3          
      N 16 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            22.6          
      N 16 x D 7168 x R True x DT torch.float32x GS [1, 64]       |             32.9             |            22.9          
      N 16 x D 7168 x R True x DT torch.float32x GS [1, 128]      |             32.2             |            22.4          
      N 16 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]     |             35.4             |            22.6          
      N 16 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]    |             35.4             |            22.7          
      N 16 x D 7168 x R False x DT torch.float32x GS [1, 64]      |             36.3             |            22.7          
      N 16 x D 7168 x R False x DT torch.float32x GS [1, 128]     |             36.8             |            22.7          
      N 32 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]      |             32.7             |            22.6          
      N 32 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.4          
      N 32 x D 1024 x R True x DT torch.float32x GS [1, 64]       |             32.6             |            22.3          
      N 32 x D 1024 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            22.5          
      N 32 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]     |             35.2             |            22.2          
      N 32 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.2          
      N 32 x D 1024 x R False x DT torch.float32x GS [1, 64]      |             35.4             |            22.2          
      N 32 x D 1024 x R False x DT torch.float32x GS [1, 128]     |             35.5             |            22.2          
      N 32 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]      |             32.4             |            22.6          
      N 32 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]     |             32.7             |            22.4          
      N 32 x D 2048 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.4          
      N 32 x D 2048 x R True x DT torch.float32x GS [1, 128]      |             32.1             |            22.5          
      N 32 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]     |             35.5             |            22.3          
      N 32 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.2          
      N 32 x D 2048 x R False x DT torch.float32x GS [1, 64]      |             35.5             |            22.2          
      N 32 x D 2048 x R False x DT torch.float32x GS [1, 128]     |             35.7             |            22.4          
      N 32 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]      |             32.1             |            22.5          
      N 32 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.6          
      N 32 x D 3072 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.7          
      N 32 x D 3072 x R True x DT torch.float32x GS [1, 128]      |             32.6             |            22.2          
      N 32 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]     |             35.3             |            22.3          
      N 32 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.3          
      N 32 x D 3072 x R False x DT torch.float32x GS [1, 64]      |             35.4             |            22.3          
      N 32 x D 3072 x R False x DT torch.float32x GS [1, 128]     |             35.4             |            22.0          
      N 32 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]      |             32.9             |            22.6          
      N 32 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]     |             32.3             |            22.6          
      N 32 x D 4096 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.3          
      N 32 x D 4096 x R True x DT torch.float32x GS [1, 128]      |             32.4             |            22.3          
      N 32 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]     |             35.3             |            22.3          
      N 32 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]    |             35.4             |            22.4          
      N 32 x D 4096 x R False x DT torch.float32x GS [1, 64]      |             35.3             |            22.4          
      N 32 x D 4096 x R False x DT torch.float32x GS [1, 128]     |             36.1             |            22.7          
      N 32 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]      |             33.0             |            22.6          
      N 32 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            22.6          
      N 32 x D 5120 x R True x DT torch.float32x GS [1, 64]       |             32.3             |            22.5          
      N 32 x D 5120 x R True x DT torch.float32x GS [1, 128]      |             32.3             |            22.6          
      N 32 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]     |             36.2             |            22.7          
      N 32 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             36.4             |            22.5          
      N 32 x D 5120 x R False x DT torch.float32x GS [1, 64]      |             35.4             |            22.4          
      N 32 x D 5120 x R False x DT torch.float32x GS [1, 128]     |             36.4             |            22.8          
      N 32 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]      |             33.1             |            23.0          
      N 32 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.7          
      N 32 x D 6144 x R True x DT torch.float32x GS [1, 64]       |             33.2             |            23.0          
      N 32 x D 6144 x R True x DT torch.float32x GS [1, 128]      |             32.9             |            23.4          
      N 32 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]     |             36.4             |            24.1          
      N 32 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]    |             37.6             |            23.8          
      N 32 x D 6144 x R False x DT torch.float32x GS [1, 64]      |             37.1             |            23.8          
      N 32 x D 6144 x R False x DT torch.float32x GS [1, 128]     |             37.2             |            22.4          
      N 32 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]      |             32.8             |            22.7          
      N 32 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            22.6          
      N 32 x D 7168 x R True x DT torch.float32x GS [1, 64]       |             33.2             |            23.1          
      N 32 x D 7168 x R True x DT torch.float32x GS [1, 128]      |             33.4             |            22.7          
      N 32 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]     |             36.5             |            22.9          
      N 32 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]    |             36.3             |            22.5          
      N 32 x D 7168 x R False x DT torch.float32x GS [1, 64]      |             36.7             |            23.3          
      N 32 x D 7168 x R False x DT torch.float32x GS [1, 128]     |             36.2             |            22.9          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]      |             32.7             |            22.7          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            23.4          
      N 64 x D 1024 x R True x DT torch.float32x GS [1, 64]       |             32.8             |            23.0          
      N 64 x D 1024 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            22.7          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]     |             35.5             |            22.9          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             36.1             |            22.6          
      N 64 x D 1024 x R False x DT torch.float32x GS [1, 64]      |             36.4             |            22.6          
      N 64 x D 1024 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.4          
      N 64 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]      |             32.4             |            22.8          
      N 64 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            22.4          
      N 64 x D 2048 x R True x DT torch.float32x GS [1, 64]       |             32.9             |            22.8          
      N 64 x D 2048 x R True x DT torch.float32x GS [1, 128]      |             33.2             |            22.8          
      N 64 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]     |             35.8             |            22.8          
      N 64 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.8          
      N 64 x D 2048 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            22.6          
      N 64 x D 2048 x R False x DT torch.float32x GS [1, 128]     |             36.5             |            22.8          
      N 64 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]      |             32.8             |            23.0          
      N 64 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]     |             33.4             |            23.1          
      N 64 x D 3072 x R True x DT torch.float32x GS [1, 64]       |             33.0             |            22.9          
      N 64 x D 3072 x R True x DT torch.float32x GS [1, 128]      |             33.3             |            23.2          
      N 64 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]     |             36.5             |            22.7          
      N 64 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]    |             36.2             |            22.5          
      N 64 x D 3072 x R False x DT torch.float32x GS [1, 64]      |             35.7             |            22.0          
      N 64 x D 3072 x R False x DT torch.float32x GS [1, 128]     |             35.3             |            22.1          
      N 64 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]      |             31.9             |            22.4          
      N 64 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]     |             32.6             |            22.4          
      N 64 x D 4096 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.5          
      N 64 x D 4096 x R True x DT torch.float32x GS [1, 128]      |             32.5             |            22.7          
      N 64 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]     |             35.7             |            22.4          
      N 64 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]    |             35.7             |            22.7          
      N 64 x D 4096 x R False x DT torch.float32x GS [1, 64]      |             35.9             |            22.5          
      N 64 x D 4096 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.5          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]      |             32.2             |            22.6          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.5             |            22.7          
      N 64 x D 5120 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.5          
      N 64 x D 5120 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            22.7          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]     |             35.7             |            22.4          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             35.8             |            22.5          
      N 64 x D 5120 x R False x DT torch.float32x GS [1, 64]      |             36.0             |            22.8          
      N 64 x D 5120 x R False x DT torch.float32x GS [1, 128]     |             35.9             |            22.7          
      N 64 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]      |             32.3             |            22.5          
      N 64 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.6          
      N 64 x D 6144 x R True x DT torch.float32x GS [1, 64]       |             32.5             |            22.6          
      N 64 x D 6144 x R True x DT torch.float32x GS [1, 128]      |             32.5             |            22.7          
      N 64 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]     |             35.6             |            22.8          
      N 64 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]    |             36.0             |            22.4          
      N 64 x D 6144 x R False x DT torch.float32x GS [1, 64]      |             35.7             |            22.8          
      N 64 x D 6144 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.7          
      N 64 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]      |             32.5             |            22.6          
      N 64 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            22.9          
      N 64 x D 7168 x R True x DT torch.float32x GS [1, 64]       |             33.2             |            23.1          
      N 64 x D 7168 x R True x DT torch.float32x GS [1, 128]      |             32.8             |            22.7          
      N 64 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]     |             35.5             |            22.6          
      N 64 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]    |             35.9             |            22.4          
      N 64 x D 7168 x R False x DT torch.float32x GS [1, 64]      |             36.4             |            22.6          
      N 64 x D 7168 x R False x DT torch.float32x GS [1, 128]     |             36.1             |            22.6          
      N 128 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]     |             32.8             |            23.0          
      N 128 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             32.4             |            22.2          
      N 128 x D 1024 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.5          
      N 128 x D 1024 x R True x DT torch.float32x GS [1, 128]     |             32.6             |            22.5          
      N 128 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]    |             36.0             |            22.5          
      N 128 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             35.9             |            22.6          
      N 128 x D 1024 x R False x DT torch.float32x GS [1, 64]     |             36.4             |            22.7          
      N 128 x D 1024 x R False x DT torch.float32x GS [1, 128]    |             35.9             |            22.5          
      N 128 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]     |             32.6             |            22.8          
      N 128 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]    |             32.8             |            22.6          
      N 128 x D 2048 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.6          
      N 128 x D 2048 x R True x DT torch.float32x GS [1, 128]     |             33.1             |            22.7          
      N 128 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]    |             35.6             |            23.0          
      N 128 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]   |             35.8             |            22.5          
      N 128 x D 2048 x R False x DT torch.float32x GS [1, 64]     |             35.8             |            22.6          
      N 128 x D 2048 x R False x DT torch.float32x GS [1, 128]    |             35.7             |            22.6          
      N 128 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]     |             32.5             |            22.6          
      N 128 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]    |             32.2             |            22.5          
      N 128 x D 3072 x R True x DT torch.float32x GS [1, 64]      |             33.0             |            22.9          
      N 128 x D 3072 x R True x DT torch.float32x GS [1, 128]     |             32.8             |            22.7          
      N 128 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]    |             35.8             |            22.3          
      N 128 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]   |             35.6             |            22.3          
      N 128 x D 3072 x R False x DT torch.float32x GS [1, 64]     |             35.8             |            22.3          
      N 128 x D 3072 x R False x DT torch.float32x GS [1, 128]    |             36.4             |            23.0          
      N 128 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]     |             33.2             |            23.1          
      N 128 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]    |             32.9             |            22.6          
      N 128 x D 4096 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            22.7          
      N 128 x D 4096 x R True x DT torch.float32x GS [1, 128]     |             32.8             |            22.6          
      N 128 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]    |             36.1             |            22.9          
      N 128 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]   |             36.2             |            22.5          
      N 128 x D 4096 x R False x DT torch.float32x GS [1, 64]     |             36.6             |            23.0          
      N 128 x D 4096 x R False x DT torch.float32x GS [1, 128]    |             36.4             |            23.1          
      N 128 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]     |             33.2             |            23.3          
      N 128 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             33.4             |            23.4          
      N 128 x D 5120 x R True x DT torch.float32x GS [1, 64]      |             33.2             |            22.6          
      N 128 x D 5120 x R True x DT torch.float32x GS [1, 128]     |             32.8             |            22.6          
      N 128 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]    |             36.3             |            22.5          
      N 128 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             35.4             |            22.2          
      N 128 x D 5120 x R False x DT torch.float32x GS [1, 64]     |             36.2             |            22.8          
      N 128 x D 5120 x R False x DT torch.float32x GS [1, 128]    |             36.2             |            22.6          
      N 128 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]     |             32.3             |            22.6          
      N 128 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]    |             32.1             |            21.8          
      N 128 x D 6144 x R True x DT torch.float32x GS [1, 64]      |             32.3             |            22.5          
      N 128 x D 6144 x R True x DT torch.float32x GS [1, 128]     |             32.2             |            22.6          
      N 128 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]    |             35.6             |            22.6          
      N 128 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]   |             35.5             |            22.1          
      N 128 x D 6144 x R False x DT torch.float32x GS [1, 64]     |             35.6             |            22.6          
      N 128 x D 6144 x R False x DT torch.float32x GS [1, 128]    |             35.9             |            22.4          
      N 128 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]     |             32.3             |            22.5          
      N 128 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]    |             32.2             |            22.4          
      N 128 x D 7168 x R True x DT torch.float32x GS [1, 64]      |             33.4             |            23.0          
      N 128 x D 7168 x R True x DT torch.float32x GS [1, 128]     |             33.2             |            23.3          
      N 128 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            22.9          
      N 128 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            22.9          
      N 128 x D 7168 x R False x DT torch.float32x GS [1, 64]     |             36.1             |            22.9          
      N 128 x D 7168 x R False x DT torch.float32x GS [1, 128]    |             36.3             |            22.4          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]     |             32.9             |            22.9          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             33.0             |            22.6          
      N 256 x D 1024 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.6          
      N 256 x D 1024 x R True x DT torch.float32x GS [1, 128]     |             32.5             |            22.5          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]    |             35.5             |            22.2          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             35.9             |            22.5          
      N 256 x D 1024 x R False x DT torch.float32x GS [1, 64]     |             36.0             |            22.4          
      N 256 x D 1024 x R False x DT torch.float32x GS [1, 128]    |             36.0             |            22.6          
      N 256 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]     |             32.6             |            22.7          
      N 256 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]    |             33.1             |            22.7          
      N 256 x D 2048 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.7          
      N 256 x D 2048 x R True x DT torch.float32x GS [1, 128]     |             32.5             |            22.4          
      N 256 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]    |             36.0             |            22.7          
      N 256 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]   |             35.5             |            22.4          
      N 256 x D 2048 x R False x DT torch.float32x GS [1, 64]     |             35.6             |            22.4          
      N 256 x D 2048 x R False x DT torch.float32x GS [1, 128]    |             35.6             |            22.4          
      N 256 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]     |             32.2             |            22.5          
      N 256 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]    |             32.5             |            22.5          
      N 256 x D 3072 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.5          
      N 256 x D 3072 x R True x DT torch.float32x GS [1, 128]     |             32.4             |            22.5          
      N 256 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]    |             35.6             |            23.0          
      N 256 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            22.5          
      N 256 x D 3072 x R False x DT torch.float32x GS [1, 64]     |             36.2             |            22.8          
      N 256 x D 3072 x R False x DT torch.float32x GS [1, 128]    |             36.5             |            22.7          
      N 256 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]     |             33.3             |            23.2          
      N 256 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]    |             33.3             |            23.3          
      N 256 x D 4096 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            22.5          
      N 256 x D 4096 x R True x DT torch.float32x GS [1, 128]     |             32.4             |            22.5          
      N 256 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]    |             35.8             |            22.5          
      N 256 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]   |             35.7             |            22.3          
      N 256 x D 4096 x R False x DT torch.float32x GS [1, 64]     |             35.5             |            22.3          
      N 256 x D 4096 x R False x DT torch.float32x GS [1, 128]    |             35.6             |            22.5          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]     |             32.4             |            22.7          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             32.2             |            22.5          
      N 256 x D 5120 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            22.9          
      N 256 x D 5120 x R True x DT torch.float32x GS [1, 128]     |             32.3             |            22.7          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]    |             35.9             |            22.5          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             35.8             |            22.3          
      N 256 x D 5120 x R False x DT torch.float32x GS [1, 64]     |             35.9             |            22.5          
      N 256 x D 5120 x R False x DT torch.float32x GS [1, 128]    |             35.8             |            22.6          
      N 256 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]     |             33.3             |            22.7          
      N 256 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]    |             33.1             |            22.6          
      N 256 x D 6144 x R True x DT torch.float32x GS [1, 64]      |             33.1             |            22.5          
      N 256 x D 6144 x R True x DT torch.float32x GS [1, 128]     |             33.0             |            22.5          
      N 256 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]    |             36.4             |            22.3          
      N 256 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]   |             35.6             |            22.7          
      N 256 x D 6144 x R False x DT torch.float32x GS [1, 64]     |             36.0             |            22.7          
      N 256 x D 6144 x R False x DT torch.float32x GS [1, 128]    |             36.3             |            22.9          
      N 256 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]     |             32.5             |            23.2          
      N 256 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]    |             33.2             |            23.0          
      N 256 x D 7168 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            23.0          
      N 256 x D 7168 x R True x DT torch.float32x GS [1, 128]     |             33.2             |            23.0          
      N 256 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]    |             36.4             |            22.9          
      N 256 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]   |             36.0             |            22.6          
      N 256 x D 7168 x R False x DT torch.float32x GS [1, 64]     |             35.6             |            22.3          
      N 256 x D 7168 x R False x DT torch.float32x GS [1, 128]    |             35.7             |            22.3          
      N 512 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]     |             32.5             |            22.4          
      N 512 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             32.5             |            22.4          
      N 512 x D 1024 x R True x DT torch.float32x GS [1, 64]      |             32.8             |            22.7          
      N 512 x D 1024 x R True x DT torch.float32x GS [1, 128]     |             33.0             |            22.9          
      N 512 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]    |             36.0             |            23.5          
      N 512 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            23.3          
      N 512 x D 1024 x R False x DT torch.float32x GS [1, 64]     |             36.8             |            22.9          
      N 512 x D 1024 x R False x DT torch.float32x GS [1, 128]    |             36.5             |            23.0          
      N 512 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]     |             32.9             |            23.1          
      N 512 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]    |             32.6             |            23.1          
      N 512 x D 2048 x R True x DT torch.float32x GS [1, 64]      |             32.6             |            22.8          
      N 512 x D 2048 x R True x DT torch.float32x GS [1, 128]     |             32.9             |            22.9          
      N 512 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            23.1          
      N 512 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]   |             36.2             |            22.7          
      N 512 x D 2048 x R False x DT torch.float32x GS [1, 64]     |             37.1             |            22.5          
      N 512 x D 2048 x R False x DT torch.float32x GS [1, 128]    |             36.7             |            23.0          
      N 512 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]     |             33.0             |            22.9          
      N 512 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]    |             33.1             |            23.2          
      N 512 x D 3072 x R True x DT torch.float32x GS [1, 64]      |             33.4             |            22.9          
      N 512 x D 3072 x R True x DT torch.float32x GS [1, 128]     |             33.5             |            23.2          
      N 512 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]    |             36.8             |            23.0          
      N 512 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]   |             37.0             |            23.5          
      N 512 x D 3072 x R False x DT torch.float32x GS [1, 64]     |             35.9             |            22.5          
      N 512 x D 3072 x R False x DT torch.float32x GS [1, 128]    |             36.0             |            22.6          
      N 512 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]     |             32.8             |            23.0          
      N 512 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]    |             33.2             |            23.0          
      N 512 x D 4096 x R True x DT torch.float32x GS [1, 64]      |             33.4             |            23.1          
      N 512 x D 4096 x R True x DT torch.float32x GS [1, 128]     |             33.4             |            23.3          
      N 512 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            22.8          
      N 512 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            22.9          
      N 512 x D 4096 x R False x DT torch.float32x GS [1, 64]     |             37.0             |            22.8          
      N 512 x D 4096 x R False x DT torch.float32x GS [1, 128]    |             35.8             |            22.4          
      N 512 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]     |             32.6             |            23.9          
      N 512 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             33.5             |            22.9          
      N 512 x D 5120 x R True x DT torch.float32x GS [1, 64]      |             33.6             |            24.6          
      N 512 x D 5120 x R True x DT torch.float32x GS [1, 128]     |             32.9             |            22.9          
      N 512 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]    |             36.6             |            22.9          
      N 512 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             36.7             |            22.5          
      N 512 x D 5120 x R False x DT torch.float32x GS [1, 64]     |             36.5             |            22.8          
      N 512 x D 5120 x R False x DT torch.float32x GS [1, 128]    |             37.3             |            23.4          
      N 512 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]     |             33.3             |            28.6          
      N 512 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]    |             33.2             |            23.0          
      N 512 x D 6144 x R True x DT torch.float32x GS [1, 64]      |             33.1             |            29.6          
      N 512 x D 6144 x R True x DT torch.float32x GS [1, 128]     |             33.3             |            23.1          
      N 512 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]    |             36.9             |            24.8          
      N 512 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]   |             37.8             |            22.6          
      N 512 x D 6144 x R False x DT torch.float32x GS [1, 64]     |             36.3             |            24.4          
      N 512 x D 6144 x R False x DT torch.float32x GS [1, 128]    |             36.2             |            22.8          
      N 512 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]     |             32.8             |            33.2          
      N 512 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]    |             32.8             |            24.5          
      N 512 x D 7168 x R True x DT torch.float32x GS [1, 64]      |             33.6             |            34.5          
      N 512 x D 7168 x R True x DT torch.float32x GS [1, 128]     |             33.4             |            26.3          
      N 512 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            28.1          
      N 512 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]   |             36.4             |            22.8          
      N 512 x D 7168 x R False x DT torch.float32x GS [1, 64]     |             36.1             |            27.6          
      N 512 x D 7168 x R False x DT torch.float32x GS [1, 128]    |             36.2             |            22.9          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]    |             32.7             |            22.8          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]   |             32.5             |            22.6          
      N 1024 x D 1024 x R True x DT torch.float32x GS [1, 64]     |             32.7             |            23.0          
      N 1024 x D 1024 x R True x DT torch.float32x GS [1, 128]    |             32.8             |            22.8          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]   |             35.7             |            22.5          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]  |             35.8             |            22.6          
      N 1024 x D 1024 x R False x DT torch.float32x GS [1, 64]    |             36.0             |            22.8          
      N 1024 x D 1024 x R False x DT torch.float32x GS [1, 128]   |             36.0             |            22.8          
      N 1024 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]    |             32.8             |            23.0          
      N 1024 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]   |             32.6             |            22.6          
      N 1024 x D 2048 x R True x DT torch.float32x GS [1, 64]     |             32.9             |            22.9          
      N 1024 x D 2048 x R True x DT torch.float32x GS [1, 128]    |             32.7             |            23.0          
      N 1024 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]   |             36.0             |            22.6          
      N 1024 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]  |             35.8             |            22.5          
      N 1024 x D 2048 x R False x DT torch.float32x GS [1, 64]    |             35.9             |            22.5          
      N 1024 x D 2048 x R False x DT torch.float32x GS [1, 128]   |             35.9             |            22.6          
      N 1024 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]    |             32.7             |            26.9          
      N 1024 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]   |             32.7             |            22.8          
      N 1024 x D 3072 x R True x DT torch.float32x GS [1, 64]     |             32.4             |            28.9          
      N 1024 x D 3072 x R True x DT torch.float32x GS [1, 128]    |             33.0             |            23.9          
      N 1024 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]   |             36.2             |            24.4          
      N 1024 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]  |             36.1             |            22.5          
      N 1024 x D 3072 x R False x DT torch.float32x GS [1, 64]    |             36.1             |            24.7          
      N 1024 x D 3072 x R False x DT torch.float32x GS [1, 128]   |             35.9             |            22.9          
      N 1024 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]    |             32.5             |            33.2          
      N 1024 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]   |             32.6             |            27.4          
      N 1024 x D 4096 x R True x DT torch.float32x GS [1, 64]     |             38.8             |            36.3          
      N 1024 x D 4096 x R True x DT torch.float32x GS [1, 128]    |             37.0             |            30.2          
      N 1024 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]   |             35.9             |            29.4          
      N 1024 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]  |             35.7             |            24.0          
      N 1024 x D 4096 x R False x DT torch.float32x GS [1, 64]    |             36.3             |            28.6          
      N 1024 x D 4096 x R False x DT torch.float32x GS [1, 128]   |             36.2             |            22.8          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]    |             33.0             |            42.1          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]   |             32.9             |            33.6          
      N 1024 x D 5120 x R True x DT torch.float32x GS [1, 64]     |             49.9             |            52.7          
      N 1024 x D 5120 x R True x DT torch.float32x GS [1, 128]    |             46.7             |            43.7          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]   |             36.6             |            36.7          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]  |             36.7             |            29.2          
      N 1024 x D 5120 x R False x DT torch.float32x GS [1, 64]    |             36.0             |            35.9          
      N 1024 x D 5120 x R False x DT torch.float32x GS [1, 128]   |             35.7             |            27.9          
      N 1024 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]    |             38.3             |            51.2          
      N 1024 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]   |             32.5             |            39.8          
      N 1024 x D 6144 x R True x DT torch.float32x GS [1, 64]     |             61.6             |            67.2          
      N 1024 x D 6144 x R True x DT torch.float32x GS [1, 128]    |             58.4             |            55.7          
      N 1024 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]   |             36.1             |            43.4          
      N 1024 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]  |             35.8             |            33.9          
      N 1024 x D 6144 x R False x DT torch.float32x GS [1, 64]    |             41.5             |            43.6          
      N 1024 x D 6144 x R False x DT torch.float32x GS [1, 128]   |             37.9             |            33.2          
      N 1024 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]    |             46.9             |            60.4          
      N 1024 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]   |             35.0             |            46.7          
      N 1024 x D 7168 x R True x DT torch.float32x GS [1, 64]     |             73.6             |            78.8          
      N 1024 x D 7168 x R True x DT torch.float32x GS [1, 128]    |             70.3             |            65.0          
      N 1024 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]   |             38.3             |            49.6          
      N 1024 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]  |             37.2             |            38.2          
      N 1024 x D 7168 x R False x DT torch.float32x GS [1, 64]    |             50.4             |            51.4          
      N 1024 x D 7168 x R False x DT torch.float32x GS [1, 128]   |             46.5             |            38.3          

Times are in microseconds (us).
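
For reference, a minimal pure-Python sketch of the operation being benchmarked: RMS norm followed by groupwise FP8 quantization (one scale per contiguous group). This is illustrative only — the real kernel operates on tensors, casts to `float8_e4m3fn`, and fuses both steps in a single CUDA launch; helper names and the scalar layout here are assumptions.

```python
import math

FP8_E4M3_MAX = 448.0  # max finite magnitude representable in float8_e4m3fn

def rms_norm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def group_quant_fp8(y, group_size):
    # one scale per contiguous group of `group_size` elements
    scales, q = [], []
    for g in range(0, len(y), group_size):
        grp = y[g:g + group_size]
        amax = max(abs(v) for v in grp)
        s = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(s)
        # the real kernel additionally casts each scaled value to fp8
        q.extend(v / s for v in grp)
    return q, scales

def fused_rms_norm_group_quant(x, weight, group_size, eps=1e-6):
    return group_quant_fp8(rms_norm(x, weight, eps), group_size)
```

The fused kernel wins mainly by avoiding a round trip of the normalized activations through global memory between the two steps.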

Results of E2E sonnet benchmark of Qwen/Qwen3-30B-A3B-FP8 compared to main (H100):

| input_len | prefix_len | output_len | request_rate | main_ttft | pr_ttft | main_tpot | pr_tpot |
|----------:|-----------:|-----------:|-------------:|----------:|--------:|----------:|--------:|
| 256 | 128 | 128 | 1 | 26.07 | 24.98 | 7.7 | 7.75 |
| 512 | 256 | 256 | 1 | 30.11 | 28.14 | 8.43 | 8.41 |
| 1024 | 512 | 512 | 1 | 40.37 | 40.57 | 10.02 | 9.83 |
| 256 | 128 | 128 | 2.5 | 23.89 | 23.2 | 8.57 | 8.57 |
| 512 | 256 | 256 | 2.5 | 27.41 | 26.94 | 10.09 | 9.98 |
| 1024 | 512 | 512 | 2.5 | 32.23 | 31.77 | 12.65 | 12.32 |
| 256 | 128 | 128 | 5 | 25.5 | 24.12 | 9.97 | 9.79 |
| 512 | 256 | 256 | 5 | 29.46 | 30.77 | 12.22 | 11.98 |
| 1024 | 512 | 512 | 5 | 37.25 | 34.78 | 14.94 | 15.23 |
| 256 | 128 | 128 | 10 | 28.07 | 29.84 | 12.19 | 11.91 |
| 512 | 256 | 256 | 10 | 38.68 | 37.84 | 15.64 | 15.4 |
| 1024 | 512 | 512 | 10 | 69.44 | 67.1 | 22.96 | 22.65 |
| 256 | 128 | 128 | 25 | 36.9 | 35.8 | 14.67 | 14.62 |
| 512 | 256 | 256 | 25 | 87.61 | 79.51 | 27.22 | 26.24 |
| 1024 | 512 | 512 | 25 | 120.67 | 117.49 | 42.95 | 42.53 |
| 256 | 128 | 128 | 50 | 73.54 | 71.44 | 24.6 | 25.25 |
| 512 | 256 | 256 | 50 | 131.41 | 130.21 | 54.35 | 55.16 |
| 1024 | 512 | 512 | 50 | 201.89 | 200.63 | 58.51 | 58.47 |

Signed-off-by: ElizaWszola <ewszola@redhat.com>
@mergify mergify bot added the performance Performance-related issues label Nov 7, 2025
@ElizaWszola ElizaWszola changed the title [Performance] Blockwise quant RMS norm [Performance] Fused blockwise quant RMS norm Nov 7, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256
Member

The optimization of this commit is beneficial:
Before

[-------------------------------------------- rms-norm-dynamic-per-token-quant --------------------------------------------]
                                                                  |  unfused_groupwise_fp8_impl  |  fused_groupwise_fp8_impl
1 threads: -----------------------------------------------------------------------------------------------------------------
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             31.4             |            29.4          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             34.0             |            30.4          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             31.3             |            29.6          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             34.0             |            29.5          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             30.1             |            29.5          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             35.1             |            31.2          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             32.4             |            32.5          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.1             |            30.7          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.6             |            31.4          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.2             |            32.3          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            32.2          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             35.1             |            31.6          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            31.5          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.2             |            32.7          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            31.6          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             36.1             |            32.1          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             32.8             |            32.3          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             36.1             |            32.0          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             32.6             |            32.3          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             35.2             |            31.5          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]   |             31.4             |            39.0          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]  |             35.1             |            36.9          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]   |             31.8             |            53.3          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]  |             35.5             |            49.3   

now

[-------------------------------------------- rms-norm-dynamic-per-token-quant --------------------------------------------]
                                                                  |  unfused_groupwise_fp8_impl  |  fused_groupwise_fp8_impl
1 threads: -----------------------------------------------------------------------------------------------------------------
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             30.9             |            19.6          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             36.5             |            19.4          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             30.5             |            19.6          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.5             |            19.6          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             30.4             |            19.5          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             34.2             |            19.3          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             30.5             |            19.6          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             34.2             |            19.4          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            19.6          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             36.4             |            19.5          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             30.7             |            19.7          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             36.5             |            19.7          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            19.7          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             36.5             |            19.6          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             30.4             |            19.6          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             34.3             |            19.5          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             30.1             |            19.4          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             34.4             |            19.8          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             30.7             |            19.6          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             34.2             |            19.5          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]   |             30.7             |            19.4          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]  |             34.4             |            19.4          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]   |             30.7             |            28.7          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]  |             34.5             |            28.7 

@mergify

mergify bot commented Nov 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 10, 2025
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Collaborator
@ProExpertProg ProExpertProg left a comment

cc @yewentao256 @varun-sundar-rabindranath for kernel review as well

```cpp
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

if (residual.has_value()) {
  if (is_scale_transposed) {
```
Collaborator

I think we have a bool dispatch macro

Contributor Author
I found one in the SM100 CUTLASS file, but it didn't quite do what I needed, so I ended up adding my own macro in dispatch_utils. If it duplicates existing code, please let me know.

Signed-off-by: ElizaWszola <ewszola@redhat.com>
@ElizaWszola
Contributor Author

ElizaWszola commented Dec 5, 2025

@ProExpertProg I've now observed function mismatches in fusion tests when running with deepgemm enabled on H100, so we should either add this support later or delay landing until it's fixed -- what is your opinion on this?

Figured it out now, pushed the fix :)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 5, 2025
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 5, 2025
@ProExpertProg ProExpertProg enabled auto-merge (squash) December 6, 2025 02:04
@ProExpertProg ProExpertProg force-pushed the blockwise-quant-rms-norm branch from e4aa624 to f4a206c Compare December 7, 2025 04:31
@ProExpertProg ProExpertProg merged commit af0444b into vllm-project:main Dec 7, 2025
92 checks passed
@github-project-automation github-project-automation bot moved this from To triage to Done in torch.compile integration Dec 7, 2025
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 7, 2025
penfree pushed a commit to penfree/vllm that referenced this pull request Dec 8, 2025
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
@cjackal
Contributor

cjackal commented Dec 8, 2025

After this PR, Qwen3 VLs (and most likely other FP8 VLMs) fail with the following error:

...
AttributeError: 'Qwen3VLMoeConfig' object has no attribute 'intermediate_size'

which is raised in using_deepgemm, where hf_config.intermediate_size is accessed without a proper guard.

yeqcharlotte added a commit to yeqcharlotte/vllm that referenced this pull request Dec 8, 2025
Summary:
Fix AMD compilation failure for DeepSeek models introduced in vllm-project#27883.

The issue was that RMSNormQuantFusionPass unconditionally creates
FusedAddRMSNormGroupQuantPattern and RMSNormGroupQuantPattern for
group quantization (GroupShape 64 and 128), but the underlying C++
operation per_token_group_fp8_quant is only available on CUDA
(wrapped in #ifndef USE_ROCM in torch_bindings.cpp).

On AMD platforms, this caused an assertion failure:
  AssertionError: unsupported quantization scheme QuantKey(f8e4m3fnuz,scale(f32,dynamic,GroupShape(row=1, col=128)),symmetric)

The fix guards the creation of group quant patterns with
current_platform.is_cuda(), matching the guard used for registering
these keys in QUANT_OPS.

Test Plan:
Waiting for this deepseek job on amd to complete: https://www.internalfb.com/vanguard/serving_test_cases/1967790977283741

Will also wait for external CI

Differential Revision:
D88608586

Privacy Context Container: L1370295
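
The platform guard described in that fix can be sketched roughly as follows. Names are approximations of vLLM's fusion-pass code, and the logic is simplified to the essential condition:

```python
def build_rmsnorm_quant_patterns(is_cuda: bool, group_sizes=(64, 128)):
    """Collect fusion patterns, registering the group-quant variants only
    on CUDA, where per_token_group_fp8_quant is actually compiled in.
    (Simplified sketch; real registration lives in RMSNormQuantFusionPass.)
    """
    patterns = ["RMSNormQuantPattern"]  # per-token/per-tensor: all platforms
    if is_cuda:
        for g in group_sizes:
            patterns.append(f"RMSNormGroupQuantPattern(group_shape=(1, {g}))")
            patterns.append(
                f"FusedAddRMSNormGroupQuantPattern(group_shape=(1, {g}))")
    return patterns
```

Matching the guard to the one used when populating QUANT_OPS keeps pattern creation and op availability in sync across platforms.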
@ElizaWszola
Contributor Author

Hey @cjackal, thanks for finding this issue! I've put up a PR that seems to fix the problem; let me know if you still see any issues with this solution: #30244

mayoohee pushed a commit to mayoohee/vllm that referenced this pull request Dec 9, 2025
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: mayoohee <yiweiii.fang@gmail.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Dec 9, 2025
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>

Labels

nvidia performance Performance-related issues quantization ready ONLY add when PR is ready to merge/full CI is needed torch.compile

Projects

Status: Done
