[Performance] Fused blockwise quant RMS norm #27883

ElizaWszola · 2025-10-31T15:01:19Z

CUDA kernel and fusion code for Fused Groupwise FP8-Quantized RMS Norm. This code allows to fuse RMS Norm + FP8 Quantization of the RMS Norm's output when enable_fusion==True.

Testing:

Test fused op

pytest tests/kernels/core/test_fused_quant_layernorm.py

Test fusion

pytest tests/compile/test_fusion.py

(tested with both VLLM_USE_DEEP_GEMM=1 and VLLM_USE_DEEP_GEMM=0)

Offline inference

Run with

llm = LLM(model="Qwen/Qwen3-30B-A3B-FP8",
        compilation_config=CompilationConfig(
            pass_config=PassConfig(
                enable_fusion=True,
                enable_noop=True)))

(tested with both VLLM_USE_DEEP_GEMM=1 and VLLM_USE_DEEP_GEMM=0, verified that the fused kernel is being produced)

Benchmarking:

Microbenchmark isolated op:

python benchmarks/fused_kernels/layernorm_rms_benchmarks.py

Results on H100 (click to show)

[-------------------------------------------- rms-norm-dynamic-per-token-quant --------------------------------------------]
                                                                  |  unfused_groupwise_fp8_impl  |  fused_groupwise_fp8_impl
1 threads: -----------------------------------------------------------------------------------------------------------------
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             36.2             |            24.7          
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             36.1             |            25.1          
      N 1 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             35.9             |            24.7          
      N 1 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             34.9             |            23.7          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             39.2             |            24.9          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             39.5             |            24.5          
      N 1 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             38.2             |            24.2          
      N 1 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             38.7             |            24.9          
      N 1 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             36.0             |            25.4          
      N 1 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             35.4             |            25.3          
      N 1 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             36.3             |            25.6          
      N 1 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             36.4             |            26.1          
      N 1 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             39.1             |            25.2          
      N 1 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             39.6             |            24.8          
      N 1 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             39.6             |            24.4          
      N 1 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             38.2             |            23.9          
      N 1 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             34.6             |            23.8          
      N 1 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             33.4             |            23.0          
      N 1 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             34.0             |            23.9          
      N 1 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             34.2             |            23.5          
      N 1 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             36.7             |            23.2          
      N 1 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             38.2             |            23.2          
      N 1 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             36.1             |            22.7          
      N 1 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.6             |            23.4          
      N 1 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             34.8             |            23.6          
      N 1 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             33.5             |            23.3          
      N 1 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             33.7             |            23.2          
      N 1 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             33.6             |            23.2          
      N 1 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             37.3             |            23.0          
      N 1 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             36.9             |            22.6          
      N 1 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             36.4             |            22.8          
      N 1 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             36.7             |            22.9          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             33.8             |            23.4          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             34.1             |            23.7          
      N 1 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             34.0             |            24.2          
      N 1 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             35.4             |            24.5          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             38.6             |            24.3          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             39.1             |            23.5          
      N 1 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             37.6             |            23.0          
      N 1 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             37.8             |            23.4          
      N 1 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             34.0             |            23.5          
      N 1 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             34.4             |            23.8          
      N 1 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             34.5             |            23.9          
      N 1 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             34.7             |            23.9          
      N 1 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             38.3             |            24.6          
      N 1 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             38.4             |            24.5          
      N 1 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             38.8             |            25.4          
      N 1 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             39.4             |            24.6          
      N 1 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             35.5             |            25.0          
      N 1 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             36.2             |            26.6          
      N 1 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             36.8             |            25.7          
      N 1 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             36.2             |            25.1          
      N 1 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             38.2             |            24.4          
      N 1 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             39.7             |            24.8          
      N 1 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             38.4             |            23.3          
      N 1 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             37.8             |            24.0          
      N 2 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             35.4             |            24.6          
      N 2 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             35.6             |            24.2          
      N 2 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             35.2             |            24.8          
      N 2 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             36.3             |            24.2          
      N 2 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             38.5             |            24.2          
      N 2 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             38.3             |            24.0          
      N 2 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             38.2             |            23.2          
      N 2 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             37.8             |            23.9          
      N 2 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             34.9             |            23.9          
      N 2 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             35.0             |            24.3          
      N 2 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             35.0             |            24.1          
      N 2 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             35.3             |            23.9          
      N 2 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             38.5             |            23.3          
      N 2 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             38.4             |            23.6          
      N 2 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             37.8             |            23.4          
      N 2 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             38.4             |            24.0          
      N 2 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             35.0             |            24.2          
      N 2 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             34.3             |            23.8          
      N 2 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             34.5             |            24.6          
      N 2 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             35.2             |            24.4          
      N 2 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             37.9             |            23.7          
      N 2 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             37.2             |            23.4          
      N 2 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             36.8             |            22.7          
      N 2 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.9             |            22.6          
      N 2 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.7          
      N 2 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             33.2             |            23.2          
      N 2 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             33.3             |            22.9          
      N 2 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             33.4             |            23.2          
      N 2 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             36.3             |            22.7          
      N 2 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             36.5             |            22.5          
      N 2 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             36.7             |            22.9          
      N 2 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             36.1             |            22.5          
      N 2 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             32.4             |            22.9          
      N 2 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             33.5             |            22.9          
      N 2 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             33.1             |            22.9          
      N 2 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             33.1             |            22.6          
      N 2 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             36.3             |            22.8          
      N 2 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.9             |            22.9          
      N 2 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             36.5             |            22.6          
      N 2 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             36.4             |            22.9          
      N 2 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             33.4             |            23.2          
      N 2 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             33.4             |            23.0          
      N 2 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             33.7             |            22.7          
      N 2 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             33.0             |            22.8          
      N 2 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             37.2             |            23.6          
      N 2 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             37.1             |            23.2          
      N 2 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             37.1             |            23.3          
      N 2 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             36.4             |            23.1          
      N 2 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             33.1             |            23.0          
      N 2 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             33.3             |            23.4          
      N 2 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             33.8             |            23.1          
      N 2 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             33.2             |            23.0          
      N 2 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             36.4             |            22.7          
      N 2 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             36.1             |            22.8          
      N 2 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             36.5             |            23.0          
      N 2 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             36.6             |            22.4          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             32.6             |            22.6          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             32.7             |            22.7          
      N 4 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             32.6             |            22.5          
      N 4 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             33.2             |            22.6          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             35.9             |            22.5          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             36.0             |            22.2          
      N 4 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             35.8             |            22.3          
      N 4 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             35.4             |            22.3          
      N 4 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             32.3             |            22.7          
      N 4 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             32.4             |            22.7          
      N 4 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             33.1             |            22.9          
      N 4 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             33.1             |            22.6          
      N 4 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             36.1             |            22.5          
      N 4 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             35.2             |            22.3          
      N 4 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             35.6             |            22.3          
      N 4 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             36.4             |            22.2          
      N 4 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             32.2             |            22.7          
      N 4 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.8          
      N 4 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             32.6             |            22.9          
      N 4 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             32.7             |            22.6          
      N 4 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             36.0             |            22.6          
      N 4 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             35.9             |            22.7          
      N 4 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             35.8             |            22.3          
      N 4 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.1             |            22.4          
      N 4 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.8          
      N 4 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             32.8             |            22.7          
      N 4 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             32.9             |            22.0          
      N 4 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             32.2             |            22.6          
      N 4 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             35.7             |            22.7          
      N 4 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.5          
      N 4 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             35.8             |            22.2          
      N 4 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             35.4             |            22.5          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             32.5             |            23.0          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             32.5             |            22.8          
      N 4 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             32.7             |            22.6          
      N 4 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             32.8             |            22.5          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             36.0             |            22.4          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.3          
      N 4 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             35.5             |            22.4          
      N 4 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             35.6             |            22.7          
      N 4 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             32.2             |            22.6          
      N 4 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             32.7             |            22.6          
      N 4 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             32.6             |            22.8          
      N 4 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             32.6             |            22.7          
      N 4 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             35.5             |            22.5          
      N 4 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.4          
      N 4 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             35.4             |            22.5          
      N 4 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             35.8             |            22.5          
      N 4 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             32.3             |            22.4          
      N 4 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             32.4             |            22.6          
      N 4 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             32.5             |            22.5          
      N 4 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             32.6             |            22.9          
      N 4 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             35.6             |            22.6          
      N 4 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             35.9             |            22.4          
      N 4 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             35.9             |            22.5          
      N 4 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             35.8             |            22.6          
      N 8 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.5          
      N 8 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.7          
      N 8 x D 1024 x R True x DT torch.float32x GS [1, 64]        |             32.8             |            22.9          
      N 8 x D 1024 x R True x DT torch.float32x GS [1, 128]       |             32.3             |            22.7          
      N 8 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]      |             36.1             |            22.8          
      N 8 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             36.3             |            22.5          
      N 8 x D 1024 x R False x DT torch.float32x GS [1, 64]       |             35.7             |            22.7          
      N 8 x D 1024 x R False x DT torch.float32x GS [1, 128]      |             35.3             |            22.3          
      N 8 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]       |             32.1             |            22.5          
      N 8 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]      |             32.2             |            22.6          
      N 8 x D 2048 x R True x DT torch.float32x GS [1, 64]        |             32.9             |            22.5          
      N 8 x D 2048 x R True x DT torch.float32x GS [1, 128]       |             32.6             |            22.6          
      N 8 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]      |             35.4             |            22.5          
      N 8 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]     |             35.6             |            22.3          
      N 8 x D 2048 x R False x DT torch.float32x GS [1, 64]       |             35.5             |            22.5          
      N 8 x D 2048 x R False x DT torch.float32x GS [1, 128]      |             35.6             |            22.1          
      N 8 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]       |             32.5             |            22.8          
      N 8 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.7          
      N 8 x D 3072 x R True x DT torch.float32x GS [1, 64]        |             33.3             |            22.7          
      N 8 x D 3072 x R True x DT torch.float32x GS [1, 128]       |             33.0             |            23.0          
      N 8 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]      |             35.5             |            22.5          
      N 8 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]     |             36.0             |            22.4          
      N 8 x D 3072 x R False x DT torch.float32x GS [1, 64]       |             36.4             |            22.8          
      N 8 x D 3072 x R False x DT torch.float32x GS [1, 128]      |             36.5             |            22.6          
      N 8 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]       |             32.6             |            22.2          
      N 8 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]      |             32.7             |            22.7          
      N 8 x D 4096 x R True x DT torch.float32x GS [1, 64]        |             32.8             |            22.6          
      N 8 x D 4096 x R True x DT torch.float32x GS [1, 128]       |             32.8             |            22.5          
      N 8 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]      |             36.5             |            21.8          
      N 8 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]     |             35.5             |            22.5          
      N 8 x D 4096 x R False x DT torch.float32x GS [1, 64]       |             36.2             |            22.7          
      N 8 x D 4096 x R False x DT torch.float32x GS [1, 128]      |             36.3             |            23.0          
      N 8 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]       |             32.6             |            22.9          
      N 8 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             33.2             |            22.4          
      N 8 x D 5120 x R True x DT torch.float32x GS [1, 64]        |             32.5             |            22.8          
      N 8 x D 5120 x R True x DT torch.float32x GS [1, 128]       |             32.7             |            22.7          
      N 8 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]      |             36.3             |            22.4          
      N 8 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.3             |            22.6          
      N 8 x D 5120 x R False x DT torch.float32x GS [1, 64]       |             36.3             |            22.3          
      N 8 x D 5120 x R False x DT torch.float32x GS [1, 128]      |             35.9             |            22.5          
      N 8 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]       |             32.3             |            22.4          
      N 8 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]      |             32.6             |            22.5          
      N 8 x D 6144 x R True x DT torch.float32x GS [1, 64]        |             32.4             |            22.3          
      N 8 x D 6144 x R True x DT torch.float32x GS [1, 128]       |             32.2             |            22.3          
      N 8 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]      |             35.7             |            22.5          
      N 8 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]     |             35.9             |            22.8          
      N 8 x D 6144 x R False x DT torch.float32x GS [1, 64]       |             36.5             |            23.1          
      N 8 x D 6144 x R False x DT torch.float32x GS [1, 128]      |             36.5             |            22.8          
      N 8 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]       |             33.2             |            22.8          
      N 8 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]      |             33.1             |            22.4          
      N 8 x D 7168 x R True x DT torch.float32x GS [1, 64]        |             32.5             |            22.3          
      N 8 x D 7168 x R True x DT torch.float32x GS [1, 128]       |             33.1             |            22.8          
      N 8 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]      |             36.1             |            23.0          
      N 8 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]     |             35.6             |            22.8          
      N 8 x D 7168 x R False x DT torch.float32x GS [1, 64]       |             36.2             |            22.6          
      N 8 x D 7168 x R False x DT torch.float32x GS [1, 128]      |             36.3             |            22.1          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]      |             32.8             |            23.0          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            22.7          
      N 16 x D 1024 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.3          
      N 16 x D 1024 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            23.0          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]     |             36.1             |            22.5          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.8             |            22.5          
      N 16 x D 1024 x R False x DT torch.float32x GS [1, 64]      |             35.6             |            22.2          
      N 16 x D 1024 x R False x DT torch.float32x GS [1, 128]     |             35.6             |            21.8          
      N 16 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]      |             32.3             |            22.3          
      N 16 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            23.0          
      N 16 x D 2048 x R True x DT torch.float32x GS [1, 64]       |             33.0             |            22.7          
      N 16 x D 2048 x R True x DT torch.float32x GS [1, 128]      |             32.9             |            22.7          
      N 16 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]     |             36.0             |            22.8          
      N 16 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]    |             36.3             |            23.0          
      N 16 x D 2048 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            22.6          
      N 16 x D 2048 x R False x DT torch.float32x GS [1, 128]     |             35.9             |            22.5          
      N 16 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]      |             32.6             |            22.7          
      N 16 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]     |             32.6             |            22.6          
      N 16 x D 3072 x R True x DT torch.float32x GS [1, 64]       |             32.5             |            22.6          
      N 16 x D 3072 x R True x DT torch.float32x GS [1, 128]      |             32.2             |            22.5          
      N 16 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]     |             36.6             |            23.1          
      N 16 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]    |             36.2             |            22.9          
      N 16 x D 3072 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            22.6          
      N 16 x D 3072 x R False x DT torch.float32x GS [1, 128]     |             36.2             |            22.8          
      N 16 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]      |             32.5             |            22.8          
      N 16 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]     |             32.6             |            22.4          
      N 16 x D 4096 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.8          
      N 16 x D 4096 x R True x DT torch.float32x GS [1, 128]      |             32.6             |            22.8          
      N 16 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]     |             35.4             |            22.5          
      N 16 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.4          
      N 16 x D 4096 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            23.0          
      N 16 x D 4096 x R False x DT torch.float32x GS [1, 128]     |             36.3             |            22.6          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]      |             32.3             |            22.7          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.3             |            22.5          
      N 16 x D 5120 x R True x DT torch.float32x GS [1, 64]       |             32.2             |            22.9          
      N 16 x D 5120 x R True x DT torch.float32x GS [1, 128]      |             32.9             |            22.6          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]     |             35.6             |            22.6          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.4          
      N 16 x D 5120 x R False x DT torch.float32x GS [1, 64]      |             35.6             |            22.1          
      N 16 x D 5120 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.2          
      N 16 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]      |             32.5             |            22.8          
      N 16 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]     |             32.2             |            22.6          
      N 16 x D 6144 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.5          
      N 16 x D 6144 x R True x DT torch.float32x GS [1, 128]      |             32.6             |            22.8          
      N 16 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]     |             36.3             |            22.9          
      N 16 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]    |             36.1             |            22.6          
      N 16 x D 6144 x R False x DT torch.float32x GS [1, 64]      |             36.2             |            22.7          
      N 16 x D 6144 x R False x DT torch.float32x GS [1, 128]     |             35.4             |            22.2          
      N 16 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]      |             32.7             |            22.3          
      N 16 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            22.6          
      N 16 x D 7168 x R True x DT torch.float32x GS [1, 64]       |             32.9             |            22.9          
      N 16 x D 7168 x R True x DT torch.float32x GS [1, 128]      |             32.2             |            22.4          
      N 16 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]     |             35.4             |            22.6          
      N 16 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]    |             35.4             |            22.7          
      N 16 x D 7168 x R False x DT torch.float32x GS [1, 64]      |             36.3             |            22.7          
      N 16 x D 7168 x R False x DT torch.float32x GS [1, 128]     |             36.8             |            22.7          
      N 32 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]      |             32.7             |            22.6          
      N 32 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.4          
      N 32 x D 1024 x R True x DT torch.float32x GS [1, 64]       |             32.6             |            22.3          
      N 32 x D 1024 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            22.5          
      N 32 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]     |             35.2             |            22.2          
      N 32 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.2          
      N 32 x D 1024 x R False x DT torch.float32x GS [1, 64]      |             35.4             |            22.2          
      N 32 x D 1024 x R False x DT torch.float32x GS [1, 128]     |             35.5             |            22.2          
      N 32 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]      |             32.4             |            22.6          
      N 32 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]     |             32.7             |            22.4          
      N 32 x D 2048 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.4          
      N 32 x D 2048 x R True x DT torch.float32x GS [1, 128]      |             32.1             |            22.5          
      N 32 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]     |             35.5             |            22.3          
      N 32 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.2          
      N 32 x D 2048 x R False x DT torch.float32x GS [1, 64]      |             35.5             |            22.2          
      N 32 x D 2048 x R False x DT torch.float32x GS [1, 128]     |             35.7             |            22.4          
      N 32 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]      |             32.1             |            22.5          
      N 32 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.6          
      N 32 x D 3072 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.7          
      N 32 x D 3072 x R True x DT torch.float32x GS [1, 128]      |             32.6             |            22.2          
      N 32 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]     |             35.3             |            22.3          
      N 32 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.3          
      N 32 x D 3072 x R False x DT torch.float32x GS [1, 64]      |             35.4             |            22.3          
      N 32 x D 3072 x R False x DT torch.float32x GS [1, 128]     |             35.4             |            22.0          
      N 32 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]      |             32.9             |            22.6          
      N 32 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]     |             32.3             |            22.6          
      N 32 x D 4096 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.3          
      N 32 x D 4096 x R True x DT torch.float32x GS [1, 128]      |             32.4             |            22.3          
      N 32 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]     |             35.3             |            22.3          
      N 32 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]    |             35.4             |            22.4          
      N 32 x D 4096 x R False x DT torch.float32x GS [1, 64]      |             35.3             |            22.4          
      N 32 x D 4096 x R False x DT torch.float32x GS [1, 128]     |             36.1             |            22.7          
      N 32 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]      |             33.0             |            22.6          
      N 32 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            22.6          
      N 32 x D 5120 x R True x DT torch.float32x GS [1, 64]       |             32.3             |            22.5          
      N 32 x D 5120 x R True x DT torch.float32x GS [1, 128]      |             32.3             |            22.6          
      N 32 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]     |             36.2             |            22.7          
      N 32 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             36.4             |            22.5          
      N 32 x D 5120 x R False x DT torch.float32x GS [1, 64]      |             35.4             |            22.4          
      N 32 x D 5120 x R False x DT torch.float32x GS [1, 128]     |             36.4             |            22.8          
      N 32 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]      |             33.1             |            23.0          
      N 32 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.7          
      N 32 x D 6144 x R True x DT torch.float32x GS [1, 64]       |             33.2             |            23.0          
      N 32 x D 6144 x R True x DT torch.float32x GS [1, 128]      |             32.9             |            23.4          
      N 32 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]     |             36.4             |            24.1          
      N 32 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]    |             37.6             |            23.8          
      N 32 x D 6144 x R False x DT torch.float32x GS [1, 64]      |             37.1             |            23.8          
      N 32 x D 6144 x R False x DT torch.float32x GS [1, 128]     |             37.2             |            22.4          
      N 32 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]      |             32.8             |            22.7          
      N 32 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            22.6          
      N 32 x D 7168 x R True x DT torch.float32x GS [1, 64]       |             33.2             |            23.1          
      N 32 x D 7168 x R True x DT torch.float32x GS [1, 128]      |             33.4             |            22.7          
      N 32 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]     |             36.5             |            22.9          
      N 32 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]    |             36.3             |            22.5          
      N 32 x D 7168 x R False x DT torch.float32x GS [1, 64]      |             36.7             |            23.3          
      N 32 x D 7168 x R False x DT torch.float32x GS [1, 128]     |             36.2             |            22.9          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]      |             32.7             |            22.7          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            23.4          
      N 64 x D 1024 x R True x DT torch.float32x GS [1, 64]       |             32.8             |            23.0          
      N 64 x D 1024 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            22.7          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]     |             35.5             |            22.9          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             36.1             |            22.6          
      N 64 x D 1024 x R False x DT torch.float32x GS [1, 64]      |             36.4             |            22.6          
      N 64 x D 1024 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.4          
      N 64 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]      |             32.4             |            22.8          
      N 64 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]     |             32.9             |            22.4          
      N 64 x D 2048 x R True x DT torch.float32x GS [1, 64]       |             32.9             |            22.8          
      N 64 x D 2048 x R True x DT torch.float32x GS [1, 128]      |             33.2             |            22.8          
      N 64 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]     |             35.8             |            22.8          
      N 64 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]    |             35.5             |            22.8          
      N 64 x D 2048 x R False x DT torch.float32x GS [1, 64]      |             36.1             |            22.6          
      N 64 x D 2048 x R False x DT torch.float32x GS [1, 128]     |             36.5             |            22.8          
      N 64 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]      |             32.8             |            23.0          
      N 64 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]     |             33.4             |            23.1          
      N 64 x D 3072 x R True x DT torch.float32x GS [1, 64]       |             33.0             |            22.9          
      N 64 x D 3072 x R True x DT torch.float32x GS [1, 128]      |             33.3             |            23.2          
      N 64 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]     |             36.5             |            22.7          
      N 64 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]    |             36.2             |            22.5          
      N 64 x D 3072 x R False x DT torch.float32x GS [1, 64]      |             35.7             |            22.0          
      N 64 x D 3072 x R False x DT torch.float32x GS [1, 128]     |             35.3             |            22.1          
      N 64 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]      |             31.9             |            22.4          
      N 64 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]     |             32.6             |            22.4          
      N 64 x D 4096 x R True x DT torch.float32x GS [1, 64]       |             32.4             |            22.5          
      N 64 x D 4096 x R True x DT torch.float32x GS [1, 128]      |             32.5             |            22.7          
      N 64 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]     |             35.7             |            22.4          
      N 64 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]    |             35.7             |            22.7          
      N 64 x D 4096 x R False x DT torch.float32x GS [1, 64]      |             35.9             |            22.5          
      N 64 x D 4096 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.5          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]      |             32.2             |            22.6          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.5             |            22.7          
      N 64 x D 5120 x R True x DT torch.float32x GS [1, 64]       |             32.7             |            22.5          
      N 64 x D 5120 x R True x DT torch.float32x GS [1, 128]      |             32.7             |            22.7          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]     |             35.7             |            22.4          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             35.8             |            22.5          
      N 64 x D 5120 x R False x DT torch.float32x GS [1, 64]      |             36.0             |            22.8          
      N 64 x D 5120 x R False x DT torch.float32x GS [1, 128]     |             35.9             |            22.7          
      N 64 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]      |             32.3             |            22.5          
      N 64 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]     |             32.4             |            22.6          
      N 64 x D 6144 x R True x DT torch.float32x GS [1, 64]       |             32.5             |            22.6          
      N 64 x D 6144 x R True x DT torch.float32x GS [1, 128]      |             32.5             |            22.7          
      N 64 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]     |             35.6             |            22.8          
      N 64 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]    |             36.0             |            22.4          
      N 64 x D 6144 x R False x DT torch.float32x GS [1, 64]      |             35.7             |            22.8          
      N 64 x D 6144 x R False x DT torch.float32x GS [1, 128]     |             35.8             |            22.7          
      N 64 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]      |             32.5             |            22.6          
      N 64 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            22.9          
      N 64 x D 7168 x R True x DT torch.float32x GS [1, 64]       |             33.2             |            23.1          
      N 64 x D 7168 x R True x DT torch.float32x GS [1, 128]      |             32.8             |            22.7          
      N 64 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]     |             35.5             |            22.6          
      N 64 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]    |             35.9             |            22.4          
      N 64 x D 7168 x R False x DT torch.float32x GS [1, 64]      |             36.4             |            22.6          
      N 64 x D 7168 x R False x DT torch.float32x GS [1, 128]     |             36.1             |            22.6          
      N 128 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]     |             32.8             |            23.0          
      N 128 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             32.4             |            22.2          
      N 128 x D 1024 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.5          
      N 128 x D 1024 x R True x DT torch.float32x GS [1, 128]     |             32.6             |            22.5          
      N 128 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]    |             36.0             |            22.5          
      N 128 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             35.9             |            22.6          
      N 128 x D 1024 x R False x DT torch.float32x GS [1, 64]     |             36.4             |            22.7          
      N 128 x D 1024 x R False x DT torch.float32x GS [1, 128]    |             35.9             |            22.5          
      N 128 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]     |             32.6             |            22.8          
      N 128 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]    |             32.8             |            22.6          
      N 128 x D 2048 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.6          
      N 128 x D 2048 x R True x DT torch.float32x GS [1, 128]     |             33.1             |            22.7          
      N 128 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]    |             35.6             |            23.0          
      N 128 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]   |             35.8             |            22.5          
      N 128 x D 2048 x R False x DT torch.float32x GS [1, 64]     |             35.8             |            22.6          
      N 128 x D 2048 x R False x DT torch.float32x GS [1, 128]    |             35.7             |            22.6          
      N 128 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]     |             32.5             |            22.6          
      N 128 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]    |             32.2             |            22.5          
      N 128 x D 3072 x R True x DT torch.float32x GS [1, 64]      |             33.0             |            22.9          
      N 128 x D 3072 x R True x DT torch.float32x GS [1, 128]     |             32.8             |            22.7          
      N 128 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]    |             35.8             |            22.3          
      N 128 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]   |             35.6             |            22.3          
      N 128 x D 3072 x R False x DT torch.float32x GS [1, 64]     |             35.8             |            22.3          
      N 128 x D 3072 x R False x DT torch.float32x GS [1, 128]    |             36.4             |            23.0          
      N 128 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]     |             33.2             |            23.1          
      N 128 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]    |             32.9             |            22.6          
      N 128 x D 4096 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            22.7          
      N 128 x D 4096 x R True x DT torch.float32x GS [1, 128]     |             32.8             |            22.6          
      N 128 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]    |             36.1             |            22.9          
      N 128 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]   |             36.2             |            22.5          
      N 128 x D 4096 x R False x DT torch.float32x GS [1, 64]     |             36.6             |            23.0          
      N 128 x D 4096 x R False x DT torch.float32x GS [1, 128]    |             36.4             |            23.1          
      N 128 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]     |             33.2             |            23.3          
      N 128 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             33.4             |            23.4          
      N 128 x D 5120 x R True x DT torch.float32x GS [1, 64]      |             33.2             |            22.6          
      N 128 x D 5120 x R True x DT torch.float32x GS [1, 128]     |             32.8             |            22.6          
      N 128 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]    |             36.3             |            22.5          
      N 128 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             35.4             |            22.2          
      N 128 x D 5120 x R False x DT torch.float32x GS [1, 64]     |             36.2             |            22.8          
      N 128 x D 5120 x R False x DT torch.float32x GS [1, 128]    |             36.2             |            22.6          
      N 128 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]     |             32.3             |            22.6          
      N 128 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]    |             32.1             |            21.8          
      N 128 x D 6144 x R True x DT torch.float32x GS [1, 64]      |             32.3             |            22.5          
      N 128 x D 6144 x R True x DT torch.float32x GS [1, 128]     |             32.2             |            22.6          
      N 128 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]    |             35.6             |            22.6          
      N 128 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]   |             35.5             |            22.1          
      N 128 x D 6144 x R False x DT torch.float32x GS [1, 64]     |             35.6             |            22.6          
      N 128 x D 6144 x R False x DT torch.float32x GS [1, 128]    |             35.9             |            22.4          
      N 128 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]     |             32.3             |            22.5          
      N 128 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]    |             32.2             |            22.4          
      N 128 x D 7168 x R True x DT torch.float32x GS [1, 64]      |             33.4             |            23.0          
      N 128 x D 7168 x R True x DT torch.float32x GS [1, 128]     |             33.2             |            23.3          
      N 128 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            22.9          
      N 128 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            22.9          
      N 128 x D 7168 x R False x DT torch.float32x GS [1, 64]     |             36.1             |            22.9          
      N 128 x D 7168 x R False x DT torch.float32x GS [1, 128]    |             36.3             |            22.4          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]     |             32.9             |            22.9          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             33.0             |            22.6          
      N 256 x D 1024 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.6          
      N 256 x D 1024 x R True x DT torch.float32x GS [1, 128]     |             32.5             |            22.5          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]    |             35.5             |            22.2          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             35.9             |            22.5          
      N 256 x D 1024 x R False x DT torch.float32x GS [1, 64]     |             36.0             |            22.4          
      N 256 x D 1024 x R False x DT torch.float32x GS [1, 128]    |             36.0             |            22.6          
      N 256 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]     |             32.6             |            22.7          
      N 256 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]    |             33.1             |            22.7          
      N 256 x D 2048 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.7          
      N 256 x D 2048 x R True x DT torch.float32x GS [1, 128]     |             32.5             |            22.4          
      N 256 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]    |             36.0             |            22.7          
      N 256 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]   |             35.5             |            22.4          
      N 256 x D 2048 x R False x DT torch.float32x GS [1, 64]     |             35.6             |            22.4          
      N 256 x D 2048 x R False x DT torch.float32x GS [1, 128]    |             35.6             |            22.4          
      N 256 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]     |             32.2             |            22.5          
      N 256 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]    |             32.5             |            22.5          
      N 256 x D 3072 x R True x DT torch.float32x GS [1, 64]      |             32.5             |            22.5          
      N 256 x D 3072 x R True x DT torch.float32x GS [1, 128]     |             32.4             |            22.5          
      N 256 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]    |             35.6             |            23.0          
      N 256 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            22.5          
      N 256 x D 3072 x R False x DT torch.float32x GS [1, 64]     |             36.2             |            22.8          
      N 256 x D 3072 x R False x DT torch.float32x GS [1, 128]    |             36.5             |            22.7          
      N 256 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]     |             33.3             |            23.2          
      N 256 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]    |             33.3             |            23.3          
      N 256 x D 4096 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            22.5          
      N 256 x D 4096 x R True x DT torch.float32x GS [1, 128]     |             32.4             |            22.5          
      N 256 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]    |             35.8             |            22.5          
      N 256 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]   |             35.7             |            22.3          
      N 256 x D 4096 x R False x DT torch.float32x GS [1, 64]     |             35.5             |            22.3          
      N 256 x D 4096 x R False x DT torch.float32x GS [1, 128]    |             35.6             |            22.5          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]     |             32.4             |            22.7          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             32.2             |            22.5          
      N 256 x D 5120 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            22.9          
      N 256 x D 5120 x R True x DT torch.float32x GS [1, 128]     |             32.3             |            22.7          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]    |             35.9             |            22.5          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             35.8             |            22.3          
      N 256 x D 5120 x R False x DT torch.float32x GS [1, 64]     |             35.9             |            22.5          
      N 256 x D 5120 x R False x DT torch.float32x GS [1, 128]    |             35.8             |            22.6          
      N 256 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]     |             33.3             |            22.7          
      N 256 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]    |             33.1             |            22.6          
      N 256 x D 6144 x R True x DT torch.float32x GS [1, 64]      |             33.1             |            22.5          
      N 256 x D 6144 x R True x DT torch.float32x GS [1, 128]     |             33.0             |            22.5          
      N 256 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]    |             36.4             |            22.3          
      N 256 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]   |             35.6             |            22.7          
      N 256 x D 6144 x R False x DT torch.float32x GS [1, 64]     |             36.0             |            22.7          
      N 256 x D 6144 x R False x DT torch.float32x GS [1, 128]    |             36.3             |            22.9          
      N 256 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]     |             32.5             |            23.2          
      N 256 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]    |             33.2             |            23.0          
      N 256 x D 7168 x R True x DT torch.float32x GS [1, 64]      |             32.9             |            23.0          
      N 256 x D 7168 x R True x DT torch.float32x GS [1, 128]     |             33.2             |            23.0          
      N 256 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]    |             36.4             |            22.9          
      N 256 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]   |             36.0             |            22.6          
      N 256 x D 7168 x R False x DT torch.float32x GS [1, 64]     |             35.6             |            22.3          
      N 256 x D 7168 x R False x DT torch.float32x GS [1, 128]    |             35.7             |            22.3          
      N 512 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]     |             32.5             |            22.4          
      N 512 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             32.5             |            22.4          
      N 512 x D 1024 x R True x DT torch.float32x GS [1, 64]      |             32.8             |            22.7          
      N 512 x D 1024 x R True x DT torch.float32x GS [1, 128]     |             33.0             |            22.9          
      N 512 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]    |             36.0             |            23.5          
      N 512 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            23.3          
      N 512 x D 1024 x R False x DT torch.float32x GS [1, 64]     |             36.8             |            22.9          
      N 512 x D 1024 x R False x DT torch.float32x GS [1, 128]    |             36.5             |            23.0          
      N 512 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]     |             32.9             |            23.1          
      N 512 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]    |             32.6             |            23.1          
      N 512 x D 2048 x R True x DT torch.float32x GS [1, 64]      |             32.6             |            22.8          
      N 512 x D 2048 x R True x DT torch.float32x GS [1, 128]     |             32.9             |            22.9          
      N 512 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            23.1          
      N 512 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]   |             36.2             |            22.7          
      N 512 x D 2048 x R False x DT torch.float32x GS [1, 64]     |             37.1             |            22.5          
      N 512 x D 2048 x R False x DT torch.float32x GS [1, 128]    |             36.7             |            23.0          
      N 512 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]     |             33.0             |            22.9          
      N 512 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]    |             33.1             |            23.2          
      N 512 x D 3072 x R True x DT torch.float32x GS [1, 64]      |             33.4             |            22.9          
      N 512 x D 3072 x R True x DT torch.float32x GS [1, 128]     |             33.5             |            23.2          
      N 512 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]    |             36.8             |            23.0          
      N 512 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]   |             37.0             |            23.5          
      N 512 x D 3072 x R False x DT torch.float32x GS [1, 64]     |             35.9             |            22.5          
      N 512 x D 3072 x R False x DT torch.float32x GS [1, 128]    |             36.0             |            22.6          
      N 512 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]     |             32.8             |            23.0          
      N 512 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]    |             33.2             |            23.0          
      N 512 x D 4096 x R True x DT torch.float32x GS [1, 64]      |             33.4             |            23.1          
      N 512 x D 4096 x R True x DT torch.float32x GS [1, 128]     |             33.4             |            23.3          
      N 512 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            22.8          
      N 512 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]   |             36.3             |            22.9          
      N 512 x D 4096 x R False x DT torch.float32x GS [1, 64]     |             37.0             |            22.8          
      N 512 x D 4096 x R False x DT torch.float32x GS [1, 128]    |             35.8             |            22.4          
      N 512 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]     |             32.6             |            23.9          
      N 512 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             33.5             |            22.9          
      N 512 x D 5120 x R True x DT torch.float32x GS [1, 64]      |             33.6             |            24.6          
      N 512 x D 5120 x R True x DT torch.float32x GS [1, 128]     |             32.9             |            22.9          
      N 512 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]    |             36.6             |            22.9          
      N 512 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             36.7             |            22.5          
      N 512 x D 5120 x R False x DT torch.float32x GS [1, 64]     |             36.5             |            22.8          
      N 512 x D 5120 x R False x DT torch.float32x GS [1, 128]    |             37.3             |            23.4          
      N 512 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]     |             33.3             |            28.6          
      N 512 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]    |             33.2             |            23.0          
      N 512 x D 6144 x R True x DT torch.float32x GS [1, 64]      |             33.1             |            29.6          
      N 512 x D 6144 x R True x DT torch.float32x GS [1, 128]     |             33.3             |            23.1          
      N 512 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]    |             36.9             |            24.8          
      N 512 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]   |             37.8             |            22.6          
      N 512 x D 6144 x R False x DT torch.float32x GS [1, 64]     |             36.3             |            24.4          
      N 512 x D 6144 x R False x DT torch.float32x GS [1, 128]    |             36.2             |            22.8          
      N 512 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]     |             32.8             |            33.2          
      N 512 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]    |             32.8             |            24.5          
      N 512 x D 7168 x R True x DT torch.float32x GS [1, 64]      |             33.6             |            34.5          
      N 512 x D 7168 x R True x DT torch.float32x GS [1, 128]     |             33.4             |            26.3          
      N 512 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]    |             36.5             |            28.1          
      N 512 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]   |             36.4             |            22.8          
      N 512 x D 7168 x R False x DT torch.float32x GS [1, 64]     |             36.1             |            27.6          
      N 512 x D 7168 x R False x DT torch.float32x GS [1, 128]    |             36.2             |            22.9          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 64]    |             32.7             |            22.8          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]   |             32.5             |            22.6          
      N 1024 x D 1024 x R True x DT torch.float32x GS [1, 64]     |             32.7             |            23.0          
      N 1024 x D 1024 x R True x DT torch.float32x GS [1, 128]    |             32.8             |            22.8          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 64]   |             35.7             |            22.5          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]  |             35.8             |            22.6          
      N 1024 x D 1024 x R False x DT torch.float32x GS [1, 64]    |             36.0             |            22.8          
      N 1024 x D 1024 x R False x DT torch.float32x GS [1, 128]   |             36.0             |            22.8          
      N 1024 x D 2048 x R True x DT torch.bfloat16x GS [1, 64]    |             32.8             |            23.0          
      N 1024 x D 2048 x R True x DT torch.bfloat16x GS [1, 128]   |             32.6             |            22.6          
      N 1024 x D 2048 x R True x DT torch.float32x GS [1, 64]     |             32.9             |            22.9          
      N 1024 x D 2048 x R True x DT torch.float32x GS [1, 128]    |             32.7             |            23.0          
      N 1024 x D 2048 x R False x DT torch.bfloat16x GS [1, 64]   |             36.0             |            22.6          
      N 1024 x D 2048 x R False x DT torch.bfloat16x GS [1, 128]  |             35.8             |            22.5          
      N 1024 x D 2048 x R False x DT torch.float32x GS [1, 64]    |             35.9             |            22.5          
      N 1024 x D 2048 x R False x DT torch.float32x GS [1, 128]   |             35.9             |            22.6          
      N 1024 x D 3072 x R True x DT torch.bfloat16x GS [1, 64]    |             32.7             |            26.9          
      N 1024 x D 3072 x R True x DT torch.bfloat16x GS [1, 128]   |             32.7             |            22.8          
      N 1024 x D 3072 x R True x DT torch.float32x GS [1, 64]     |             32.4             |            28.9          
      N 1024 x D 3072 x R True x DT torch.float32x GS [1, 128]    |             33.0             |            23.9          
      N 1024 x D 3072 x R False x DT torch.bfloat16x GS [1, 64]   |             36.2             |            24.4          
      N 1024 x D 3072 x R False x DT torch.bfloat16x GS [1, 128]  |             36.1             |            22.5          
      N 1024 x D 3072 x R False x DT torch.float32x GS [1, 64]    |             36.1             |            24.7          
      N 1024 x D 3072 x R False x DT torch.float32x GS [1, 128]   |             35.9             |            22.9          
      N 1024 x D 4096 x R True x DT torch.bfloat16x GS [1, 64]    |             32.5             |            33.2          
      N 1024 x D 4096 x R True x DT torch.bfloat16x GS [1, 128]   |             32.6             |            27.4          
      N 1024 x D 4096 x R True x DT torch.float32x GS [1, 64]     |             38.8             |            36.3          
      N 1024 x D 4096 x R True x DT torch.float32x GS [1, 128]    |             37.0             |            30.2          
      N 1024 x D 4096 x R False x DT torch.bfloat16x GS [1, 64]   |             35.9             |            29.4          
      N 1024 x D 4096 x R False x DT torch.bfloat16x GS [1, 128]  |             35.7             |            24.0          
      N 1024 x D 4096 x R False x DT torch.float32x GS [1, 64]    |             36.3             |            28.6          
      N 1024 x D 4096 x R False x DT torch.float32x GS [1, 128]   |             36.2             |            22.8          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 64]    |             33.0             |            42.1          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]   |             32.9             |            33.6          
      N 1024 x D 5120 x R True x DT torch.float32x GS [1, 64]     |             49.9             |            52.7          
      N 1024 x D 5120 x R True x DT torch.float32x GS [1, 128]    |             46.7             |            43.7          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 64]   |             36.6             |            36.7          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]  |             36.7             |            29.2          
      N 1024 x D 5120 x R False x DT torch.float32x GS [1, 64]    |             36.0             |            35.9          
      N 1024 x D 5120 x R False x DT torch.float32x GS [1, 128]   |             35.7             |            27.9          
      N 1024 x D 6144 x R True x DT torch.bfloat16x GS [1, 64]    |             38.3             |            51.2          
      N 1024 x D 6144 x R True x DT torch.bfloat16x GS [1, 128]   |             32.5             |            39.8          
      N 1024 x D 6144 x R True x DT torch.float32x GS [1, 64]     |             61.6             |            67.2          
      N 1024 x D 6144 x R True x DT torch.float32x GS [1, 128]    |             58.4             |            55.7          
      N 1024 x D 6144 x R False x DT torch.bfloat16x GS [1, 64]   |             36.1             |            43.4          
      N 1024 x D 6144 x R False x DT torch.bfloat16x GS [1, 128]  |             35.8             |            33.9          
      N 1024 x D 6144 x R False x DT torch.float32x GS [1, 64]    |             41.5             |            43.6          
      N 1024 x D 6144 x R False x DT torch.float32x GS [1, 128]   |             37.9             |            33.2          
      N 1024 x D 7168 x R True x DT torch.bfloat16x GS [1, 64]    |             46.9             |            60.4          
      N 1024 x D 7168 x R True x DT torch.bfloat16x GS [1, 128]   |             35.0             |            46.7          
      N 1024 x D 7168 x R True x DT torch.float32x GS [1, 64]     |             73.6             |            78.8          
      N 1024 x D 7168 x R True x DT torch.float32x GS [1, 128]    |             70.3             |            65.0          
      N 1024 x D 7168 x R False x DT torch.bfloat16x GS [1, 64]   |             38.3             |            49.6          
      N 1024 x D 7168 x R False x DT torch.bfloat16x GS [1, 128]  |             37.2             |            38.2          
      N 1024 x D 7168 x R False x DT torch.float32x GS [1, 64]    |             50.4             |            51.4          
      N 1024 x D 7168 x R False x DT torch.float32x GS [1, 128]   |             46.5             |            38.3          

Times are in microseconds (us).

Results of E2E sonnet benchmark of `Qwen/Qwen3-30B-A3B-FP8` compared to main (H100):

input_len	prefix_len	output_len	request_rate	main_ttft	pr_ttft	main_tpot	pr_tpot
256	128	128	1	26.07	24.98	7.7	7.75
512	256	256	1	30.11	28.14	8.43	8.41
1024	512	512	1	40.37	40.57	10.02	9.83
256	128	128	2.5	23.89	23.2	8.57	8.57
512	256	256	2.5	27.41	26.94	10.09	9.98
1024	512	512	2.5	32.23	31.77	12.65	12.32
256	128	128	5	25.5	24.12	9.97	9.79
512	256	256	5	29.46	30.77	12.22	11.98
1024	512	512	5	37.25	34.78	14.94	15.23
256	128	128	10	28.07	29.84	12.19	11.91
512	256	256	10	38.68	37.84	15.64	15.4
1024	512	512	10	69.44	67.1	22.96	22.65
256	128	128	25	36.9	35.8	14.67	14.62
512	256	256	25	87.61	79.51	27.22	26.24
1024	512	512	25	120.67	117.49	42.95	42.53
256	128	128	50	73.54	71.44	24.6	25.25
512	256	256	50	131.41	130.21	54.35	55.16
1024	512	512	50	201.89	200.63	58.51	58.47

Signed-off-by: ElizaWszola <ewszola@redhat.com>

…or int8 Signed-off-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: yewentao256 <zhyanwentao@126.com>

yewentao256 · 2025-11-07T22:49:23Z

The optimization of this commit is beneficial:
Before

[-------------------------------------------- rms-norm-dynamic-per-token-quant --------------------------------------------]
                                                                  |  unfused_groupwise_fp8_impl  |  fused_groupwise_fp8_impl
1 threads: -----------------------------------------------------------------------------------------------------------------
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             31.4             |            29.4          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             34.0             |            30.4          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             31.3             |            29.6          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             34.0             |            29.5          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             30.1             |            29.5          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             35.1             |            31.2          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             32.4             |            32.5          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.1             |            30.7          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.6             |            31.4          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.2             |            32.3          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             32.8             |            32.2          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             35.1             |            31.6          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            31.5          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             35.2             |            32.7          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            31.6          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             36.1             |            32.1          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             32.8             |            32.3          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             36.1             |            32.0          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             32.6             |            32.3          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             35.2             |            31.5          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]   |             31.4             |            39.0          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]  |             35.1             |            36.9          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]   |             31.8             |            53.3          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]  |             35.5             |            49.3

now

[-------------------------------------------- rms-norm-dynamic-per-token-quant --------------------------------------------]
                                                                  |  unfused_groupwise_fp8_impl  |  fused_groupwise_fp8_impl
1 threads: -----------------------------------------------------------------------------------------------------------------
      N 1 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             30.9             |            19.6          
      N 1 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             36.5             |            19.4          
      N 1 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             30.5             |            19.6          
      N 1 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             36.5             |            19.6          
      N 4 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]      |             30.4             |            19.5          
      N 4 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]     |             34.2             |            19.3          
      N 4 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]      |             30.5             |            19.6          
      N 4 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]     |             34.2             |            19.4          
      N 16 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            19.6          
      N 16 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             36.4             |            19.5          
      N 16 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             30.7             |            19.7          
      N 16 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             36.5             |            19.7          
      N 64 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]     |             31.8             |            19.7          
      N 64 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]    |             36.5             |            19.6          
      N 64 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]     |             30.4             |            19.6          
      N 64 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]    |             34.3             |            19.5          
      N 256 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]    |             30.1             |            19.4          
      N 256 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]   |             34.4             |            19.8          
      N 256 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]    |             30.7             |            19.6          
      N 256 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]   |             34.2             |            19.5          
      N 1024 x D 1024 x R True x DT torch.bfloat16x GS [1, 128]   |             30.7             |            19.4          
      N 1024 x D 1024 x R False x DT torch.bfloat16x GS [1, 128]  |             34.4             |            19.4          
      N 1024 x D 5120 x R True x DT torch.bfloat16x GS [1, 128]   |             30.7             |            28.7          
      N 1024 x D 5120 x R False x DT torch.bfloat16x GS [1, 128]  |             34.5             |            28.7

mergify · 2025-11-10T15:52:53Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: ElizaWszola <ewszola@redhat.com>

…agic/vllm into blockwise-quant-rms-norm Signed-off-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ProExpertProg

cc @yewentao256 @varun-sundar-rabindranath for kernel review as well

tests/compile/test_fusion.py

vllm/compilation/matcher_utils.py

ProExpertProg · 2025-12-04T16:59:51Z

csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu

+  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+
+  if (residual.has_value()) {
+    if (is_scale_transposed) {


I think we have a bool dispatch macro

I found one in SM100 CUTLASS file, but it didn't do quite what I needed it for, so I ended up adding my own macro in dispatch_utils. If it duplicates some already existing code, please lmk

csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu

csrc/quantization/fused_kernels/layernorm_utils.cuh

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ElizaWszola · 2025-12-05T08:59:22Z

@ProExpertProg I've now observed function mismatches in fusion tests when running with deepgemm enabled on H100, so we should either add this support later of delay landing until it's fixed -- what is your opinion on this?

Figured it out now, pushed the fix :)

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: yewentao256 <zhyanwentao@126.com>

cjackal · 2025-12-08T02:40:09Z

After this PR, Qwen3 VLs (and most likely other FP8 VLMs I guess) are failing with the following error:

...
AttributeError: 'Qwen3VLMoeConfig' object has no attribute 'intermediate_size'

which is raised at using_deepgemm checking hf_config.intermediate_size w/o proper guard.

Summary: Fix AMD compilation failure for DeepSeek models introduced in vllm-project#27883. The issue was that RMSNormQuantFusionPass unconditionally creates FusedAddRMSNormGroupQuantPattern and RMSNormGroupQuantPattern for group quantization (GroupShape 64 and 128), but the underlying C++ operation per_token_group_fp8_quant is only available on CUDA (wrapped in #ifndef USE_ROCM in torch_bindings.cpp). On AMD platforms, this caused an assertion failure: AssertionError: unsupported quantization scheme QuantKey(f8e4m3fnuz,scale(f32,dynamic,GroupShape(row=1, col=128)),symmetric) The fix guards the creation of group quant patterns with current_platform.is_cuda(), matching the guard used for registering these keys in QUANT_OPS. Test Plan: Waiting for this deepseek job on amd to complete: https://www.internalfb.com/vanguard/serving_test_cases/1967790977283741 Will also wait for external CI Differential Revision: D88608586 Privacy Context Container: L1370295

ElizaWszola · 2025-12-08T07:49:22Z

Hey @cjackal , thanks for finding this issue! I put up a PR that seems to fix the problem, let me know if you're still seeing any issues with this solution #30244

Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: mayoohee <yiweiii.fang@gmail.com>

Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: yewentao256 <zhyanwentao@126.com>

ElizaWszola added 11 commits October 30, 2025 03:27

Blockwise quant RMS norm

8748f11

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Cleanup

ea9f4db

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Apply quant layer norm fixes from vllm-project#27865, inv scale fix f…

b3a55fd

…or int8 Signed-off-by: ElizaWszola <ewszola@redhat.com>

Cleanup

1e912ee

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Vectorize

051b451

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Unify kernel shapes to fuse

2584f2f

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Fix

9bc68d4

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Cleanup

0fce111

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Scalar scale computation is working again

0fac68c

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Vectorized

294e884

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Test group_size=64, add benchmarks

54ab82f

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify bot added the performance Performance-related issues label Nov 7, 2025

ElizaWszola changed the title ~~[Performance] Blockwise quant RMS norm~~ [Performance] Fused blockwise quant RMS norm Nov 7, 2025

yewentao256 added 2 commits November 7, 2025 09:35

Merge branch 'main' into blockwise-quant-rms-norm

3718bbc

optimize

e00a6d7

Signed-off-by: yewentao256 <zhyanwentao@126.com>

mgoin added quantization nvidia labels Nov 10, 2025

github-project-automation bot added this to NVIDIA Nov 10, 2025

mergify bot added the needs-rebase label Nov 10, 2025

ElizaWszola added 3 commits November 14, 2025 00:30

Add fusion patterns

77e0078

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Merge branch 'blockwise-quant-rms-norm' of https://github.com/neuralm…

3c61951

…agic/vllm into blockwise-quant-rms-norm Signed-off-by: ElizaWszola <ewszola@redhat.com>

Merge branch 'main' into blockwise-quant-rms-norm

0d8c405

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify bot removed the needs-rebase label Nov 14, 2025

ProExpertProg mentioned this pull request Nov 17, 2025

[Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter #25693

Merged

ElizaWszola added 4 commits November 18, 2025 02:03

Account for transposed scales

c63bb1b

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Cleanup fallback code

e8c5563

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Cleanup comments, var names

c745e91

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Transpose scales if needed

949db4d

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ElizaWszola requested a review from pavanimajety as a code owner December 3, 2025 15:19

ElizaWszola added 2 commits December 3, 2025 16:18

Cleanup unused template

92fd8c9

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Merge branch 'main' into blockwise-quant-rms-norm

377d204

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ProExpertProg reviewed Dec 4, 2025

View reviewed changes

Feedback, add gs==64 to fusion tests

06e3645

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ElizaWszola added 2 commits December 5, 2025 09:26

Move type dispatch to dispatch function

416f173

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Fix deepgemm when we use e8m0

b2e2251

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ProExpertProg approved these changes Dec 5, 2025

View reviewed changes

github-project-automation bot moved this to In review in NVIDIA Dec 5, 2025

ProExpertProg added the torch.compile label Dec 5, 2025

github-project-automation bot added this to torch.compile integration Dec 5, 2025

github-project-automation bot moved this to To triage in torch.compile integration Dec 5, 2025

ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 5, 2025

ProExpertProg enabled auto-merge (squash) December 6, 2025 02:04

Merge branch 'luka/lora-passing' into blockwise-quant-rms-norm

f4a206c

ProExpertProg force-pushed the blockwise-quant-rms-norm branch from e4aa624 to f4a206c Compare December 7, 2025 04:31

Merge branch 'main' into blockwise-quant-rms-norm

990bc65

ProExpertProg merged commit af0444b into vllm-project:main Dec 7, 2025
92 checks passed

github-project-automation bot moved this from To triage to Done in torch.compile integration Dec 7, 2025

github-project-automation bot moved this from In review to Done in NVIDIA Dec 7, 2025

yeqcharlotte mentioned this pull request Dec 8, 2025

[ROCm] Guard group quant RMS norm fusion patterns #30239

Merged

ElizaWszola mentioned this pull request Dec 8, 2025

[Bugfix] Fix fusion for VL models #30244

Open

ElizaWszola mentioned this pull request Dec 8, 2025

[Bugfix] Temporarily disable group quant rms norm fusion #30273

Draft

Uh oh!

[Performance] Fused blockwise quant RMS norm #27883

[Performance] Fused blockwise quant RMS norm #27883

Uh oh!

Conversation

ElizaWszola commented Oct 31, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing:

Test fused op

Test fusion

Offline inference

Benchmarking:

Microbenchmark isolated op:

Results of E2E sonnet benchmark of Qwen/Qwen3-30B-A3B-FP8 compared to main (H100):

Uh oh!

yewentao256 commented Nov 7, 2025

Uh oh!

mergify bot commented Nov 10, 2025

Uh oh!

ProExpertProg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ProExpertProg Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

ElizaWszola Dec 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ElizaWszola commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

cjackal commented Dec 8, 2025

Uh oh!

ElizaWszola commented Dec 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ElizaWszola commented Oct 31, 2025 •

edited by github-actions bot

Loading

Results of E2E sonnet benchmark of `Qwen/Qwen3-30B-A3B-FP8` compared to main (H100):

ElizaWszola commented Dec 5, 2025 •

edited

Loading