Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM #13917

mgoin · 2025-02-26T19:19:15Z

python benchmark_fp8_block_dense_gemm_table.py
INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda.
===== STARTING FP8 GEMM BENCHMARK =====
PyTorch version: 2.5.1+cu124
CUDA version: 12.4
Triton version: 3.1.0
Using device: NVIDIA H100 80GB HBM3

===== PERFORMANCE COMPARISON =====

DeepGEMM Implementation:
+------+-------+-------+-----------+--------+--------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   |
+------+-------+-------+-----------+--------+--------+
|    8 |  4096 |  7168 | 102.9     | 4.6    | 286.4  |
|    8 |  7168 | 18432 | 70.8      | 29.8   | 1868.8 |
|    8 | 18432 |  7168 | 69.3      | 30.5   | 1911.8 |
|   64 |  4096 |  7168 | 69.1      | 54.4   | 439.0  |
|   64 |  7168 | 18432 | 69.4      | 243.6  | 1933.6 |
|   64 | 18432 |  7168 | 70.4      | 240.3  | 1917.2 |
|   64 | 24576 |  1536 | 70.1      | 68.9   | 584.6  |
|   64 | 32768 |   512 | 68.4      | 31.4   | 307.1  |
|   64 |  7168 | 16384 | 69.5      | 216.3  | 1718.5 |
|  128 |  4096 |  7168 | 141.1     | 53.3   | 222.1  |
|  128 |  7168 | 18432 | 71.9      | 470.5  | 1896.1 |
|  128 | 18432 |  7168 | 69.3      | 488.2  | 1988.2 |
| 1024 |  4096 |  7168 | 89.7      | 670.1  | 502.5  |
| 1024 | 18432 |  7168 | 279.0     | 969.8  | 635.2  |
| 2048 |  4096 |  7168 | 175.1     | 687.0  | 347.4  |
| 4096 |  4096 |  7168 | 335.4     | 717.0  | 275.1  |
+------+-------+-------+-----------+--------+--------+

vLLM Triton Implementation:
+------+-------+-------+-----------+--------+--------+--------------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   | vs DeepGEMM  |
+------+-------+-------+-----------+--------+--------+--------------+
|    8 |  4096 |  7168 | 74.0      | 6.3    | 398.2  | 1.39x faster |
|    8 |  7168 | 18432 | 89.6      | 23.6   | 1478.1 | 0.79x slower |
|    8 | 18432 |  7168 | 113.2     | 18.7   | 1170.4 | 0.61x slower |
|   64 |  4096 |  7168 | 79.4      | 47.3   | 382.2  | 0.87x slower |
|   64 |  7168 | 18432 | 98.5      | 171.7  | 1363.0 | 0.70x slower |
|   64 | 18432 |  7168 | 119.5     | 141.5  | 1129.4 | 0.59x slower |
|   64 | 24576 |  1536 | 37.6      | 128.4  | 1089.7 | 1.86x faster |
|   64 | 32768 |   512 | 38.7      | 55.5   | 542.6  | 1.77x faster |
|   64 |  7168 | 16384 | 86.1      | 174.5  | 1386.4 | 0.81x slower |
|  128 |  4096 |  7168 | 90.7      | 82.9   | 345.4  | 1.56x faster |
|  128 |  7168 | 18432 | 144.0     | 234.9  | 946.9  | 0.50x slower |
|  128 | 18432 |  7168 | 229.5     | 147.4  | 600.1  | 0.30x slower |
| 1024 |  4096 |  7168 | 242.3     | 248.2  | 186.1  | 0.37x slower |
| 1024 | 18432 |  7168 | 897.8     | 301.4  | 197.4  | 0.31x slower |
| 2048 |  4096 |  7168 | 463.0     | 259.7  | 131.4  | 0.38x slower |
| 4096 |  4096 |  7168 | 901.8     | 266.7  | 102.3  | 0.37x slower |
+------+-------+-------+-----------+--------+--------+--------------+

vLLM CUTLASS Implementation:
+------+-------+-------+-----------+--------+--------+--------------+--------------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   | vs DeepGEMM  | vs Triton    |
+------+-------+-------+-----------+--------+--------+--------------+--------------+
|    8 |  4096 |  7168 | 34.6      | 13.6   | 852.3  | 2.98x faster | 2.14x faster |
|    8 |  7168 | 18432 | 78.9      | 26.8   | 1677.3 | 0.90x slower | 1.13x faster |
|    8 | 18432 |  7168 | 81.2      | 26.0   | 1631.1 | 0.85x slower | 1.39x faster |
|   64 |  4096 |  7168 | 36.9      | 101.9  | 822.9  | 1.87x faster | 2.15x faster |
|   64 |  7168 | 18432 | 87.4      | 193.4  | 1535.2 | 0.79x slower | 1.13x faster |
|   64 | 18432 |  7168 | 85.0      | 199.0  | 1587.6 | 0.83x slower | 1.41x faster |
|   64 | 24576 |  1536 | 28.0      | 172.8  | 1465.8 | 2.51x faster | 1.35x faster |
|   64 | 32768 |   512 | 28.8      | 74.5   | 728.5  | 2.37x faster | 1.34x faster |
|   64 |  7168 | 16384 | 77.9      | 193.0  | 1532.8 | 0.89x slower | 1.11x faster |
|  128 |  4096 |  7168 | 39.1      | 192.4  | 802.0  | 3.61x faster | 2.32x faster |
|  128 |  7168 | 18432 | 93.7      | 360.8  | 1454.2 | 0.77x slower | 1.54x faster |
|  128 | 18432 |  7168 | 85.7      | 394.8  | 1608.0 | 0.81x slower | 2.68x faster |
| 1024 |  4096 |  7168 | 99.7      | 603.1  | 452.2  | 0.90x slower | 2.43x faster |
| 1024 | 18432 |  7168 | 331.3     | 816.7  | 534.9  | 0.84x slower | 2.71x faster |
| 2048 |  4096 |  7168 | 198.3     | 606.6  | 306.7  | 0.88x slower | 2.34x faster |
| 4096 |  4096 |  7168 | 392.2     | 613.2  | 235.3  | 0.86x slower | 2.30x faster |
+------+-------+-------+-----------+--------+--------+--------------+--------------+

===== AVERAGE PERFORMANCE =====
+----------------+------------+----------+---------------+
| Implementation | Avg TFLOPS | Avg GB/s | Avg Time (ms) |
+----------------+------------+----------+---------------+
| DeepGEMM       | 310.98     | 1052.10  | 0.11          |
| vLLM Triton    | 144.30     | 715.60   | 0.23          |
| vLLM CUTLASS   | 286.78     | 1076.67  | 0.11          |
+----------------+------------+----------+---------------+

===== AVERAGE SPEEDUPS =====
+-----------------------------+--------------+
| Comparison                  | Speedup      |
+-----------------------------+--------------+
| DeepGEMM vs vLLM Triton     | 1.71x faster |
| DeepGEMM vs vLLM CUTLASS    | 0.94x slower |
| vLLM CUTLASS vs vLLM Triton | 1.84x faster |
+-----------------------------+--------------+

===== ACCURACY COMPARISON =====
+----------------+-----------------------+
| Implementation | Avg Diff vs Reference |
+----------------+-----------------------+
| DeepGEMM       | 0.000684              |
| vLLM Triton    | 0.000684              |
| vLLM CUTLASS   | 0.000684              |
+----------------+-----------------------+
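For anyone reading the tables: the TFLOPS and GB/s columns look consistent with the usual GEMM accounting of 2·m·n·k floating-point operations and one pass over FP8 A, FP8 B, and BF16 C. The sketch below reproduces the first DeepGEMM row under that assumption. The helper names are made up for illustration, scale tensors are ignored in the byte count, and this is not necessarily how the benchmark script computes its columns.

```python
def gemm_metrics(m, n, k, time_us):
    """Return (TFLOPS, GB/s) for an FP8 x FP8 -> BF16 GEMM of shape (m, n, k)."""
    flops = 2 * m * n * k                    # one multiply and one add per MAC
    # Assumed traffic: FP8 A and B at 1 byte/element, BF16 C at 2 bytes/element.
    bytes_moved = m * k + n * k + 2 * m * n
    seconds = time_us * 1e-6
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


def relative_speed(baseline_us, candidate_us):
    """Format baseline/candidate time; a ratio above 1 means the candidate is faster."""
    ratio = baseline_us / candidate_us
    label = "faster" if ratio > 1 else "slower"
    return f"{ratio:.2f}x {label}"


# First DeepGEMM row: m=8, n=4096, k=7168 at 102.9 us.
tflops, gbps = gemm_metrics(8, 4096, 7168, 102.9)
print(f"{tflops:.1f} TFLOPS, {gbps:.1f} GB/s")

# Triton (74.0 us) relative to DeepGEMM (102.9 us) for the same shape.
print(relative_speed(102.9, 74.0))
```

Note that a label such as "0.79x slower" in the tables is the same time ratio, just below 1; it does not mean "slower by a factor of 0.79".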

Signed-off-by: mgoin <mgoin64@gmail.com>


benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py

houseroad · 2025-02-26T19:59:17Z

Does the number reported match the numbers reported in their repo?

mgoin · 2025-02-26T20:21:08Z

Does the number reported match the numbers reported in their repo?

It seems DeepGEMM's strong performance is limited to the shapes they support. I am looking into this.

benchislett · 2025-02-26T20:26:18Z

Is it possible to reproduce the setup and configuration they used in their benchmark?

We test all shapes potentially used in DeepSeek-V3/R1 inference (including both prefilling and decoding, but without tensor parallelism) on H800 SXM5 with NVCC 12.8. All speedup metrics are calculated in comparison to our internally and carefully optimized implementation based on CUTLASS 3.6.

The summary claims the following for one of the sample shapes M=128, N=4096, K=7168:

533 TFLOPS | 2221 GB/s | 2.0x

whereas this benchmark has the following result:

=== Benchmarking shape: m=128, n=4096, k=7168 ===
Running correctness check...
DeepGEMM vs Reference difference: 0.000682
vLLM Triton vs Reference difference: 0.000682
vLLM CUTLASS vs Reference difference: 0.000682
vLLM Triton vs DeepGEMM difference: 0.000007
vLLM CUTLASS vs DeepGEMM difference: 0.000007
DeepGEMM: 0.109 ms, 68.76 TFLOPS
vLLM Triton: 0.091 ms, 82.65 TFLOPS
vLLM CUTLASS: 0.039 ms, 190.49 TFLOPS
DeepGEMM is 0.83x slower than vLLM Triton
DeepGEMM is 0.36x slower than vLLM CUTLASS
vLLM CUTLASS is 2.30x faster than vLLM Triton

It seems there might be some inconsistency between the benchmarking setups here...
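To put the gap in concrete terms, a claimed TFLOPS figure can be converted back into an implied kernel time. This is a rough sketch assuming the 2·m·n·k FLOP count; the function name is hypothetical.

```python
def implied_time_us(m, n, k, tflops):
    """Kernel time in microseconds implied by a TFLOPS figure, assuming 2*m*n*k FLOPs."""
    flops = 2 * m * n * k
    return flops / (tflops * 1e12) * 1e6


shape = (128, 4096, 7168)
claimed_us = implied_time_us(*shape, tflops=533.0)    # figure quoted from their summary
measured_us = implied_time_us(*shape, tflops=68.76)   # DeepGEMM result in this benchmark
print(f"claimed: {claimed_us:.1f} us, measured: {measured_us:.1f} us")
print(f"gap: {measured_us / claimed_us:.1f}x")
```

Under this accounting the claimed figure corresponds to roughly 14 μs per call, against the roughly 109 μs measured here, so the difference is far larger than run-to-run noise.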

LucasWilkinson · 2025-02-26T21:34:46Z

do you know how slow get_col_major_tma_aligned_tensor is? we can probably update per_token_group_quant_fp8 to handle this (if we do then I can also vectorize the scale loads in the cutlass implementation)
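One way to answer this kind of question is to time the preprocessing step on its own and again as part of the full call. The harness below is a CPU-only sketch: `prepare_scales`, `gemm_only`, and `end_to_end` are hypothetical stand-ins that just sleep, not the real kernels. On a GPU the timed region would also need a `torch.cuda.synchronize()` before reading the clock.

```python
import time


def mean_time_us(fn, warmup=3, iters=20):
    """Average wall-clock time of fn() in microseconds after a few warmup calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e6


# Hypothetical stand-ins: sleeps model the cost of each stage.
def prepare_scales():
    time.sleep(0.001)   # e.g. quantization or a scale-layout conversion


def gemm_only():
    time.sleep(0.002)   # the matmul itself


def end_to_end():
    prepare_scales()
    gemm_only()


prep = mean_time_us(prepare_scales)
gemm = mean_time_us(gemm_only)
total = mean_time_us(end_to_end)
print(f"prep: {prep:.0f} us, gemm: {gemm:.0f} us, end-to-end: {total:.0f} us")
```

Reporting all three numbers makes it clear whether a benchmark difference comes from the GEMM or from the work done before it.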

mgoin · 2025-02-26T21:46:01Z

I made a prettier script and updated the table above. The numbers will look a bit worse than their results because I am including the quantization of the input tensor, whereas they exclude it. I was able to improve their results a decent bit by using our existing Triton kernel for input quantization; however, it is still behind CUTLASS except for large M. The larger issue I face is that I see many failures to compile for the shapes that they report on.

For instance:

=== Benchmarking shape: m=4096, n=7168, k=16384 ===
Running correctness check...
/home/mgoin/.deep_gemm/cache/kernel.gemm_fp8_fp8_bf16_nt.25e0f4716b93/kernel.cu:8:10: fatal error: cutlass/cutlass.h: No such file or directory
    8 | #include "cutlass/cutlass.h"
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
Traceback (most recent call last):
  File "/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm_table.py", line 396, in <module>
    run_benchmarks(verbose=True)
  File "/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm_table.py", line 269, in run_benchmarks
    result = benchmark_shape(m, n, k, verbose=verbose)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm_table.py", line 109, in benchmark_shape
    C_deepgemm = deepgemm_gemm()
                 ^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm_table.py", line 79, in deepgemm_gemm
    deep_gemm.gemm_fp8_fp8_bf16_nt((A_deepgemm, A_scale_aligned),
  File "/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/DeepGEMM/deep_gemm/jit_kernels/gemm.py", line 156, in gemm_fp8_fp8_bf16_nt
    runtime = jit_tuner.compile_and_tune(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/DeepGEMM/deep_gemm/jit_kernels/tuner.py", line 40, in compile_and_tune
    kernels.append((build(name, arg_defs, code), tuned_keys))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/DeepGEMM/deep_gemm/jit/compiler.py", line 139, in build
    assert subprocess.check_call(command) == 0, f'Failed to compile {src_path}'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/cuda-12.5/bin/nvcc', '/home/mgoin/.deep_gemm/cache/kernel.gemm_fp8_fp8_bf16_nt.25e0f4716b93/kernel.cu', '-o', '/home/mgoin/.deep_gemm/tmp/nvcc.tmp.6d4720bb-9e87-4f2b-bfae-dd365f216cf6.82df462b59a9.so', '-std=c++17', '-shared', '-O3', '--expt-relaxed-constexpr', '--expt-extended-lambda', '-gencode=arch=compute_90a,code=sm_90a', '--ptxas-options=--register-usage-level=10', '--diag-suppress=177,174,940', '--compiler-options=-fPIC,-O3,-Wno-deprecated-declarations,-Wno-abi', '-I/home/mgoin/code/vllm/benchmarks/kernels/deepgemm/DeepGEMM/deep_gemm/jit/../include']' returned non-zero exit status 1.

The cutlass error is strange since several other shapes in this file work fine.
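The nvcc command in the traceback passes a single `-I` directory, so one quick check is whether `cutlass/cutlass.h` is reachable from it. The helper below is a hypothetical sketch for that check. It only looks at the directories it is given (not the compiler's default search paths), and it does not explain why other shapes compile.

```python
from pathlib import Path


def missing_headers(include_dirs, headers=("cutlass/cutlass.h",)):
    """Return the headers that cannot be found under any of the given include dirs."""
    missing = []
    for header in headers:
        if not any((Path(d) / header).is_file() for d in include_dirs):
            missing.append(header)
    return missing


# Example with a placeholder path; substitute the directory from the -I flag.
print(missing_headers(["/path/to/deep_gemm/include"]))
```

An empty list means every header was found; anything else names the headers the JIT compile step would fail on.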

mgoin · 2025-02-26T21:59:03Z

@houseroad @LucasWilkinson @benchislett please check the latest results. For specific shapes it seems I am able to get close to their results

houseroad · 2025-02-26T22:15:13Z

Wondering if we can try more shapes provided from their side. Also curious about the Grouped GEMM comparison?

LucasWilkinson · 2025-02-26T23:37:27Z

Hmmm ya this is a bit inconclusive/underwhelming given that the CUTLASS blockwise kernels haven't really been tuned yet; with the exception of #12978 (which was actually targeted at H800, not H100), we use the same M tile size for everything. I was hoping it would be a bit more conclusive in either direction (but I guess that was wishful thinking given these are GEMMs after all haha)

Feels like this makes it hard to say if it's worth focusing more energy on DeepGEMM or on tuning the blockwise kernels (and using smaller instruction shapes via transposing once this lands)

We'd probably have an easier time tuning DeepGEMM though since CUTLASS currently has some restrictions on tile size

ProphetPeng · 2025-02-27T02:14:56Z

benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py

+        # A_deepgemm, A_scale_deepgemm = per_token_cast_to_fp8(A)
+        A_deepgemm, A_scale_deepgemm = per_token_group_quant_fp8(
+            A, block_size[1])
+        A_scale_aligned = get_col_major_tma_aligned_tensor(A_scale_deepgemm)


Can you replace it with per_token_group_quant_fp8(A, block_size[1], column_major_scales=True)?

This seems to make no difference
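For context on what the two layouts mean, here is a plain-Python sketch of per-token-group scales and their memory order. It assumes the common scheme of one scale per (token, group) equal to the group's max absolute value divided by the FP8 E4M3 maximum of 448. It is an illustration only, not vLLM's kernel, and it omits the epsilon clamp and any TMA alignment padding.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def group_scales(rows, group_size):
    """One scale per (token, group): max |x| in the group divided by the FP8 max."""
    scales = []
    for row in rows:
        groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
        scales.append([max(abs(x) for x in g) / FP8_E4M3_MAX for g in groups])
    return scales


def flatten(scales, column_major):
    """Memory order of the (tokens x groups) scale matrix."""
    if column_major:
        # Scales for the same group across tokens sit next to each other.
        return [row[g] for g in range(len(scales[0])) for row in scales]
    # Row-major: scales for the same token sit next to each other.
    return [s for row in scales for s in row]


rows = [[1.0, -2.0, 3.0, 4.0], [8.0, 0.5, -16.0, 2.0]]  # 2 tokens, k=4
scales = group_scales(rows, group_size=2)
print(flatten(scales, column_major=False))
print(flatten(scales, column_major=True))
```

Both layouts hold the same values; only the order in memory differs, which is why producing column-major scales directly could replace a separate layout-conversion step.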

youkaichao · 2025-02-27T03:40:21Z

i think we can make this pr a draft (to save ci cost), since there's nothing we need to test in ci?

benchislett · 2025-02-27T14:39:34Z

How much effort would it be to plug this in and make some end-to-end benchmarks?

mgoin · 2025-02-27T23:14:27Z

@benchislett please see #13996

As noted here, microbenchmark performance is not yet good (except for very specific sizes), so we need to figure out how to fix this first.

houseroad · 2025-03-03T07:25:55Z

Btw, shall we land these benchmark scripts? We may reuse them to expand to other kernel libraries.

mgoin · 2025-03-05T02:09:12Z

@houseroad Yes I think we should land this benchmark, please review!

(vllm-project#14130) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [Docs] Mention `model_impl` arg when explaining Transformers fallback (vllm-project#14552) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] support image embeds (vllm-project#13955) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [Kernel] Add more dtype support for GGUF kernels (vllm-project#14043) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com> Signed-off-by: SzymonOzog <szymon.ozog@gmail.com> * [Doc] Update PaliGemma note to a warning (vllm-project#14565) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * V1 rocm support (#469) * Initial commit for V1 successfull compilation * Small improvement for linear * Small improvement for linear * making use of forward_cuda for all except ROPE in llama --------- Co-authored-by: maleksan85 <maleksan@amd.com> * nightly_fixed_aiter_integration_final_20250305 README update (#470) * nightly_fixed_aiter_integration_final_20250305 README update (perf results only) * Update Docker Manifest git hash * Update Docker Manifest and added nightly_fixed_aiter_integration_final_20250305 * some more updates * Update AITER section with example * Updated AITER command with larger batch size and model name * Fixing typo * Removed --max-model-len in AITER command * Updating AITER instructions * typo * Another typo * Whitespace * modifying whats new section * Another typo --------- Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> --------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com> Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> 

Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM

8aa430d

Signed-off-by: mgoin <mgoin64@gmail.com>

ywang96 reviewed Feb 26, 2025

benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py (outdated, resolved)

Fix DeepGEMM compare

83bf65a

Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin added 2 commits February 26, 2025 21:43

Update

7963a9b

Signed-off-by: mgoin <mgoin64@gmail.com>

Update

c95c6ad

Signed-off-by: mgoin <mgoin64@gmail.com>

More update!

1a6f96e

Signed-off-by: mgoin <mgoin64@gmail.com>

ProphetPeng reviewed Feb 27, 2025

mgoin marked this pull request as draft February 27, 2025 04:12

Update by removing quantization overhead

b287504

Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin mentioned this pull request Feb 27, 2025

[Kernel] Integrate DeepGEMM dense block fp8 #13996

Closed

mgoin added 2 commits March 4, 2025 20:40

Merge branch 'main' into benchmark-deepgemm-dense-gemm

b1a00a1

Update

07c8762

Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin marked this pull request as ready for review March 5, 2025 02:08

mgoin added the performance Performance-related issues label Mar 5, 2025

weedge mentioned this pull request Mar 5, 2025

feat: add vllm deploy modal inference test ai-bot-pro/achatbot#126

Merged

simon-mo merged commit ca100c9 into vllm-project:main Mar 6, 2025
18 checks passed

mgoin deleted the benchmark-deepgemm-dense-gemm branch March 6, 2025 01:16

hmellor mentioned this pull request Apr 2, 2025

[Performance]: 0.8.1 vs 0.7.4dev122 R1 H20 performance benchmark test，0.8.1 What is the reason for the 14% performance improvement(throughput tokens/s) #15881

Closed

1 task

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (vllm-projec…

f89d144

…t#13917) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (vllm-projec…

a4f51d2

…t#13917) Signed-off-by: mgoin <mgoin64@gmail.com>

Conversation

mgoin commented Feb 26, 2025 • edited by github-actions bot

github-actions bot commented Feb 26, 2025

houseroad commented Feb 26, 2025

mgoin commented Feb 26, 2025

benchislett commented Feb 26, 2025 • edited

LucasWilkinson commented Feb 26, 2025

mgoin commented Feb 26, 2025 • edited

mgoin commented Feb 26, 2025

houseroad commented Feb 26, 2025

LucasWilkinson commented Feb 26, 2025 • edited

ProphetPeng Feb 27, 2025

mgoin Feb 27, 2025

youkaichao commented Feb 27, 2025

benchislett commented Feb 27, 2025

mgoin commented Feb 27, 2025

houseroad commented Mar 3, 2025

mgoin commented Mar 5, 2025
