Conversation

@czhu-cohere (Contributor) commented on Aug 29, 2025

Purpose

The W4A8 kernels require a special encoding/pre-processing step (one for the weights, one for the scales). The current CUTLASS implementation walks the values one by one on the CPU, which is slow and causes startup to take ~10 minutes for a 100B model.

The simple CUDA implementation is much faster for this specific op and reduces total model load time from ~10 min to <1 min. The main logic builds a lookup table keyed on full bytes, so we remap one byte (two packed 4-bit values) at a time, where each 4-bit value v in 1..7 maps to (8 - v) and 0, 8..15 are untouched. (A similar approach could be implemented for the scales pre-processing, but that is less of a bottleneck than the weights, and waiting <1 min on startup seems reasonable.)
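Below is a minimal sketch of the byte-wise lookup-table idea, not the actual kernel from this PR: the names (encode_nibble, build_lut, encode_w4a8_sketch) and the launch configuration are hypothetical, and it assumes the packed int4 weights are laid out two nibbles per byte.

```cuda
// Hypothetical sketch of the byte-LUT encoding described above.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Encode one 4-bit value: 1..7 -> (8 - v); 0 and 8..15 pass through.
__host__ __device__ inline uint8_t encode_nibble(uint8_t v) {
    return (v >= 1 && v <= 7) ? uint8_t(8 - v) : v;
}

// 256-entry table: applies the nibble mapping to both halves of a byte,
// so the kernel translates two int4 values with a single lookup.
__constant__ uint8_t kByteLut[256];

static void build_lut(uint8_t lut[256]) {
    for (int b = 0; b < 256; ++b) {
        uint8_t lo = encode_nibble(uint8_t(b & 0x0F));
        uint8_t hi = encode_nibble(uint8_t((b >> 4) & 0x0F));
        lut[b] = uint8_t((hi << 4) | lo);
    }
}

// Grid-stride loop: each thread re-encodes whole bytes via the table.
__global__ void encode_w4a8_sketch(const uint8_t* in, uint8_t* out, size_t n) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += size_t(gridDim.x) * blockDim.x) {
        out[i] = kByteLut[in[i]];
    }
}

int main() {
    uint8_t host_lut[256];
    build_lut(host_lut);
    cudaMemcpyToSymbol(kByteLut, host_lut, sizeof(host_lut));

    // Smoke test: 0x12 packs nibbles (1, 2) -> (7, 6) == 0x76;
    // 0x8F has both nibbles in the untouched range and is unchanged.
    const size_t n = 4;
    uint8_t h_in[n] = {0x12, 0x00, 0x8F, 0x7A}, h_out[n];
    uint8_t *d_in, *d_out;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);
    encode_w4a8_sketch<<<1, 128>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n, cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < n; ++i) printf("%02X -> %02X\n", h_in[i], h_out[i]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Since every byte maps independently, the device pass is essentially a table-driven copy, which is consistent with the load-time drop reported below.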

Test Plan

Correctness - the existing test (pytest tests/kernels/quantization/test_cutlass_w4a8.py) should cover it; if the encoding step is wrong, the results will be wrong.

Speedup - look at the logs from vllm serve and compare the startup time for Command A (111B params).

E2E - compare gsm8k lm-eval scores before and after.

Test Result

pytest tests/kernels/quantization/test_cutlass_w4a8.py 
======================================================================== test session starts =========================================================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: /root/vllm
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 121 items                                                                                                                                                  

tests/kernels/quantization/test_cutlass_w4a8.py .............................................................................................................. [ 90%]
...........                                                                                                                                                    [100%]

======================================================================== 121 passed in 52.64s ========================================================================

startup time

before
[gpu_model_runner.py:2018] Model loading took 62.5527 GiB and 558.104425 seconds

after
[gpu_model_runner.py:1980] Model loading took 62.5522 GiB and 44.892353 seconds

lm-eval gsm8k

before: 0.8627748294
after: 0.862774829

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
@czhu-cohere marked this pull request as ready for review on August 29, 2025 23:56
@mgoin (Member) left a comment:

Nice and clean, thanks for the work!

@mgoin added the kernel label on Sep 10, 2025
@mgoin enabled auto-merge (squash) on September 10, 2025 22:36
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 10, 2025
@czhu-cohere (Contributor, Author) commented:

@mgoin Thanks for the review, I think the failure is unrelated, OK to merge?

@vllm-bot merged commit 3c068c6 into vllm-project:main on Sep 17, 2025
71 of 73 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
Signed-off-by: charlifu <charlifu@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>