Conversation

@czhu-cohere (Contributor) commented on Aug 29, 2025

Purpose

The W4A8 kernels require a special encoding/pre-processing step (one for the weights, one for the scales). The current CUTLASS implementation walks the values one by one on the CPU, which is slow and causes startup to take ~10 minutes for a 100B model.

The simple CUDA implementation is much faster for this specific op and reduces total model load time from ~10 min to <1 min. The main logic builds a lookup table keyed on full bytes, so we remap one byte (two packed 4-bit values) at a time, where each 4-bit value v in 1..7 maps to (8 - v) and 0, 8..15 are untouched. (A similar approach could be implemented for the scales pre-processing, but that is less of a bottleneck than the weights, and waiting <1 min on startup seems reasonable.)
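Below is a minimal sketch of the byte-wise lookup-table idea, not the actual kernel from this PR: the names (encode_nibble, build_lut, encode_w4a8_sketch) and the launch configuration are hypothetical, and it assumes the packed int4 weights are laid out two nibbles per byte.

```cuda
// Hypothetical sketch of the byte-LUT encoding described above.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Encode one 4-bit value: 1..7 -> (8 - v); 0 and 8..15 pass through.
__host__ __device__ inline uint8_t encode_nibble(uint8_t v) {
    return (v >= 1 && v <= 7) ? uint8_t(8 - v) : v;
}

// 256-entry table: applies the nibble mapping to both halves of a byte,
// so the kernel translates two int4 values with a single lookup.
__constant__ uint8_t kByteLut[256];

static void build_lut(uint8_t lut[256]) {
    for (int b = 0; b < 256; ++b) {
        uint8_t lo = encode_nibble(uint8_t(b & 0x0F));
        uint8_t hi = encode_nibble(uint8_t((b >> 4) & 0x0F));
        lut[b] = uint8_t((hi << 4) | lo);
    }
}

// Grid-stride loop: each thread re-encodes whole bytes via the table.
__global__ void encode_w4a8_sketch(const uint8_t* in, uint8_t* out, size_t n) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += size_t(gridDim.x) * blockDim.x) {
        out[i] = kByteLut[in[i]];
    }
}

int main() {
    uint8_t host_lut[256];
    build_lut(host_lut);
    cudaMemcpyToSymbol(kByteLut, host_lut, sizeof(host_lut));

    // Smoke test: 0x12 packs nibbles (1, 2) -> (7, 6) == 0x76;
    // 0x8F has both nibbles in the untouched range and is unchanged.
    const size_t n = 4;
    uint8_t h_in[n] = {0x12, 0x00, 0x8F, 0x7A}, h_out[n];
    uint8_t *d_in, *d_out;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);
    encode_w4a8_sketch<<<1, 128>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n, cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < n; ++i) printf("%02X -> %02X\n", h_in[i], h_out[i]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Since every byte maps independently, the device pass is essentially a table-driven copy, which is consistent with the load-time drop reported below.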

Test Plan

Correctness - the existing test (pytest tests/kernels/quantization/test_cutlass_w4a8.py) should cover it; if the encoding step is wrong, the results will be wrong.

Speedup - look at the logs from vllm serve and compare the startup time for Command A (111B params).

E2E - compare gsm8k lm-eval scores before and after.

Test Result

pytest tests/kernels/quantization/test_cutlass_w4a8.py 
======================================================================== test session starts =========================================================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: /root/vllm
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 121 items                                                                                                                                                  

tests/kernels/quantization/test_cutlass_w4a8.py .............................................................................................................. [ 90%]
...........                                                                                                                                                    [100%]

======================================================================== 121 passed in 52.64s ========================================================================

startup time

before
[gpu_model_runner.py:2018] Model loading took 62.5527 GiB and 558.104425 seconds

after
[gpu_model_runner.py:1980] Model loading took 62.5522 GiB and 44.892353 seconds

lm-eval gsm8k

before: 0.8627748294
after: 0.862774829

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
@czhu-cohere marked this pull request as ready for review on August 29, 2025 23:56
@mgoin (Member) left a comment:

Nice and clean, thanks for the work!

@mgoin added the kernel label on Sep 10, 2025
@mgoin enabled auto-merge (squash) on September 10, 2025 22:36
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 10, 2025
@czhu-cohere (Contributor, Author) commented:

@mgoin Thanks for the review, I think the failure is unrelated, OK to merge?

@vllm-bot merged commit 3c068c6 into vllm-project:main on Sep 17, 2025
71 of 73 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
Signed-off-by: charlifu <charlifu@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>