[ROCm] [Feature] [Doc] [Dockerfile] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing #12499

tjtanaa · 2025-01-28T05:59:33Z

Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing

Note: This feature requires ROCm 6.3 or later and an MI300 or later GPU architecture.

Description

This PR makes the following enhancements:

It adds support for Per-Token-Activation Per-Channel-Weight FP8 (PTPC-FP8) quantized inference.
The model is quantized on the fly from BFloat16 to FP8. Model weights that are stored in Float16 need to be cast to BFloat16.
It uses PyTorch's latest rowwise scaled GEMM feature in torch._scaled_mm, introduced in "[ROCm] hipblaslt rowwise f8 gemm" (pytorch/pytorch#144432), which speeds up the current naive implementation by at least 2 times. For more details, see the Performance section.

To support this feature, the PyTorch repo commit in Dockerfile.rocm_base has been updated to 3a585126.
Dockerfile.rocm is left untouched because its base image is pulled from the AMD Docker Hub registry. At this point in time, that base image already has PyTorch repo commit 3a585126 installed.

Small enhancement: the documentation has been updated to ROCm 6.3, and the commits in the installation steps have been updated to match the commits in Dockerfile.rocm_base.
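The per-token-activation, per-channel-weight scaling described above can be illustrated with a minimal pure-Python sketch. This is not the vLLM or PyTorch implementation: `fake_fp8`, `quantize_rows`, and `ptpc_matmul` are hypothetical helper names, FP8 rounding is only simulated (subnormals are ignored), and `FP8_MAX = 448.0` assumes the OCP e4m3fn format rather than any specific hardware variant.

```python
import math

# Assumption: 448 is the largest finite value of the OCP e4m3fn format.
# Other FP8 variants use a different maximum; swap the constant to model them.
FP8_MAX = 448.0


def fake_fp8(x: float) -> float:
    """Simulate FP8 e4m3 rounding: keep 4 significant bits, saturate at FP8_MAX.

    Simplification: subnormals and the minimum exponent are ignored.
    """
    if x == 0.0:
        return 0.0
    x = max(-FP8_MAX, min(FP8_MAX, x))
    mantissa, exponent = math.frexp(x)    # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    mantissa = round(mantissa * 16) / 16  # implicit leading bit + 3 stored mantissa bits
    return math.ldexp(mantissa, exponent)


def quantize_rows(matrix):
    """Quantize each row with its own dynamic scale (absmax / FP8_MAX)."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row) or 1.0  # avoid a zero scale for all-zero rows
        scale = amax / FP8_MAX
        scales.append(scale)
        quantized.append([fake_fp8(v / scale) for v in row])
    return quantized, scales


def ptpc_matmul(x, w):
    """Compute x @ w.T with per-token scales for x and per-channel scales for w.

    x: [num_tokens][hidden], w: [out_channels][hidden].
    """
    xq, x_scales = quantize_rows(x)  # one scale per token (row of x)
    wq, w_scales = quantize_rows(w)  # one scale per output channel (row of w)
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * sx * sw
         for w_row, sw in zip(wq, w_scales)]
        for x_row, sx in zip(xq, x_scales)
    ]


x = [[1.0, 2.0, 3.0], [0.01, 0.02, 0.04]]    # two tokens with very different ranges
w = [[0.5, 0.25, 1.0], [100.0, 50.0, 25.0]]  # two output channels
y = ptpc_matmul(x, w)
reference = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
```

Because every token row and every output channel gets its own scale, the scales factor out of each dot product, so the result is dequantized by multiplying each output element by one activation scale and one weight scale. A rowwise scaled GEMM fuses that step into the matrix multiply.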

Performance

Perplexity Test

Model: Llama-3.1-8B-Instruct
Dataset: Wikitext
GPU: MI300X

| Model | Quantization | KVCacheDtype | Tasks | Metric | Metric Score |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct/ | auto (bf16) | auto (bf16) | wikitext | word_perplexity | 9.4281 |
| Llama-3.1-8B-Instruct/ | fp8 | fp8_e4m3 | wikitext | word_perplexity | 9.5124 |
| Llama-3.1-8B-Instruct/ | ptpc_fp8 | fp8_e4m3 | wikitext | word_perplexity | 9.5093 |
| Llama-3.1-8B-Instruct/ | ptpc_fp8 (naive) | fp8_e4m3 | wikitext | word_perplexity | 9.5095 |
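To put the perplexity scores above in perspective, the short calculation below expresses each quantized run as a relative increase over the BF16 baseline. The numbers are copied from the table. The variable names are illustrative only.

```python
# Word perplexity scores copied from the table above (lower is better).
baseline_bf16 = 9.4281
quantized = {
    "fp8": 9.5124,
    "ptpc_fp8": 9.5093,
    "ptpc_fp8 (naive)": 9.5095,
}

# Relative increase over the BF16 baseline, in percent.
relative_increase = {
    name: (score - baseline_bf16) / baseline_bf16 * 100
    for name, score in quantized.items()
}

for name, pct in relative_increase.items():
    print(f"{name}: +{pct:.2f}% word perplexity vs. bf16")
```

All three quantized configurations stay within 1% of the BF16 baseline, and the two ptpc_fp8 variants are marginally closer to it than fp8.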

Speed Test (Old naive implementation vs torch._scaled_mm rowwise scaled GEMM feature)

Model: Llama-3.1-70B-Instruct
Dataset: ShareGPT
GPU: 1xMI300X

| Quantization | KVCacheDtype | Req/s | Total tokens/s | Output tokens/s |
|---|---|---|---|---|
| ptpc_fp8 (naive) | fp8_e4m3 | 2.43 | 1003.46 | 481.28 |
| ptpc_fp8 (torch._scaled_mm rowwise scaled GEMM feature) | fp8_e4m3 | 6.36 | 2631.04 | 1261.91 |
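The speedup behind the "at least 2 times" claim can be checked directly from the throughput numbers above. The dictionary keys are illustrative names, not benchmark output fields.

```python
# Throughput numbers copied from the speed test table above.
naive = {"req_per_s": 2.43, "total_tok_per_s": 1003.46, "output_tok_per_s": 481.28}
rowwise = {"req_per_s": 6.36, "total_tok_per_s": 2631.04, "output_tok_per_s": 1261.91}

# Ratio of the torch._scaled_mm rowwise path to the naive path, per metric.
speedup = {metric: rowwise[metric] / naive[metric] for metric in naive}

for metric, ratio in speedup.items():
    print(f"{metric}: {ratio:.2f}x")
```

Each metric works out to roughly 2.6x on this single-MI300X run.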

github-actions · 2025-01-28T05:59:44Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀


mergify · 2025-01-28T06:12:33Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tjtanaa · 2025-01-28T06:45:15Z

This PR is closed because the git history is messed up. It is replaced by #12501.

tjtanaa requested review from mgoin, robertgshaw2-redhat and tlrmchlsmth as code owners January 28, 2025 05:59

mergify bot added the documentation and ci/build labels Jan 28, 2025

kliuae and others added 24 commits January 28, 2025 06:04

add Per-token-activation per-channel-weight on-the-fly quantization fp8

6559a9e

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

add ptpc fp8 unittests

798c07e

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

remove is_navi check for now

63f9657

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

update rocm gpu installation readme; remove navi check

6dc4604

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

update PyTorch version to enable torch._scaled_mm rowwise

30f0ecd

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

[Misc] Rename MultiModalInputsV2 -> MultiModalInputs (vllm-project#…

be57b24

…12244) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (vll…

66d6dd2

…m-project#12237) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

[Misc] Remove redundant TypeVar from base model (vllm-project#12248)

e9ddeda

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-…

0572080

…project#12252) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[torch.compile] transparent compilation with more logging (vllm-proje…

29b95c6

…ct#12246) Signed-off-by: youkaichao <youkaichao@gmail.com>

[V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm…

b559fa6

…-project#12259) Signed-off-by: Roger Wang <ywang@roblox.com>

Remove pytorch comments for outlines + compressed-tensors (vllm-proje…

6cfb7ac

…ct#12260) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

[Platform] improve platforms getattr (vllm-project#12264)

27530bb

Signed-off-by: Mengqing Cao <cmq0113@163.com>

[ci/build] update nightly torch for gh200 test (vllm-project#12270)

91b7860

Signed-off-by: youkaichao <youkaichao@gmail.com>

[Bugfix] fix race condition that leads to wrong order of token return…

98b8414

…ed (vllm-project#10802) Signed-off-by: Jannis Schönleber <joennlae@gmail.com>

[Kernel] fix moe_align_block_size error condition (vllm-project#12239)

e4564cb

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (vllm-…

36077d4

…project#10907) Signed-off-by: rickyx <rickyx@anyscale.com>

[Bugfix] Multi-sequence broken (vllm-project#11898)

049885f

Signed-off-by: Andy Lo <andy@mistral.ai>

[Misc] Remove experimental dep from tracing.py (vllm-project#12007)

0db6a75

Signed-off-by: Adrian Cole <adrian.cole@elastic.co>

[Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-pro…

cbe2a73

…ject#12235) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Core] Free CPU pinned memory on environment cleanup (vllm-project#10477

7980828

)

[BUGFIX] When skip_tokenize_init and multistep are set, execution cra…

10611d8

…shes (vllm-project#12277) Signed-off-by: maleksan85 <maleksan@amd.com> Co-authored-by: maleksan85 <maleksan@amd.com>

[Documentation][AMD] Add information about prebuilt ROCm vLLM docker …

fb43dee

…for perf validation purpose (vllm-project#12281) Signed-off-by: Hongxia Yang <hongxyan@amd.com>

[VLM] Simplify post-processing of replacement info (vllm-project#12269)

4f2fc00

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

DarkLight1337 and others added 12 commits January 28, 2025 06:11

[Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446)

4176918

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm…

7a6cded

…-project#12464) Signed-off-by: Isotr0py <2037008807@qq.com>

[V1][Minor] Minor optimizations for update_from_output (vllm-project#…

bd69c90

…12454) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

[Bugfix] Fix gpt2 GGUF inference (vllm-project#12467)

899cea0

Signed-off-by: Isotr0py <2037008807@qq.com>

[Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-projec…

2fa4f8e

…t#12339) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

[V1][Metrics] Add initial Prometheus logger (vllm-project#12416)

1253304

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

[V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#…

0f2a9ce

…12469) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

[FlashInfer] Upgrade to 0.2.0 (vllm-project#11194)

45844a3

Signed-off-by: Bowen Wang <abmfy@icloud.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>

[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logp…

411e0d2

…robs` with ChunkedPrefill (vllm-project#10132) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: wallashss <wallashss@ibm.com> Co-authored-by: wallashss <wallashss@ibm.com>

Update pre-commit hooks (vllm-project#12475)

0ae8f3e

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (…

008891b

…vllm-project#11277) Signed-off-by: Liangfu Chen <liangfc@amazon.com> Co-authored-by: Jiangfei Duan <jfduan@outlook.com>

Fix bad path in prometheus example (vllm-project#12481)

79151e0

Signed-off-by: mgoin <michael@neuralmagic.com>

tjtanaa force-pushed the ptpc-fp8-rocm branch from d2b5204 to 79151e0 Compare January 28, 2025 06:11

tjtanaa requested review from alexm-redhat, comaniac, WoosukKwon, njhill, LiuXiaoxuanPKU, DarkLight1337, ywang96, simon-mo, zhuohan123 and youkaichao as code owners January 28, 2025 06:11

mergify bot added frontend needs-rebase labels Jan 28, 2025

tjtanaa marked this pull request as draft January 28, 2025 06:12

tjtanaa closed this Jan 28, 2025

tjtanaa deleted the ptpc-fp8-rocm branch February 25, 2025 13:12
