[Feature] Support sequence parallelism for static fp8 quantization #19181

cascade812 · 2025-06-05T04:42:38Z

This PR adds support for sequence parallelism with static fp8 quantization.
Enabling it requires the config below:

from vllm import LLM
from vllm.config import CompilationConfig

config = CompilationConfig(level=3,
                           splitting_ops=[],
                           compile_sizes=[4],
                           custom_ops=["+rms_norm"])

# enable_noop must be True for the SP pattern match to work correctly
config.pass_config.enable_noop = True
config.pass_config.enable_sequence_parallelism = True

llm = LLM(
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
    enforce_eager=False,
    tensor_parallel_size=2,
    compilation_config=config)

Signed-off-by: cascade812 <cascade812@outlook.com>
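For context, here is a minimal sketch of why a sequence parallelism rewrite can be applied around RMSNorm and static fp8 quantization. It assumes the pass replaces an all-reduce with a reduce-scatter over the token dimension, runs the norm and quantization on each rank's shard, and all-gathers afterwards. The helper names, the plain-list "tensors", and the rounding-based stand-in for fp8 are all hypothetical and are not vLLM code. Because RMSNorm works per token and the static scale is a fixed constant, both paths should give the same result.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the e4m3 format


def all_reduce(partials):
    """Sum per-rank partial outputs; every rank would hold the full result."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[r][c] for p in partials) for c in range(cols)]
            for r in range(rows)]


def reduce_scatter(partials):
    """Sum partial outputs, then give rank i only its slice of token rows."""
    world = len(partials)
    full = all_reduce(partials)
    per_rank = len(full) // world
    return [full[i * per_rank:(i + 1) * per_rank] for i in range(world)]


def all_gather(shards):
    """Concatenate the per-rank token shards back into the full sequence."""
    return [row for shard in shards for row in shard]


def rms_norm(rows, eps=1e-6):
    """Normalize each token row independently (weights omitted for brevity)."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out


def static_quant(rows, scale):
    """Stand-in for static fp8 quantization: fixed scale, round, clamp.
    Real fp8 rounding differs; only the fixed scale matters for this sketch."""
    return [[max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(v / scale)))
             for v in row] for row in rows]


# Two ranks (tensor_parallel_size=2), four tokens, hidden size three.
partials = [
    [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [4.0, 0.0, -2.0], [1.5, 1.5, 1.5]],
    [[0.0, 1.0, -1.0], [2.5, 1.0, 0.0], [-1.0, 3.0, 2.0], [0.5, -0.5, 2.5]],
]
scale = 0.01  # static scale, known ahead of time

# Baseline: all-reduce, then norm and quant on the full sequence.
baseline = static_quant(rms_norm(all_reduce(partials)), scale)

# Sequence parallel: each rank norms and quantizes only its token shard.
shards = reduce_scatter(partials)
sp = all_gather([static_quant(rms_norm(s), scale) for s in shards])

print(baseline == sp)
```

In this toy setting each rank handles half of the tokens between the two collectives, which is the property later fusion passes can build on.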

tlrmchlsmth

Overall, looks good to me.

@ProExpertProg @bnellnm Could you take a more detailed look at this?

vllm/config.py

vllm/compilation/sequence_parallelism.py

ProExpertProg

A few asks:

I agree with @tlrmchlsmth that including fused ops might be unnecessary - could we just make this pass run before fusion, and then make sure fusion still works?
Is there any way we could make this pass more general and not reliant on the exact ops? That way it could also work if custom ops are disabled.
- Perhaps here we could enable the custom ops and then lower them after the passes run, like you described in an offline conversation.
Could you post performance numbers? And should we do this for any other ops as well?

vllm/compilation/sequence_parallelism.py

vllm/config.py

mergify · 2025-06-17T05:02:46Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @cascade812.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

cascade812 · 2025-06-17T05:35:06Z

A few asks:

I agree with @tlrmchlsmth that including fused ops might be unnecessary - could we just make this pass run before fusion, and then make sure fusion still works?

Is there any way we could make this pass more general and not reliant on the exact ops? That way it could also work if custom ops are disabled.

Perhaps here we could enable the custom ops and then lower them after the passes run, like you described in an offline conversation.

Could you post performance numbers? And should we do this for any other ops as well?

Right, it works after I moved the sequence parallelism pass to run before the fusion pass.
We can define a custom op that serves as a placeholder, pattern-match on that custom op, and lower it after the pass runs.
The SP pass doesn't provide a perf gain by itself; it lays the groundwork for fusing matmul and collective ops (as in AsyncTP), which can provide a good perf gain. I can provide perf numbers after I add AsyncTP for scaled mm + collective op fusion, which I will do in a separate PR.
We also need similar work for dynamic fp8 ops, which require a different pattern match.
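The pass-ordering point above can be illustrated with a toy sketch. It assumes a graph is just a flat list of op names and that each pass is a simple subsequence rewrite; the op names and helper functions are hypothetical, and the real passes operate on torch.fx graphs rather than lists. The idea is that once fusion has merged the norm and quant ops, an SP pattern written against the unfused ops no longer matches, whereas running SP first leaves a graph that fusion can still handle.

```python
def rewrite(ops, pattern, replacement):
    """Replace every occurrence of `pattern` (a contiguous run of op names)
    in `ops` with `replacement`."""
    out, i, n = [], 0, len(pattern)
    while i < len(ops):
        if ops[i:i + n] == pattern:
            out.extend(replacement)
            i += n
        else:
            out.append(ops[i])
            i += 1
    return out


def sequence_parallel_pass(ops):
    """Toy SP pass: match the unfused all_reduce -> norm -> quant chain."""
    return rewrite(
        ops,
        ["all_reduce", "rms_norm", "static_fp8_quant"],
        ["reduce_scatter", "rms_norm", "static_fp8_quant", "all_gather"],
    )


def fusion_pass(ops):
    """Toy fusion pass: merge norm + quant into one fused op."""
    return rewrite(ops, ["rms_norm", "static_fp8_quant"],
                   ["fused_rms_norm_quant"])


graph = ["matmul", "all_reduce", "rms_norm", "static_fp8_quant", "scaled_mm"]

# SP before fusion: both rewrites apply.
sp_first = fusion_pass(sequence_parallel_pass(graph))

# Fusion before SP: the unfused SP pattern no longer matches.
fusion_first = sequence_parallel_pass(fusion_pass(graph))

print(sp_first)
print(fusion_first)
```

With this ordering the SP pass only needs patterns for the unfused ops, which is the simplification the reviewers asked about.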

ProExpertProg

A few comments!

vllm/compilation/fusion.py

tests/compile/test_sequence_parallelism.py

vllm/compilation/sequence_parallelism.py

vllm/config.py

vllm/compilation/sequence_parallelism.py

ProExpertProg

A few more nits but looks great and thanks for all the code quality improvements!

tests/compile/test_sequence_parallelism.py

vllm/compilation/sequence_parallelism.py

tests/compile/test_sequence_parallelism.py

cascade812 added 6 commits June 2, 2025 04:38

add sp for fused rmsnorm with quantize op

d85bb57

Signed-off-by: cascade812 <cascade812@outlook.com>

add sq pass for rms + quant

5a006d7

Signed-off-by: cascade812 <cascade812@outlook.com>

add tests

8c01738

Signed-off-by: cascade812 <cascade812@outlook.com>

update

a5a5ee6

Signed-off-by: cascade812 <cascade812@outlook.com>

fix and remove debug line

184764d

Signed-off-by: cascade812 <cascade812@outlook.com>

update test and address comment

7f19b80

Signed-off-by: cascade812 <cascade812@outlook.com>

mergify bot mentioned this pull request Jun 5, 2025

sp for static fp8 #19157

Closed

yaochengji requested a review from tlrmchlsmth June 5, 2025 05:49

Merge branch 'main' into sp_fp8

3be2a35

tlrmchlsmth reviewed Jun 9, 2025

vllm/config.py (outdated, resolved)

tlrmchlsmth reviewed Jun 16, 2025

vllm/compilation/sequence_parallelism.py (outdated, resolved)

ProExpertProg reviewed Jun 16, 2025

vllm/compilation/sequence_parallelism.py (resolved)

vllm/compilation/sequence_parallelism.py (outdated, resolved)

vllm/compilation/sequence_parallelism.py (outdated, resolved)

vllm/config.py (outdated, resolved)

address comments

1494b3a

Signed-off-by: cascade812 <cascade812@outlook.com>

mergify bot added the needs-rebase label Jun 17, 2025

merge origin main

991e11c

Signed-off-by: cascade812 <cascade812@outlook.com>

mergify bot removed the needs-rebase label Jun 17, 2025

address comment

2ec04aa

Signed-off-by: cascade812 <cascade812@outlook.com>

update default value for config

fd279ed

Signed-off-by: cascade812 <cascade812@outlook.com>

ProExpertProg reviewed Jun 18, 2025

ProExpertProg mentioned this pull request Jun 18, 2025

[Feature]: CustomOp cleanup #19817

Open

4 tasks

address comments

53620ac

Signed-off-by: cascade812 <cascade812@outlook.com>

cascade812 force-pushed the sp_fp8 branch from c7946ce to 53620ac Compare June 20, 2025 04:03

ProExpertProg approved these changes Jun 20, 2025

tests/compile/test_sequence_parallelism.py (outdated, resolved)

vllm/compilation/sequence_parallelism.py (outdated, resolved)

tests/compile/test_sequence_parallelism.py (outdated, resolved)

ProExpertProg added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 20, 2025

address minor comments

c6e106a

Signed-off-by: cascade812 <cascade812@outlook.com>

ProExpertProg approved these changes Jun 21, 2025

cascade812 added 2 commits June 22, 2025 18:48

Merge remote-tracking branch 'origin' into sp_fp8

4bc5567

register fp8 model for test

36b17d5

Signed-off-by: cascade812 <cascade812@outlook.com>

cascade812 requested review from DarkLight1337 and ywang96 as code owners June 23, 2025 06:24

tlrmchlsmth approved these changes Jun 23, 2025

tlrmchlsmth merged commit e6327c9 into vllm-project:main Jun 23, 2025
72 checks passed

gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025

[Feature] Support sequence parallelism for static fp8 quantization (v…

ad0b72a

…llm-project#19181) Signed-off-by: cascade812 <cascade812@outlook.com>
