Conversation

@fxmarty-amd (Contributor) commented May 9, 2025

This PR follows #16943 and adds support for loading MoE models with MXFP4 weights, using dynamic per-group MXFP4 quantization for activations.

We have not released such models publicly yet, but expect to do so soon.

At the moment, execution on MI300 uses a simulated scheme: weights are dequantized on the fly, and QDQ (quantize-dequantize) is applied to activations on the fly, using HIP kernels.
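
For illustration, here is a minimal PyTorch sketch of what such an emulated per-group MXFP4 QDQ does, assuming the OCP MX convention (32-element groups, FP4 E2M1 element values, shared power-of-two scale). The helper name mxfp4_qdq is a placeholder; this is not the HIP kernel from this PR.

```python
import torch

# Non-negative FP4 E2M1 code points; the sign is handled separately.
FP4_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def mxfp4_qdq(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Emulated MXFP4 quantize-dequantize over contiguous groups of `group_size`."""
    assert x.numel() % group_size == 0
    orig_shape = x.shape
    x = x.reshape(-1, group_size)

    # Shared power-of-two scale per group, chosen so the group's amax fits the
    # E2M1 range (max magnitude 6 = 1.5 * 2**2).
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)

    # Round each scaled element to the nearest representable E2M1 magnitude.
    scaled = (x / scale).clamp(-6.0, 6.0)
    grid = FP4_E2M1_VALUES.to(device=x.device, dtype=x.dtype)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * scaled.sign()

    # Dequantize back to the original shape.
    return (q * scale).reshape(orig_shape)


if __name__ == "__main__":
    x = torch.randn(4, 128)
    print("max QDQ error:", (x - mxfp4_qdq(x)).abs().max().item())
```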

Left to do:

  • Add test.
  • Add documentation.
  • Implement the code path for a real mxfp4 × mxfp4 GEMM (maybe in another PR)
  • Validate sensible eval results for DeepSeek R1, Llama 4, and Llama 405B

fxmarty-amd and others added 3 commits May 9, 2025 08:23

  • wip
  • wip & debug
  • update
  • cleanup
  • use quark realquantizer for pack/quant/dequant
  • comment on cudagraph issue; remove prints
  • Keep only 1 place importing quark
  • cudagraph issue resolved; dq weight at load time for efficiency
  • lint
  • turn on emulation based on platform
  • add fused moe support - ugly wip
  • running version
  • Add envar if dequant weight at load time
  • Mxfp4 memory leak fixes (#2)

Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <felmarty@amd.com>
github-actions bot commented May 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337 (Member) commented:
Can you merge from main to fix pre-commit?

mergify bot commented May 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 13, 2025
fxmarty-amd and others added 11 commits May 13, 2025 13:58

  • wip & debug
  • update
  • cleanup
  • use quark realquantizer for pack/quant/dequant
  • comment on cudagraph issue; remove prints
  • Keep only 1 place importing quark
  • cudagraph issue resolved; dq weight at load time for efficiency
  • lint
  • turn on emulation based on platform
  • add fused moe support - ugly wip
  • running version
  • Add envar if dequant weight at load time
  • Mxfp4 memory leak fixes (#2)
  • Fix VLLM_QUARK_EMU_MEM_OPT route
  • … select the q/dq/qdq implem for mxfp4

Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <felmarty@amd.com>
Co-authored-by: Felix Marty <felmarty@amd.com>
@mergify mergify bot removed the needs-rebase label May 13, 2025
Signed-off-by: Felix Marty <felmarty@amd.com>
@mergify mergify bot added the documentation Improvements or additions to documentation label May 13, 2025
@mergify mergify bot added the needs-rebase label Jul 3, 2025
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
@mergify mergify bot removed the needs-rebase label Jul 8, 2025
@fxmarty-amd (Contributor, Author) commented:

Hi @bnellnm, I addressed your comments and also made this compatible with the recent changes in vLLM for dynamo/inductor by guarding the mxfp4 dequantization & QDQ in custom ops.

Let me know if this looks good!
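
As a rough sketch of what guarding the QDQ in a custom op can look like (this is not the PR's actual registration code, and vLLM has its own custom-op helpers; the op name quark_emu::mxfp4_qdq and the mxfp4_qdq helper are placeholders, and torch.library.custom_op requires PyTorch 2.4 or later):

```python
import torch


def mxfp4_qdq(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the emulated per-group MXFP4 QDQ (see the sketch in the PR
    # description above); a real implementation would fake-quantize here.
    return x


@torch.library.custom_op("quark_emu::mxfp4_qdq", mutates_args=())
def mxfp4_qdq_op(x: torch.Tensor) -> torch.Tensor:
    # torch.compile / Inductor records this as a single opaque node instead of
    # tracing through the data-dependent quantization logic.
    return mxfp4_qdq(x)


@mxfp4_qdq_op.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype propagation only, used while tracing/compiling.
    return torch.empty_like(x)
```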

Signed-off-by: Felix Marty <Felix.Marty@amd.com>
@fxmarty-amd (Contributor, Author) commented:

@bnellnm Concerning the CI, the failing tests appear to be the bitsandbytes tests that were already failing a few weeks ago, so I think they are unrelated:

```
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[facebook/opt-125m-quantize opt model inflight] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[mistralai/Mistral-7B-Instruct-v0.3-quantize inflight model with both HF and Mistral format weights] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[PrunaAI/Einstein-v6.1-Llama3-8B-bnb-4bit-smashed-read pre-quantized 4-bit FP4 model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[poedator/opt-125m-bnb-4bit-read pre-quantized 4-bit NF4 opt model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[yec019/fbopt-350m-8bit-read pre-quantized 8-bit opt model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] - AssertionError
```

@mgoin (Member) commented Jul 9, 2025

Thanks, I'll take a look now. Bill is OOO for a bit

@mgoin (Member) left a review comment:
A few comments left

```python
        a1_scale=None,
        a2_scale=None,
        block_shape=None,
        per_channel_quant=True,
```
@mgoin (Member):

It looks like you are still missing activation=activation here. Also, why does per_channel_quant=True need to be set for mxfp4?

@fxmarty-amd (Contributor, Author) replied on Jul 9, 2025:

I added per_channel_quant=True to address #17888 (comment), see https://github.com/fxmarty-amd/vllm/blob/e570709cfe79c3a43d3e777bb34e0adfa22788f3/vllm/model_executor/layers/fused_moe/utils.py#L87.

Later on, in fused_moe.py, we have per_act_token_quant=per_channel_quant:

```python
qcurr_hidden_states, a1q_scale = moe_kernel_quantize_input(
    A=curr_hidden_states,
    A_scale=a1_scale,
    quant_dtype=qtype,
    per_act_token_quant=per_channel_quant,
    block_shape=block_shape)
```

Actually you are right, this is not compatible with:

```python
def _validate_scale_shape(
    a: torch.Tensor,
    a_scale: Optional[torch.Tensor],
    per_act_token_quant: bool,
    block_shape: Optional[list[int]],
) -> None:
    if a_scale is None:
        return

    if not per_act_token_quant and block_shape is None:
        assert a_scale.numel() == 1, f"{a_scale.shape}"
    elif per_act_token_quant:
        assert a_scale.shape[0] == a.shape[0] and a_scale.shape[1] == 1, (
            f"{a_scale.shape[0]} == {a.shape[0]} and {a_scale.shape[1]} == 1")
    else:
        assert block_shape is not None
        expected = (a.shape[0], cdiv(a.shape[1], block_shape[1]))
        assert a_scale.shape == expected, f"{a_scale.shape} == {expected}"
```

which considers per-token quantization as having a single scale per token.

So I removed per_channel_quant=True in 4ffff1d and will leave #17888 (comment) open. Does that sound ok?
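
For reference, a toy illustration of the scale layouts _validate_scale_shape accepts (assumed shapes, not code from this PR); per-group MXFP4 activation scales fall under the block_shape branch rather than the per-token one:

```python
import torch

T, K, GROUP = 8, 256, 32          # tokens, hidden size, MXFP4 group size
a = torch.randn(T, K)

per_tensor_scale = torch.tensor([1.0])        # numel() == 1
per_token_scale = torch.ones(T, 1)            # per_act_token_quant=True expects [T, 1]
per_group_scale = torch.ones(T, K // GROUP)   # block_shape=[1, GROUP] expects [T, K // 32]

# MXFP4 activations carry one scale per 32-element group, i.e. shape [T, K // 32],
# which matches the block_shape branch, not the single-scale-per-token branch.
```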

@mgoin (Member) replied:

Hmm, sorry, I'm not sure what the "right" way is here just from looking at it quickly.

Signed-off-by: Felix Marty <Felix.Marty@amd.com>
@fxmarty-amd (Contributor, Author) commented Jul 9, 2025

@mgoin I reran the tests in test_quark.py and kernels/moe/test_mxfp4_moe.py; they look good.

@simon-mo merged commit 332d4cb into vllm-project:main on Jul 9, 2025
70 of 72 checks passed
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025

Labels

documentation, quantization, ready

6 participants