
Add MXFP8 to Marlin dense kernel#34664

Open
mgoin wants to merge 1 commit into vllm-project:main from neuralmagic:mxfp8-marlin

Conversation


@mgoin commented on Feb 17, 2026

Purpose

vLLM currently supports MXFP8 (Microscaling FP8) quantization via ModelOpt checkpoints, but only through an unfused emulation path that dequantizes weights to BF16 and runs a standard GEMM.

The Marlin kernel already supports FP8 (per-channel/group scales) and MXFP4 (per-32-element e8m0 scales). MXFP8 is a natural combination of the two: FP8 weights (as in the existing FP8 Marlin path) with e8m0 microscaling block scales (as in the existing MXFP4 Marlin path), so the work here is mainly wiring the existing kernel building blocks together.
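
For context, a minimal sketch of what the unfused emulation path conceptually does (not the actual vLLM implementation; the tensor layout, the group size of 32, and the e8m0 exponent bias of 127 are assumptions based on the MX format description above):

    import torch

    def mxfp8_dequant_emulation(w_fp8: torch.Tensor, scales_e8m0: torch.Tensor) -> torch.Tensor:
        # Dequantize MXFP8 weights to BF16 so a standard BF16 GEMM can be used.
        # w_fp8:       [out_features, in_features] in torch.float8_e4m3fn
        # scales_e8m0: [out_features, in_features // 32] exponent-only block scales (uint8)
        w = w_fp8.to(torch.bfloat16)
        # e8m0 stores only a biased exponent; the represented scale is 2 ** (byte - 127).
        scale = torch.exp2(scales_e8m0.to(torch.float32) - 127).to(torch.bfloat16)
        # Broadcast each block scale across its group of 32 elements along the input dim.
        scale = scale.repeat_interleave(32, dim=-1)
        return w * scale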

Test Plan

Test Result

Eval with mgoin/Qwen3-0.6B-MXFP8
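
As an illustration only (not the author's actual test command), the checkpoint above could be exercised end to end through vLLM's offline Python API; the prompt and sampling settings below are arbitrary:

    from vllm import LLM, SamplingParams

    # Loads mgoin/Qwen3-0.6B-MXFP8 and runs a short greedy generation as a smoke test.
    llm = LLM(model="mgoin/Qwen3-0.6B-MXFP8")
    params = SamplingParams(temperature=0.0, max_tokens=32)
    outputs = llm.generate(["The capital of France is"], params)
    print(outputs[0].outputs[0].text)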


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: mgoin <mgoin64@gmail.com>

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for MXFP8 quantization in the Marlin kernel, providing a faster alternative to the existing emulation path. The changes span kernel generation, C++ dispatch logic, and Python-level integration. The implementation introduces new utility functions for MXFP8-specific weight and scale preparation for Marlin. My review identifies a critical issue in the hardware capability check that could lead to runtime errors on unsupported GPUs.

Comment on lines +1692 to +1700
    from vllm.model_executor.layers.quantization.utils.marlin_utils_fp8 import (
        is_fp8_marlin_supported,
    )

    if is_fp8_marlin_supported():
        self.backend = Mxfp8LinearBackend.MARLIN
    else:
        self.backend = Mxfp8LinearBackend.EMULATION
    self.mxfp8_linear_op = Mxfp8LinearOp(backend=self.backend)


critical

The check is_fp8_marlin_supported() returns true for GPUs with compute capability 7.5+, but the new MXFP8 Marlin kernel requires compute capability 8.0+ (as stated in the comment for get_min_capability and the change from 100 to 80). Using this check will incorrectly enable the Marlin backend on SM75 GPUs (like T4), leading to runtime errors.

A more accurate check for SM 8.0+ should be used here to ensure the correct backend is selected based on hardware capabilities.

Suggested change

Current:

    from vllm.model_executor.layers.quantization.utils.marlin_utils_fp8 import (
        is_fp8_marlin_supported,
    )
    if is_fp8_marlin_supported():
        self.backend = Mxfp8LinearBackend.MARLIN
    else:
        self.backend = Mxfp8LinearBackend.EMULATION
    self.mxfp8_linear_op = Mxfp8LinearOp(backend=self.backend)

Suggested:

    from vllm.platforms import current_platform
    if current_platform.has_device_capability(80):
        self.backend = Mxfp8LinearBackend.MARLIN
    else:
        self.backend = Mxfp8LinearBackend.EMULATION
    self.mxfp8_linear_op = Mxfp8LinearOp(backend=self.backend)
