Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request adds support for MXFP8 quantization in the Marlin kernel, providing a faster alternative to the existing emulation path. The changes span kernel generation, the C++ dispatch logic, and the Python-level integration, and introduce new utility functions for MXFP8-specific weight and scale preparation for Marlin. My review identifies a critical issue in the hardware capability check that could lead to runtime errors on unsupported GPUs.
from vllm.model_executor.layers.quantization.utils.marlin_utils_fp8 import (
    is_fp8_marlin_supported,
)

if is_fp8_marlin_supported():
    self.backend = Mxfp8LinearBackend.MARLIN
else:
    self.backend = Mxfp8LinearBackend.EMULATION
self.mxfp8_linear_op = Mxfp8LinearOp(backend=self.backend)
The check is_fp8_marlin_supported() returns True for GPUs with compute capability 7.5+, but the new MXFP8 Marlin kernel requires compute capability 8.0+ (as stated in the comment for get_min_capability and the change from 100 to 80). Using this check would incorrectly enable the Marlin backend on SM75 GPUs (such as T4), leading to runtime errors.
A more accurate check for SM 8.0+ should be used here to ensure the correct backend is selected based on hardware capabilities.
Suggested change:
-from vllm.model_executor.layers.quantization.utils.marlin_utils_fp8 import (
-    is_fp8_marlin_supported,
-)
-if is_fp8_marlin_supported():
+from vllm.platforms import current_platform
+if current_platform.has_device_capability(80):
     self.backend = Mxfp8LinearBackend.MARLIN
 else:
     self.backend = Mxfp8LinearBackend.EMULATION
 self.mxfp8_linear_op = Mxfp8LinearOp(backend=self.backend)
Purpose
vLLM currently supports MXFP8 (Microscaling FP8) quantization via ModelOpt checkpoints, but only through an unfused emulation path that dequantizes weights to BF16 and runs a standard GEMM.
The Marlin kernel already supports FP8 (per-channel/group scales) and MXFP4 (per-32-element e8m0 scales). MXFP8 is a natural combination: FP8 weights (like existing FP8 Marlin) with e8m0 microscaling block scales (like existing MXFP4 Marlin). We just have to wire the kernel building blocks together.
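To make the "FP8 values + e8m0 block scales" framing concrete, here is a minimal emulation-style sketch of the format (not the code added in this PR; the block-size constant, the scale rounding, and the way exponents are stored are illustrative assumptions):

```python
import torch

BLOCK = 32          # MX block size: one shared scale per 32 elements
FP8_MAX = 448.0     # max magnitude representable in float8_e4m3fn

def mxfp8_quantize(w: torch.Tensor):
    """Quantize a [out_features, in_features] weight into FP8 values plus
    one power-of-two (e8m0-style) exponent per 32-element block."""
    out_f, in_f = w.shape
    assert in_f % BLOCK == 0
    blocks = w.float().reshape(out_f, in_f // BLOCK, BLOCK)
    # Pick a power-of-two scale so the block's largest value fits in e4m3.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    exp = torch.ceil(torch.log2(amax / FP8_MAX))
    scale = torch.exp2(exp)
    q = (blocks / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Exponents kept as signed integers here; real MX stores a biased e8m0 byte.
    return q.reshape(out_f, in_f), exp.squeeze(-1).to(torch.int32)

def mxfp8_dequantize(q: torch.Tensor, exp: torch.Tensor) -> torch.Tensor:
    """Emulation path: expand block exponents back to scales and multiply."""
    out_f, in_f = q.shape
    blocks = q.float().reshape(out_f, in_f // BLOCK, BLOCK)
    return (blocks * torch.exp2(exp.float()).unsqueeze(-1)).reshape(out_f, in_f)

# Round-trip sanity check on a random weight matrix.
w = torch.randn(128, 256)
q, exp = mxfp8_quantize(w)
print((w - mxfp8_dequantize(q, exp)).abs().max())
```

The fused Marlin path avoids materializing the dequantized BF16 weights that an emulation like the above would produce, applying the block scales inside the GEMM instead.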
Test Plan
Test Result
Eval with mgoin/Qwen3-0.6B-MXFP8 (see the example snippet after the checklist).
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
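For reference, a minimal way to load the checkpoint and exercise the new path via vLLM's offline API (an illustrative snippet, not the eval command behind the results above; the prompt and sampling settings are arbitrary):

```python
from vllm import LLM, SamplingParams

# Loads the MXFP8 ModelOpt checkpoint; on SM 8.0+ GPUs the Marlin backend
# added by this PR should be selected, otherwise the emulation path is used.
llm = LLM(model="mgoin/Qwen3-0.6B-MXFP8")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The quick brown fox"], params)
print(outputs[0].outputs[0].text)
```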