Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request adds support for MXFP8 quantization in the Marlin kernel, providing a faster alternative to the existing emulation path. The changes span kernel generation, the C++ dispatch logic, and the Python-level integration, and introduce new utility functions for MXFP8-specific weight and scale preparation for Marlin. My review identifies a critical issue in the hardware capability check that could lead to runtime errors on unsupported GPUs.
from vllm.model_executor.layers.quantization.utils.marlin_utils_fp8 import (
    is_fp8_marlin_supported,
)

if is_fp8_marlin_supported():
    self.backend = Mxfp8LinearBackend.MARLIN
else:
    self.backend = Mxfp8LinearBackend.EMULATION
self.mxfp8_linear_op = Mxfp8LinearOp(backend=self.backend)
The check is_fp8_marlin_supported() returns True for GPUs with compute capability 7.5+, but the new MXFP8 Marlin kernel requires compute capability 8.0+ (as stated in the comment for get_min_capability and the change from 100 to 80). Using this check would incorrectly enable the Marlin backend on SM75 GPUs (such as T4), leading to runtime errors.
A more accurate check for SM 8.0+ should be used here to ensure the correct backend is selected based on hardware capabilities.
Suggested change:
-from vllm.model_executor.layers.quantization.utils.marlin_utils_fp8 import (
-    is_fp8_marlin_supported,
-)
-if is_fp8_marlin_supported():
+from vllm.platforms import current_platform
+if current_platform.has_device_capability(80):
     self.backend = Mxfp8LinearBackend.MARLIN
 else:
     self.backend = Mxfp8LinearBackend.EMULATION
 self.mxfp8_linear_op = Mxfp8LinearOp(backend=self.backend)
Purpose
vLLM currently supports MXFP8 (Microscaling FP8) quantization via ModelOpt checkpoints, but only through an unfused emulation path that dequantizes weights to BF16 and runs a standard GEMM.
The Marlin kernel already supports FP8 (per-channel/group scales) and MXFP4 (per-32-element e8m0 scales). MXFP8 is a natural combination: FP8 weights (like existing FP8 Marlin) with e8m0 microscaling block scales (like existing MXFP4 Marlin). We just have to wire the kernel building blocks together.
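To make the "FP8 values + e8m0 block scales" framing concrete, here is a minimal emulation-style sketch of the format (not the code added in this PR; the block-size constant, the scale rounding, and the way exponents are stored are illustrative assumptions):

```python
import torch

BLOCK = 32          # MX block size: one shared scale per 32 elements
FP8_MAX = 448.0     # max magnitude representable in float8_e4m3fn

def mxfp8_quantize(w: torch.Tensor):
    """Quantize a [out_features, in_features] weight into FP8 values plus
    one power-of-two (e8m0-style) exponent per 32-element block."""
    out_f, in_f = w.shape
    assert in_f % BLOCK == 0
    blocks = w.float().reshape(out_f, in_f // BLOCK, BLOCK)
    # Pick a power-of-two scale so the block's largest value fits in e4m3.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    exp = torch.ceil(torch.log2(amax / FP8_MAX))
    scale = torch.exp2(exp)
    q = (blocks / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Exponents kept as signed integers here; real MX stores a biased e8m0 byte.
    return q.reshape(out_f, in_f), exp.squeeze(-1).to(torch.int32)

def mxfp8_dequantize(q: torch.Tensor, exp: torch.Tensor) -> torch.Tensor:
    """Emulation path: expand block exponents back to scales and multiply."""
    out_f, in_f = q.shape
    blocks = q.float().reshape(out_f, in_f // BLOCK, BLOCK)
    return (blocks * torch.exp2(exp.float()).unsqueeze(-1)).reshape(out_f, in_f)

# Round-trip sanity check on a random weight matrix.
w = torch.randn(128, 256)
q, exp = mxfp8_quantize(w)
print((w - mxfp8_dequantize(q, exp)).abs().max())
```

The fused Marlin path avoids materializing the dequantized BF16 weights that an emulation like the above would produce, applying the block scales inside the GEMM instead.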
Test Plan
Test Result
Eval with mgoin/Qwen3-0.6B-MXFP8 (see the example snippet after the checklist).
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
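For reference, a minimal way to load the checkpoint and exercise the new path via vLLM's offline API (an illustrative snippet, not the eval command behind the results above; the prompt and sampling settings are arbitrary):

```python
from vllm import LLM, SamplingParams

# Loads the MXFP8 ModelOpt checkpoint; on SM 8.0+ GPUs the Marlin backend
# added by this PR should be selected, otherwise the emulation path is used.
llm = LLM(model="mgoin/Qwen3-0.6B-MXFP8")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The quick brown fox"], params)
print(outputs[0].outputs[0].text)
```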