Motivation.
We are making incremental changes to enable Blackwell support in vLLM. This issue tracks all planned items.
Planned or In-Progress Features
The following items are either planned or currently in progress to enable vLLM support on Blackwell.
- Enable NVFP4 support
  - (NVIDIA) Add functional support for NVFP4 kernels for linear layers
  - (NVIDIA) Add functional support for NVFP4 MoE kernels
  - (NVIDIA) Add model integration for nvidia/*-FP4 models (see the usage sketch after this list)
  - Finetune GEMM configurations for Blackwell
  - (NVIDIA) Optimize MoE for latency
  - (NVIDIA) Optimize MoE for throughput (FI: PR !1113)
  - (NVIDIA) MoE all-reduce fusion (FI: PR !1108)
- Optimize communication overlap ops
  - (NVIDIA) Enable NCCL's symmetric memory
  - (NVIDIA) Add support for GEMM + comm overlap
- Blackwell attention kernels
  - (NVIDIA) Integrate Cutlass MLA kernels ([NVIDIA] Add Cutlass MLA backend #17625)
  - (NVIDIA) Integrate vLLM v1-compatible Blackwell prefill and decode GQA kernels (FI: PR !1051)
- FP8 blockscale GEMM and MoE
  - (NVIDIA) FP8 blockscale GEMM
  - (NVIDIA) FP8 blockscale GEMM optimizations (Sm100 blockwise fp8 swap ab #18564)
  - (NVIDIA) FP8 blockscale MoE
  - (NVIDIA) Latency and throughput optimizations
- MTP (multi-token prediction) support
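
Once the NVFP4 kernels and model integration above land, a checkpoint published under nvidia/*-FP4 should be exercisable through vLLM's standard offline entry point. The sketch below is illustrative only (it is not part of this RFC): the model name is a placeholder, and it assumes the quantization scheme is picked up automatically from the checkpoint's quantization config.

```python
# Minimal smoke test for an NVFP4 checkpoint (illustrative, not from the RFC).
from vllm import LLM, SamplingParams

# Hypothetical NVFP4 checkpoint name; quantization assumed to be auto-detected
# from the checkpoint's quantization config.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP4")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is the capital of France?"], params)
for out in outputs:
    print(out.outputs[0].text)
```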
Feedback Period.
No response
CC List.
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.