This issue tracks follow-up enhancements after the initial support for the Deepseek V3 model. Please feel free to chime in and contribute!
- Follow-up to [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization #11523: enhance testing with the shapes of production models and run it regularly on H100.
  - Solve via CUTLASS block-wise quantization kernels (a minimal quantization sketch follows after this list).
- Follow-up to Deepseek v3 #11502:
  - Test and enable torch.compile (see the smoke-test sketch below).
  - Refactor MoEMethodBase to unify and clean up the extra arguments of scoring_func and e_correction_bias.
- Kernel tuning for 8xH200, MI300x, H100 (TP16 and TP8PP2 cases)
  - Use https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py, but adapt it for the w8a8 fused MoE kernel.
- CUDA Graph support
- MLA: [WIP] Deepseek V2 MLA #10927 (@simon-mo)
- Support nextn prediction heads (EAGLE-style prediction heads)
- Support expert parallelism for MoE (see the token-dispatch sketch below).
- Support data parallelism for MLA.
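
For the block-wise quantization item above, here is a minimal PyTorch sketch of the weight side: one fp8 scale per 128x128 block, the granularity the DeepSeek V3 checkpoints use. The function name and layout are illustrative assumptions, not vLLM's actual kernel interface.

```python
# Hedged sketch: the weight half of w8a8, quantizing a 2-D weight to fp8 with
# one scale per 128x128 block. Illustrative only; not vLLM's kernel interface.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
BLOCK = 128

def quantize_weight_blockwise(w: torch.Tensor):
    """Return (fp8 weight of shape (n, k), scales of shape (n/128, k/128))."""
    n, k = w.shape
    assert n % BLOCK == 0 and k % BLOCK == 0, "pad to a multiple of 128 first"
    # View as (n/128, 128, k/128, 128) so (i, j) indexes one block.
    tiles = w.reshape(n // BLOCK, BLOCK, k // BLOCK, BLOCK)
    # One scale per block: max-abs over the two intra-block dims.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scales = amax / FP8_MAX
    q = (tiles / scales).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.reshape(n, k), scales.squeeze(3).squeeze(1)

w = torch.randn(256, 512)
q, s = quantize_weight_blockwise(w)
# Dequantize and check the round-trip error.
deq = (q.float().reshape(2, 128, 4, 128) * s[:, None, :, None]).reshape(256, 512)
print((deq - w).abs().max())
```

The activation ("a8") side is analogous but with per-token-group (1x128) scales computed at runtime; the CUTLASS kernels mentioned above would then consume such block scales during the GEMM.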
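For the torch.compile item, a standalone smoke test of the sort that could gate enabling compilation might look like the following; DummyBlock is a stand-in module, not the real Deepseek decoder layer.

```python
# Hedged sketch of a torch.compile smoke test. DummyBlock is a stand-in for
# the real decoder layer; the point is surfacing graph breaks and checking
# numerical parity between eager and compiled execution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyBlock(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.up = nn.Linear(hidden, 4 * hidden)
        self.down = nn.Linear(4 * hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.down(F.silu(self.up(x)))

block = DummyBlock().eval()
compiled = torch.compile(block, fullgraph=True)  # fullgraph raises on graph breaks
x = torch.randn(2, 16, 64)
with torch.no_grad():
    torch.testing.assert_close(compiled(x), block(x), rtol=1e-4, atol=1e-4)
print("compiled forward matches eager")
```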
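For the expert-parallelism item, the core operation is an all-to-all that moves each token to the rank owning its routed expert. Below is a minimal sketch under simplifying assumptions (top-1 routing, NCCL backend, num_experts divisible by the world size); the names are illustrative and unrelated to vLLM's MoE code.

```python
# Hedged sketch of expert-parallel token dispatch. Launch with:
#   torchrun --nproc-per-node=2 this_file.py
import torch
import torch.distributed as dist

def dispatch_to_experts(tokens, expert_ids, num_experts):
    """Send each token to the rank that owns its routed expert (top-1 routing)."""
    world = dist.get_world_size()
    dest_rank = expert_ids // (num_experts // world)

    # Sort tokens so each destination rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    tokens_sorted = tokens[order].contiguous()

    # Exchange per-rank token counts, then the tokens themselves.
    send_counts = torch.bincount(dest_rank, minlength=world)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    recv = tokens.new_empty(int(recv_counts.sum()), tokens.shape[1])
    dist.all_to_all_single(recv, tokens_sorted,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return recv  # tokens now live on the rank that owns their experts

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    toks = torch.randn(8, 16, device="cuda")
    ids = torch.randint(0, 8, (8,), device="cuda")
    print(dist.get_rank(), dispatch_to_experts(toks, ids, num_experts=8).shape)
```

After the local experts run, a symmetric all-to-all with the split sizes swapped returns the outputs to their source ranks; per-expert segmentation within each rank is omitted here for brevity.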