DeepSeek V4 support #4554

Open
grimoire wants to merge 77 commits into InternLM:main from grimoire:dsv4
grimoire (Collaborator) commented Apr 24, 2026

Requirements:

  • FlashMLA
  • fast_hadamard_transform
  • deep_gemm
  • tile_kernels
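
As a quick pre-flight check, the requirements above can be probed before launching the engine. This is a minimal sketch, not part of the PR: the import names (`flash_mla`, `fast_hadamard_transform`, `deep_gemm`, `tile_kernels`) are assumptions derived from the project names, and the helper `missing_requirements` is hypothetical.

```python
import importlib.util

# Project name -> assumed import name. The PR lists the projects only,
# so these module names are guesses and may need adjusting.
REQUIRED_MODULES = {
    "FlashMLA": "flash_mla",
    "fast_hadamard_transform": "fast_hadamard_transform",
    "deep_gemm": "deep_gemm",
    "tile_kernels": "tile_kernels",
}


def missing_requirements(modules=REQUIRED_MODULES):
    """Return the project names whose module cannot be located."""
    missing = []
    for project, module in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(project)
    return missing


if __name__ == "__main__":
    print("missing:", missing_requirements() or "none")
```

The check only confirms that a module can be located; it does not verify versions or that the GPU is a Hopper device.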

This PR supports Hopper GPUs only, since we do not have devices with FP4 support.
Both TP and DPEP are supported.

result

DeepseekV4-Flash Thinking High

| dataset | version | metric | mode | dsv4 |
| --- | --- | --- | --- | --- |
| GPQA_diamond_repeat_4 | 772ea0 | accuracy (4 runs average) | gen | 86.24 |

grimoire marked this pull request as ready for review May 19, 2026 12:28
Copilot AI review requested due to automatic review settings May 19, 2026 12:28
Copilot AI (Contributor) left a comment

Pull request overview

This WIP PR adds DeepSeek-V4 model support to the PyTorch engine on Hopper GPUs, including a new sparse FlashMLA attention path, FP8×FP4 fused MoE (both TP and EP via DeepEP), V4-specific compressor/indexer/sinkhorn ops, an HF config registry, and new Triton kernels for FP4 grouped GEMM, KV flattening, and window packing.

Changes:

  • New V4 attention / compressor / indexer / sinkhorn op layers with CUDA Triton+tilelang implementations, plus FP4 fused MoE kernels and grouped GEMM wrappers.
  • New lmdeploy.hf_configs module centralizing HF config registration/loading (used by tokenizer, archs, config, check_env); adds DeepseekV4Config and moves DeepseekV32Config.
  • Cache engine refactor introducing BlockCacheSpec / StateCacheSpec (named, optionally layer-scoped caches), use_standard_kv_cache flag, and update_cache_config_func hook; V4 model config wires these specs.
  • Rotary embedding gains complex_mode for adjacent-pair RoPE; bitonic top-k kernel becomes persistent + partial-sort; small fixes in rms_norm/ds_index.
  • Chat template, module map, env flag (LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE), and graph runner support for the new model; tests added for FP4 GEMM, fused MoE, and complex RoPE.
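
To illustrate what "adjacent-pair RoPE" in the rotary embedding item means, the sketch below contrasts it with the more common half-split (rotate-half) layout. It is a pure-Python illustration under stated assumptions, not the PR's kernel: the function names, the `base=10000.0` default, and the frequency formula are the conventional RoPE choices and are assumed here rather than taken from the PR.

```python
import cmath
import math


def rope_adjacent_pairs(x, position, base=10000.0):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... viewed as complex numbers."""
    dim = len(x)
    assert dim % 2 == 0, "head dimension must be even"
    out = []
    for i in range(dim // 2):
        theta = position * base ** (-2.0 * i / dim)
        # Treat the adjacent pair as real + imag and multiply by e^{i*theta}.
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out


def rope_half_split(x, position, base=10000.0):
    """Rotate pairs (x[i], x[i + dim/2]), the rotate-half layout."""
    dim = len(x)
    assert dim % 2 == 0, "head dimension must be even"
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out
```

Both layouts apply the same rotation angles and differ only in which elements are paired, so weights trained with one layout need the matching rotate path at inference time.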

Reviewed changes

Copilot reviewed 67 out of 68 changed files in this pull request and generated no comments.

Summary per file

| File | Description |
| --- | --- |
| `lmdeploy/hf_configs/{__init__,configuration_deepseek_v32,configuration_deepseek_v4}.py` | New central HF config module; adds V4 config and moves V32 config here. |
| `lmdeploy/pytorch/transformers/{__init__,configuration_deepseek_v32,configuration_deepseek_v4}.py` | Re-export shims pointing to lmdeploy.hf_configs. |
| `lmdeploy/{archs,tokenizer,model}.py` | Switch to config_from_pretrained; add DeepseekV4 chat template. |
| `lmdeploy/pytorch/check_env/model.py` | Use centralized config loader. |
| `lmdeploy/pytorch/config.py` | Add BlockCacheSpec/StateCacheSpec, use_standard_kv_cache, post_build_func, update_cache_config_func. |
| `lmdeploy/pytorch/configurations/deepseek_v4.py` | V4 model config builder, env check, cache spec materialization. |
| `lmdeploy/pytorch/consts.py` | V4 FlashMLA sparse FP8 layout constants. |
| `lmdeploy/pytorch/engine/cache_engine.py` | Named block/state caches, layer-scoped state cache shape expansion. |
| `lmdeploy/pytorch/engine/executor/base.py` | Hook into update_cache_config_func; thread state specs through sizing. |
| `lmdeploy/pytorch/engine/model_agent/agent.py` | Attach block_caches/named_state_caches to step context; pass model config to StateCacheEngine. |
| `lmdeploy/pytorch/envs.py` | LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE flag. |
| `lmdeploy/pytorch/model_inputs.py` | New step context fields (max_q_seqlen, named caches). |
| `lmdeploy/pytorch/models/module_map.py` | Register DeepseekV4ForCausalLM. |
| `lmdeploy/pytorch/models/utils/cudagraph.py` | Copy block_offsets into context from cudagraph buffers. |
| `lmdeploy/pytorch/backends/{base,attention,compressor,hc_split_sinkhorn,indexer,apply_rotary_emb}.py` | New op types (V4Attention/Indexer/Compressor/HcSplitSinkhorn/FusedMoEV4FP4), metadata dataclasses, complex_mode arg on rotary. |
| `lmdeploy/pytorch/backends/default/{norm,apply_rotary_emb}.py` | RMSNorm fp32 weight; complex-mode rotate path. |
| `lmdeploy/pytorch/backends/dlinfer/apply_rotary_emb.py` | Reject complex_mode. |
| `lmdeploy/pytorch/backends/cuda/{op_backend,apply_rotary_emb,hc_split_sinkhorn,v4_indexer,v4_compressor,graph_runner}.py` | Wire V4 op builders; fake-capture path in graph runner. |
| `lmdeploy/pytorch/backends/cuda/attention/{__init__,v4,v4_utils}.py` | V4 attention impl, metadata pre-computation, helper kernels. |
| `lmdeploy/pytorch/backends/cuda/moe/{__init__,v4_fp4}.py` | V4 FP4 MoE TP / EP / DeepGEMM impls. |
| `lmdeploy/pytorch/kernels/cuda/{apply_rotary_pos_emb,bitonic_topk,ds_index,rms_norm,v4_fp4_fused_moe,v4_fp4_grouped_gemm,v4_flatten_kv,v4_pack_window,dsv4/*}.py` | New/updated Triton kernels for V4. |
| `lmdeploy/pytorch/nn/{__init__,norm,rotary_embedding,hc_split_sinkhorn,v4_attention,v4_compressor,v4_indexer,moe/__init__,moe/v4_fp4}.py` | New nn wrappers for V4 ops; rms_scale, forward_single/complex_mode on rotary, FusedMoEV4FP4. |
| `tests/pytorch/kernel/{dsv4_utils,test_apply_rotary,test_fuse_moe_v4_fp4,test_v4_fp4_grouped_gemm}.py` | Reference helpers and tests for FP4 GEMM/MoE and complex RoPE. |
| `.gitignore` | Add .DS_Store. |

grimoire changed the title from "[WIP]DeepSeek V4 support" to "DeepSeek V4 support" May 19, 2026