DeepSeek V4 support #4554
Open
grimoire wants to merge 77 commits into
Pull request overview
This WIP PR adds DeepSeek-V4 model support to the PyTorch engine on Hopper GPUs, including a new sparse FlashMLA attention path, FP8×FP4 fused MoE (both TP and EP via DeepEP), V4-specific compressor/indexer/sinkhorn ops, an HF config registry, and new Triton kernels for FP4 grouped GEMM, KV flattening, and window packing.
Changes:
- New V4 attention / compressor / indexer / sinkhorn op layers with CUDA Triton+tilelang implementations, plus FP4 fused MoE kernels and grouped GEMM wrappers.
- New `lmdeploy.hf_configs` module centralizing HF config registration/loading (used by tokenizer, archs, config, check_env); adds `DeepseekV4Config` and moves `DeepseekV32Config`.
- Cache engine refactor introducing `BlockCacheSpec`/`StateCacheSpec` (named, optionally layer-scoped caches), `use_standard_kv_cache` flag, and `update_cache_config_func` hook; V4 model config wires these specs.
- Rotary embedding gains `complex_mode` for adjacent-pair RoPE; bitonic top-k kernel becomes persistent + partial-sort; small fixes in `rms_norm`/`ds_index`.
- Chat template, module map, env flag (`LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE`), and graph runner support for the new model; tests added for FP4 GEMM, fused MoE, and complex RoPE.
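For readers unfamiliar with the "adjacent-pair" layout mentioned for `complex_mode`, here is a minimal pure-Python sketch of the idea. It is not the PR's kernel: the function names, the `base=10000.0` default, and the list-based interface are assumptions made for illustration. Adjacent elements `(x[2i], x[2i+1])` are treated as one complex number and multiplied by a unit phasor, in contrast to the common "rotate-half" layout that pairs `x[i]` with `x[i + dim/2]`.

```python
import cmath
import math


def rope_adjacent_pairs(x, position, base=10000.0):
    """Rotate (x[0], x[1]), (x[2], x[3]), ... as complex numbers (illustrative only)."""
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("head dimension must be even")
    out = []
    for i in range(dim // 2):
        # Standard RoPE frequency schedule: theta_i = position * base^(-2i/dim).
        theta = position * base ** (-2.0 * i / dim)
        # Adjacent elements form one complex number: real = x[2i], imag = x[2i+1].
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out


def rope_half_split(x, position, base=10000.0):
    """'Rotate-half' layout for comparison: pairs are (x[i], x[i + dim/2])."""
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("head dimension must be even")
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out
```

The two layouts apply the same rotations and differ only by a fixed permutation of the head dimension, so a checkpoint trained with one layout needs the matching kernel path (or a weight permutation) at inference time.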
Reviewed changes
Copilot reviewed 67 out of 68 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `lmdeploy/hf_configs/{__init__,configuration_deepseek_v32,configuration_deepseek_v4}.py` | New central HF config module; adds V4 config and moves V32 config here. |
| `lmdeploy/pytorch/transformers/{__init__,configuration_deepseek_v32,configuration_deepseek_v4}.py` | Re-export shims pointing to lmdeploy.hf_configs. |
| lmdeploy/{archs,tokenizer,model}.py | Switch to config_from_pretrained; add DeepseekV4 chat template. |
| lmdeploy/pytorch/check_env/model.py | Use centralized config loader. |
| lmdeploy/pytorch/config.py | Add BlockCacheSpec/StateCacheSpec, use_standard_kv_cache, post_build_func, update_cache_config_func. |
| lmdeploy/pytorch/configurations/deepseek_v4.py | V4 model config builder, env check, cache spec materialization. |
| lmdeploy/pytorch/consts.py | V4 FlashMLA sparse FP8 layout constants. |
| lmdeploy/pytorch/engine/cache_engine.py | Named block/state caches, layer-scoped state cache shape expansion. |
| lmdeploy/pytorch/engine/executor/base.py | Hook into update_cache_config_func; thread state specs through sizing. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Attach block_caches/named_state_caches to step context; pass model config to StateCacheEngine. |
| lmdeploy/pytorch/envs.py | LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE flag. |
| lmdeploy/pytorch/model_inputs.py | New step context fields (max_q_seqlen, named caches). |
| lmdeploy/pytorch/models/module_map.py | Register DeepseekV4ForCausalLM. |
| lmdeploy/pytorch/models/utils/cudagraph.py | Copy block_offsets into context from cudagraph buffers. |
| lmdeploy/pytorch/backends/{base,attention,compressor,hc_split_sinkhorn,indexer,apply_rotary_emb}.py | New op types (V4Attention/Indexer/Compressor/HcSplitSinkhorn/FusedMoEV4FP4), metadata dataclasses, complex_mode arg on rotary. |
| lmdeploy/pytorch/backends/default/{norm,apply_rotary_emb}.py | RMSNorm fp32 weight; complex-mode rotate path. |
| lmdeploy/pytorch/backends/dlinfer/apply_rotary_emb.py | Reject complex_mode. |
| lmdeploy/pytorch/backends/cuda/{op_backend,apply_rotary_emb,hc_split_sinkhorn,v4_indexer,v4_compressor,graph_runner}.py | Wire V4 op builders; fake-capture path in graph runner. |
| `lmdeploy/pytorch/backends/cuda/attention/{__init__,v4,v4_utils}.py` | V4 attention impl, metadata pre-computation, helper kernels. |
| `lmdeploy/pytorch/backends/cuda/moe/{__init__,v4_fp4}.py` | V4 FP4 MoE TP / EP / DeepGEMM impls. |
| lmdeploy/pytorch/kernels/cuda/{apply_rotary_pos_emb,bitonic_topk,ds_index,rms_norm,v4_fp4_fused_moe,v4_fp4_grouped_gemm,v4_flatten_kv,v4_pack_window,dsv4/*}.py | New/updated Triton kernels for V4. |
| `lmdeploy/pytorch/nn/{__init__,norm,rotary_embedding,hc_split_sinkhorn,v4_attention,v4_compressor,v4_indexer,moe/__init__,moe/v4_fp4}.py` | New nn wrappers for V4 ops; rms_scale, forward_single/complex_mode on rotary, FusedMoEV4FP4. |
| tests/pytorch/kernel/{dsv4_utils,test_apply_rotary,test_fuse_moe_v4_fp4,test_v4_fp4_grouped_gemm}.py | Reference helpers and tests for FP4 GEMM/MoE and complex RoPE. |
| .gitignore | Add .DS_Store. |
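Several of the files above deal with FP4 weights on hardware that has no native FP4 matrix instructions, so the packed values presumably have to be unpacked and rescaled before or inside the GEMM. The sketch below shows what that step can look like in plain Python. It is an assumption-laden illustration, not the PR's kernel: it assumes the E2M1 encoding from the OCP Microscaling formats, low-nibble-first packing, and one float scale per block. The names `decode_e2m1` and `dequantize_fp4` and the `block_size` default are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits of the 4-bit code (bit 3 is the sign).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_fp4(packed, scales, block_size=32):
    """Unpack two FP4 codes per byte and apply one scale per block of values.

    Assumptions (not taken from the PR): the low nibble holds the earlier
    element, and each run of `block_size` values shares one float scale.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))  # earlier element
        values.append(decode_e2m1(byte >> 4))    # later element
    return [v * scales[i // block_size] for i, v in enumerate(values)]
```

For example, `dequantize_fp4(bytes([0x21, 0xF7]), [2.0], block_size=4)` unpacks the codes `1, 2, 7, 15` and scales them by 2.0. A real kernel would fuse this with the matmul rather than materialize the full-precision weights.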
Requirements:
This PR is Hopper-only, since we do not have devices with FP4 support.
Both TP and DPEP are supported.
Result:
DeepseekV4-Flash Thinking High