DeepSeek V4 support by grimoire · Pull Request #4554 · InternLM/lmdeploy

grimoire · 2026-04-24T10:29:31Z

Requirements:

FlashMLA
fast_hadamard_transform
deep_gemm
tile_kernels
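
A quick way to see which of these requirements are present in an environment is sketched below. The import names are assumptions: the PR lists the projects, not the Python module names, so adjust the mapping if a package installs under a different name.

```python
import importlib.util

# Project name -> Python import name. The import names are assumptions,
# not taken from the PR; edit them to match your installation.
REQUIRED_MODULES = {
    "FlashMLA": "flash_mla",
    "fast_hadamard_transform": "fast_hadamard_transform",
    "deep_gemm": "deep_gemm",
    "tile_kernels": "tile_kernels",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the project names whose top-level module cannot be located."""
    missing = []
    for project, module in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(project)
    return missing
```

Calling `find_missing()` returns an empty list when every module can be found. It only checks that the modules are discoverable, not that they were built for the right GPU architecture.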

This PR is Hopper-only because we do not have devices with FP4 support.
Both TP and DPEP are supported.

result

DeepseekV4-Flash Thinking High

dataset	version	metric	mode	dsv4
GPQA_diamond_repeat_4	772ea0	accuracy (4 runs average)	gen	86.24

Copilot

Pull request overview

This WIP PR adds DeepSeek-V4 model support to the PyTorch engine on Hopper GPUs, including a new sparse FlashMLA attention path, FP8×FP4 fused MoE (both TP and EP via DeepEP), V4-specific compressor/indexer/sinkhorn ops, an HF config registry, and new Triton kernels for FP4 grouped GEMM, KV flattening, and window packing.
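
For readers unfamiliar with FP4 weights, the sketch below decodes packed 4-bit values on the CPU. It assumes the common E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit), two values per byte with the low nibble first, and a single scale per call. None of these layout details are stated in this PR, and the function names are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits (exponent + mantissa).
# Bit 3 of the nibble is the sign bit.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code into a Python float."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_fp4(packed, scale=1.0):
    """Unpack two FP4 values per byte and multiply each by one shared scale.

    Assumption: the low nibble holds the first value of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values
```

For example, `unpack_fp4(bytes([0x21]), scale=2.0)` yields the two dequantized values for codes `0x1` and `0x2`. Real kernels fuse this dequantization into the GEMM rather than materializing floats.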

Changes:

New V4 attention / compressor / indexer / sinkhorn op layers with CUDA Triton+tilelang implementations, plus FP4 fused MoE kernels and grouped GEMM wrappers.
New lmdeploy.hf_configs module centralizing HF config registration/loading (used by tokenizer, archs, config, check_env); adds DeepseekV4Config and moves DeepseekV32Config.
Cache engine refactor introducing BlockCacheSpec / StateCacheSpec (named, optionally layer-scoped caches), use_standard_kv_cache flag, and update_cache_config_func hook; V4 model config wires these specs.
Rotary embedding gains complex_mode for adjacent-pair RoPE; bitonic top-k kernel becomes persistent + partial-sort; small fixes in rms_norm/ds_index.
Chat template, module map, env flag (LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE), and graph runner support for the new model; tests added for FP4 GEMM, fused MoE, and complex RoPE.
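
To illustrate what adjacent-pair RoPE means, here is a minimal pure-Python sketch that treats each pair `(x[2i], x[2i+1])` as one complex number and rotates it. This contrasts with the half-split layout, which pairs `x[i]` with `x[i + dim/2]`. The function name, the `base` default, and the frequency formula are the standard RoPE choices and are assumptions here, not code from this PR.

```python
import cmath


def rope_adjacent_pairs(x, position, base=10000.0):
    """Rotate adjacent pairs of one head vector by position-dependent angles."""
    dim = len(x)
    assert dim % 2 == 0, "head dimension must be even"
    out = []
    for i in range(dim // 2):
        # Standard RoPE frequency for pair i (assumed, not taken from the PR).
        theta = position * base ** (-2.0 * i / dim)
        # View the adjacent pair as a complex number and rotate it by theta.
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * theta)
        out.extend((z.real, z.imag))
    return out
```

Because each pair is multiplied by a unit-modulus complex number, the vector norm is preserved, and position 0 leaves the input unchanged.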

Reviewed changes

Copilot reviewed 67 out of 68 changed files in this pull request and generated no comments.


File	Description
lmdeploy/hf_configs/{__init__,configuration_deepseek_v32,configuration_deepseek_v4}.py	New central HF config module; adds V4 config and moves V32 config here.
lmdeploy/pytorch/transformers/{__init__,configuration_deepseek_v32,configuration_deepseek_v4}.py	Re-export shims pointing to `lmdeploy.hf_configs`.
lmdeploy/{archs,tokenizer,model}.py	Switch to `config_from_pretrained`; add DeepseekV4 chat template.
lmdeploy/pytorch/check_env/model.py	Use centralized config loader.
lmdeploy/pytorch/config.py	Add `BlockCacheSpec`/`StateCacheSpec`, `use_standard_kv_cache`, `post_build_func`, `update_cache_config_func`.
lmdeploy/pytorch/configurations/deepseek_v4.py	V4 model config builder, env check, cache spec materialization.
lmdeploy/pytorch/consts.py	V4 FlashMLA sparse FP8 layout constants.
lmdeploy/pytorch/engine/cache_engine.py	Named block/state caches, layer-scoped state cache shape expansion.
lmdeploy/pytorch/engine/executor/base.py	Hook into `update_cache_config_func`; thread state specs through sizing.
lmdeploy/pytorch/engine/model_agent/agent.py	Attach `block_caches`/`named_state_caches` to step context; pass model config to `StateCacheEngine`.
lmdeploy/pytorch/envs.py	`LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE` flag.
lmdeploy/pytorch/model_inputs.py	New step context fields (max_q_seqlen, named caches).
lmdeploy/pytorch/models/module_map.py	Register `DeepseekV4ForCausalLM`.
lmdeploy/pytorch/models/utils/cudagraph.py	Copy `block_offsets` into context from cudagraph buffers.
lmdeploy/pytorch/backends/{base,attention,compressor,hc_split_sinkhorn,indexer,apply_rotary_emb}.py	New op types (V4Attention/Indexer/Compressor/HcSplitSinkhorn/FusedMoEV4FP4), metadata dataclasses, complex_mode arg on rotary.
lmdeploy/pytorch/backends/default/{norm,apply_rotary_emb}.py	RMSNorm fp32 weight; complex-mode rotate path.
lmdeploy/pytorch/backends/dlinfer/apply_rotary_emb.py	Reject complex_mode.
lmdeploy/pytorch/backends/cuda/{op_backend,apply_rotary_emb,hc_split_sinkhorn,v4_indexer,v4_compressor,graph_runner}.py	Wire V4 op builders; fake-capture path in graph runner.
lmdeploy/pytorch/backends/cuda/attention/{__init__,v4,v4_utils}.py	V4 attention impl, metadata pre-computation, helper kernels.
lmdeploy/pytorch/backends/cuda/moe/{__init__,v4_fp4}.py	V4 FP4 MoE TP / EP / DeepGEMM impls.
lmdeploy/pytorch/kernels/cuda/{apply_rotary_pos_emb,bitonic_topk,ds_index,rms_norm,v4_fp4_fused_moe,v4_fp4_grouped_gemm,v4_flatten_kv,v4_pack_window,dsv4/*}.py	New/updated Triton kernels for V4.
lmdeploy/pytorch/nn/{__init__,norm,rotary_embedding,hc_split_sinkhorn,v4_attention,v4_compressor,v4_indexer,moe/__init__,moe/v4_fp4}.py	New nn wrappers for V4 ops; `rms_scale`, `forward_single`/`complex_mode` on rotary, `FusedMoEV4FP4`.
tests/pytorch/kernel/{dsv4_utils,test_apply_rotary,test_fuse_moe_v4_fp4,test_v4_fp4_grouped_gemm}.py	Reference helpers and tests for FP4 GEMM/MoE and complex RoPE.
.gitignore	Add `.DS_Store`.
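
As background for the `bitonic_topk` kernel listed in the table, the sketch below shows the basic idea of top-k selection with a bitonic sorting network, in plain Python. It sorts the whole padded input and then slices the first `k` entries. The persistent and partial-sort optimizations mentioned in this PR are not modeled, and the function name and return format are hypothetical.

```python
import math


def bitonic_topk(scores, k):
    """Return the k largest (score, index) pairs in descending score order."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    # Bitonic networks need a power-of-two length; pad with -inf sentinels.
    n = 1
    while n < len(scores):
        n *= 2
    items = [(s, i) for i, s in enumerate(scores)]
    items += [(-math.inf, -1)] * (n - len(scores))

    size = 2
    while size <= n:  # size of the bitonic sequences being merged
        stride = size // 2
        while stride >= 1:  # compare-exchange distance within a merge
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    descending = (i & size) == 0
                    a, b = items[i], items[partner]
                    # Swap when the pair violates the block's direction.
                    if (a < b) if descending else (a > b):
                        items[i], items[partner] = b, a
            stride //= 2
        size *= 2
    return items[:k]
```

Every compare-exchange within one `stride` pass touches disjoint pairs, which is what makes the network a good fit for GPU threads. The Python loop here runs them sequentially.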


grimoire and others added 30 commits April 24, 2026 18:28

DeepSeek V4 support eager

27da1cd

add template

7e0e304

remove compressed_cache_engine.py

ed59863

fix fused moe

ab63081

support cudagraph

1ef28b6

fix cudagraph phase1/2

daf655f

agent suck

973ebf1

wtf

486002e

fix

41bb9f9

remove quantlinear

2d57ab8

remove debug

016e1dc

remove key

d24ca01

statecache

92b3008

window as state

40d157a

remove

d864604

better fill sliding window

5b6cd60

use flashmla

740eb3a

new start

a1f98f6

add kernels

3a0d6c3

fix layout

da6092e

fix

b8ad3c3

fix

2129e11

newnew

0e9b0c0

fix

2877b8d

opt indexer

dd36198

update compress kernel

615bf1e

optimize attn forward

45c4045

sparse attn

cd86685

fp8 cache

e23d6d0

mla

e8746f5

grimoire added 25 commits May 9, 2026 17:03

add skip layers for debug

0b0c1c8

fix

625356d

refactor v4

f654c54

fix kernel

3cf0b07

opt indexer

a0ec686

merge main

89d89c2

force bitonic topk

43543d8

ep

76056cb

auto block size

616849d

fix

7dcb18d

opt

c7ac9ab

opt moe

37f6ca1

optimize topk

5f5f941

optimize

3fb9f0c

opt kernel

7abd076

opt compressor

a5dba47

decode attn meta once

6adc8c9

optimize prefill

e0ca8a8

optimize

fea204e

fix

d7d47b2

update template

7746461

no tp indexer

1ade47f

opt prefix pos

8a769b2

fix lint

374a0ca

Merge branch 'main' into dsv4

4c9ed5d

grimoire marked this pull request as ready for review May 19, 2026 12:28

Copilot AI review requested due to automatic review settings May 19, 2026 12:28

Copilot started reviewing on behalf of grimoire May 19, 2026 12:28

Copilot AI reviewed May 19, 2026

grimoire changed the title ~~[WIP]DeepSeek V4 support~~ DeepSeek V4 support May 19, 2026

grimoire wants to merge 77 commits into InternLM:main from grimoire:dsv4

2 participants
