
Add DeepSeek V2 Model into Transformers #36400


Merged: 55 commits merged into huggingface:main from add-deepseekv2 on Jul 9, 2025

Conversation

@VladOS95-cyber (Contributor) commented Feb 25, 2025

What does this PR do?

DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

This PR adds DeepSeek-V2 to the Transformers library. It uses the new modular approach in transformers, which adds a model through a modular_modelname.py file from which the full modeling_modelname.py is generated. Since DeepSeek-V2 can reuse parts of the Llama architecture, it is an ideal use case for this modular approach. A minimal usage sketch follows.
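For context, here is a rough, hedged sketch of how the model is expected to be loaded once this PR is merged, using the deepseek-ai/DeepSeek-V2-Lite checkpoint mentioned later in the thread (the prompt and generation settings are illustrative only, not part of this PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# DeepSeek-V2-Lite is the smaller checkpoint used for testing in this PR.
model_id = "deepseek-ai/DeepSeek-V2-Lite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 + device_map="auto" keeps memory usage manageable on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Multi-head Latent Attention compresses the KV cache by", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```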

Before submitting

Who can review?

@ArthurZucker @Rocketknight1

@Rocketknight1 (Member):

cc @ArthurZucker let me know if you want me to take the first review here!

@VladOS95-cyber marked this pull request as ready for review March 10, 2025 17:03
@VladOS95-cyber (Contributor, Author) commented Mar 10, 2025

Hi @Rocketknight1! The main structure as well as most of the model code is done, except the unit tests, which I will add later. One thing that really bothers me is that the model always produces the same token. I have checked everything against the original implementation many times, and as far as I can tell the logic is identical; I also compared the hidden_states from every layer, and they matched. Yet the model still produces gibberish. I am still investigating, of course, but I am really confused. Is there anything I might have missed?
For now, I am checking everything with deepseek-ai/DeepSeek-V2-Lite.

@Rocketknight1 (Member):

Hi @VladOS95-cyber, that's a very odd bug! Try comparing the output logits for an input sentence rather than just the hidden states. If those are identical, then I don't understand how this model could generate gibberish while the original model generates correct text.
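As an illustration of that suggestion, here is a rough sketch of such a logits comparison. It assumes the original implementation is loaded via trust_remote_code and the new port via the native code path; the prompt and tolerances are assumptions, not the exact debugging script used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The capital of France is", return_tensors="pt")

# Reference: the original modeling code shipped inside the checkpoint repo.
reference = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
# Candidate: the new native DeepSeek-V2 implementation from this PR.
candidate = AutoModelForCausalLM.from_pretrained(model_id)

with torch.no_grad():
    ref_logits = reference(**inputs).logits
    new_logits = candidate(**inputs).logits

# If the port is correct, the logits should agree up to numerical noise.
print("max abs diff:", (ref_logits - new_logits).abs().max().item())
print("allclose:", torch.allclose(ref_logits, new_logits, atol=1e-4))
```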

@VladOS95-cyber (Contributor, Author) commented Mar 11, 2025

Hey @Rocketknight1! I found the issue: a single line had accidentally been removed from the forward pass of the decoder layer. It was probably dropped somewhere between commits, after I had already compared the hidden states, so I did not notice it. I fixed it and the model now works as expected! Next I will work on unit tests and some small refactoring of names, types, and so on.

@VladOS95-cyber (Contributor, Author):

Hi @Rocketknight1! I added unit and integration tests and did some small refactoring, so this PR is now ready for review.

@VladOS95-cyber force-pushed the add-deepseekv2 branch 2 times, most recently from beac6a5 to 99d5d6c on March 13, 2025 12:13
@Rocketknight1 (Member):

Hi @VladOS95-cyber, take a look at the CI! There are still some test failures. You may also need to add some classes to the documentation: look at the specific errors in "check_repository_consistency" and check out other model-addition PRs to see where the class autodoc lines should go.

@VladOS95-cyber (Contributor, Author):

Hi @Rocketknight1! All tests are green

@ArthurZucker (Collaborator) left a comment

Great work, and highly related to #35926 🤗 let's adopt some of those conventions here, mostly removing the extra code paths.

query_shape = (batch_size, seq_length, -1, self.qk_head_dim)
key_shape = (batch_size, seq_length, -1, self.qk_nope_head_dim + self.v_head_dim)

if self.q_lora_rank is None:
Collaborator:

Is this not training dependent?

Contributor (Author):

Well I don't see how it might be affected by that

Collaborator:

It's just that on the DeepSeek-V3 thread I got the impression that training influences which path you should take!

Collaborator:

Also, same question: are there any checkpoints where q_lora_rank is None?

Comment on lines +153 to +160
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
"layers.*.self_attn.q_a_proj": "colwise",
"layers.*.self_attn.q_b_proj": "colwise",
"layers.*.self_attn.kv_b_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
Collaborator:

can you confirm that this works? (TP is quite important for such a big model!)

Contributor (Author):

Unfortunately, I can't confirm that, as I don't have an environment to test it. As a reference I used the tp_plan implementations of other models as well as the PyTorch distributed docs.
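For reference, a minimal sketch of how the TP plan could be exercised once a multi-GPU machine is available, assuming the tp_plan="auto" loading path available in recent Transformers releases (the flag, launch command, and prompt are assumptions, not something verified in this PR):

```python
# Launch with, e.g.: torchrun --nproc-per-node 4 tp_smoke_test.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# tp_plan="auto" asks Transformers to shard weights across ranks
# according to base_model_tp_plan (the colwise/rowwise splits above).
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, tp_plan="auto"
)

inputs = tokenizer("Tensor parallel smoke test:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```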

Collaborator:

I'll try to check that on my side!

Contributor (Author):

Thank you so much!

@VladOS95-cyber (Contributor, Author):

Hi @ArthurZucker @Rocketknight1! I have resolved all the comments I could and replied to the rest.

@ArthurZucker (Collaborator):

Thanks! Before merging the PR we need to make sure the modeling code works (both the big and the small checkpoints). Do you have access to enough compute to do that, or should we do it?

@VladOS95-cyber (Contributor, Author):

Thanks! Before merging the PR we need to make sure the modeling code works (both the big and the small checkpoints). Do you have access to enough compute to do that, or should we do it?

Hi @ArthurZucker! Unfortunately I don't; I've already spent everything on testing during development.

@VladOS95-cyber force-pushed the add-deepseekv2 branch 2 times, most recently from 7ae4489 to 04522ee on March 28, 2025 13:45
@Cyrilvallez (Member):

Looks like fp8 will not run on older hardware; my bad, I forgot. Let's use bitsandbytes instead then. Sorry about that.

@Cyrilvallez (Member):

run-slow: deepseek_v2

github-actions bot commented Jul 8, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/deepseek_v2']
quantizations: [] ...

@VladOS95-cyber (Contributor, Author) commented Jul 8, 2025

Looks like fp8 will not run on older hardware; my bad, I forgot. Let's use bitsandbytes instead then. Sorry about that.

OK, sure, I will remove fp8 quantization then and try a different quantization approach.

@Cyrilvallez (Member):

Just use another quantization method instead: https://huggingface.co/docs/transformers/en/quantization/bitsandbytes

@VladOS95-cyber (Contributor, Author) commented Jul 8, 2025

@Cyrilvallez I just tried BitsAndBytesConfig(load_in_8bit=True) for quantization, and the model produces very different output compared with normal usage. However, I ran it on CPU, so I suppose it should be much better on GPU; unfortunately I cannot test that. We should not expect a big difference between the quantized and non-quantized versions, right? With the non-quantized model all tests are green. Should I push it anyway so we can run the tests again?
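For clarity, here is a rough sketch of the 8-bit loading path being discussed, assuming a CUDA GPU is available (bitsandbytes 8-bit kernels target GPUs, which may explain the divergent CPU output); the prompt and generation settings are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the quantized weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config, device_map="auto"
)

inputs = tokenizer(
    "DeepSeek-V2 is a Mixture-of-Experts model that", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```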

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2


@Cyrilvallez (Member):

run-slow: deepseek_v2

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2

github-actions bot commented Jul 9, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/deepseek_v2']
quantizations: [] ...

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2

@Cyrilvallez (Member):

run-slow: deepseek_v2

github-actions bot commented Jul 9, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/deepseek_v2']
quantizations: [] ...

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2


@Cyrilvallez (Member) left a comment

Alright, I fixed the remaining issues with the tests etc! This is now ready to be merged!
Thanks a lot for your work! Great contribution! And thanks for bearing with us! 🤗🚀

@Cyrilvallez merged commit c980904 into huggingface:main Jul 9, 2025
23 checks passed
@VladOS95-cyber (Contributor, Author):

Hi @Cyrilvallez! Thank you so much for taking care of the tests, I really appreciate that, and thank you for your support! I am very glad that it is complete!

rjgleaton pushed a commit to rjgleaton/transformers that referenced this pull request Jul 17, 2025
* add initial structure

* doc fixes, add model base logic

* update init files

* some fixes to config and modular

* some improvements for attention

* format

* remove unused attn

* some fixes for moe layer and for decoder

* adapt _compute_yarn_parameters for deepseek

* format

* small fix

* fix for decoder forward

* add tests, small refactoring

* fix dummies

* fix init

* fix doc

* fix config docs

* add sequce doc, fix init for gate

* fix issues in tests

* fix config doc

* remove unused args

* some fixes and refactoring after review

* fix doc for config

* small fixes for config args

* revert config refactoring

* small refactoring

* minor fixes after rebase

* small fix after merge

* fix modular

* remove rotaryembd from public init

* small test fix

* some rotary pos calculation improvement

* fix format

* some improvements and fixes

* fix config

* some refactoring

* adjust some unit tests

* skip test

* small fixes and tests adjustment

* reapply modular

* fix all tests except Integration

* fix integration testzs

* cleanup BC stuff

* rope

* fix integrations tests based on a10

* style

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
@geetu040 (Contributor):

@VladOS95-cyber really nice work!
I just wanted to ask: where can I find the HF-compatible weights? Or is it supposed to work with the original weights from this official collection: collections/deepseek-ai/deepseek-v2

@VladOS95-cyber (Contributor, Author):

@VladOS95-cyber really nice work! I just wanted to ask: where can I find the HF-compatible weights? Or is it supposed to work with the original weights from this official collection: collections/deepseek-ai/deepseek-v2

Hi, thank you! It is supposed to work with the original weights from the official DeepSeek HF repo.
