
Add DeepSeek V2 Model into Transformers #36400


Merged: 55 commits merged into huggingface:main from add-deepseekv2 on Jul 9, 2025

Conversation

@VladOS95-cyber (Contributor) commented Feb 25, 2025

What does this PR do?

DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

This PR adds DeepSeek-V2 to the Transformers library. It uses the new modular approach in transformers, which adds a model through a modular_modelname.py file from which the full modeling_modelname.py is generated. Since DeepSeek-V2 can reuse parts of the Llama architecture, it is an ideal use case for this modular approach. A minimal usage sketch follows.
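For context, here is a rough, hedged sketch of how the model is expected to be loaded once this PR is merged, using the deepseek-ai/DeepSeek-V2-Lite checkpoint mentioned later in the thread (the prompt and generation settings are illustrative only, not part of this PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# DeepSeek-V2-Lite is the smaller checkpoint used for testing in this PR.
model_id = "deepseek-ai/DeepSeek-V2-Lite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 + device_map="auto" keeps memory usage manageable on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Multi-head Latent Attention compresses the KV cache by", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```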

Before submitting

Who can review?

@ArthurZucker @Rocketknight1

@Rocketknight1 (Member):

cc @ArthurZucker let me know if you want me to take the first review here!

@VladOS95-cyber marked this pull request as ready for review March 10, 2025 17:03
@VladOS95-cyber (Contributor, Author) commented Mar 10, 2025

Hi @Rocketknight1! The main structure as well as most of the model code is done, except the unit tests, which I will add later. One thing that really bothers me is that the model always produces the same token. I have checked everything against the original implementation many times, and as far as I can tell the logic is identical; I also compared the hidden_states from every layer, and they matched. Yet the model still produces gibberish. I am still investigating, of course, but I am really confused. Is there anything I might have missed?
For now, I am checking everything with deepseek-ai/DeepSeek-V2-Lite.

@Rocketknight1 (Member):

Hi @VladOS95-cyber, that's a very odd bug! Try comparing the output logits for an input sentence rather than just the hidden states. If those are identical, then I don't understand how this model could generate gibberish while the original model generates correct text.
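As an illustration of that suggestion, here is a rough sketch of such a logits comparison. It assumes the original implementation is loaded via trust_remote_code and the new port via the native code path; the prompt and tolerances are assumptions, not the exact debugging script used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The capital of France is", return_tensors="pt")

# Reference: the original modeling code shipped inside the checkpoint repo.
reference = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
# Candidate: the new native DeepSeek-V2 implementation from this PR.
candidate = AutoModelForCausalLM.from_pretrained(model_id)

with torch.no_grad():
    ref_logits = reference(**inputs).logits
    new_logits = candidate(**inputs).logits

# If the port is correct, the logits should agree up to numerical noise.
print("max abs diff:", (ref_logits - new_logits).abs().max().item())
print("allclose:", torch.allclose(ref_logits, new_logits, atol=1e-4))
```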

@VladOS95-cyber (Contributor, Author) commented Mar 11, 2025

Hey @Rocketknight1! I found the issue: a single line had accidentally been removed from the forward pass of the decoder layer. It was probably dropped somewhere between commits, after I had already compared the hidden states, so I did not notice it. I fixed it and the model now works as expected! Next I will work on unit tests and some small refactoring of names, types, and so on.

@VladOS95-cyber (Contributor, Author):

Hi @Rocketknight1! I added unit and integration tests and did some small refactoring, so this PR is now ready for review.

@VladOS95-cyber force-pushed the add-deepseekv2 branch 2 times, most recently from beac6a5 to 99d5d6c on March 13, 2025 12:13
@Rocketknight1 (Member):

Hi @VladOS95-cyber, take a look at the CI! There are still some test failures. You may also need to add some classes to the documentation: look at the specific errors in "check_repository_consistency" and check out other model-addition PRs to see where the class autodoc lines should go.

@VladOS95-cyber (Contributor, Author):

Hi @Rocketknight1! All tests are green

@ArthurZucker (Collaborator) left a comment

Great work, and highly related to #35926 🤗 let's adopt some of those conventions here, mostly removing the extra code paths.

query_shape = (batch_size, seq_length, -1, self.qk_head_dim)
key_shape = (batch_size, seq_length, -1, self.qk_nope_head_dim + self.v_head_dim)

if self.q_lora_rank is None:
Collaborator:

Is this not training dependent?

Contributor (Author):

Well I don't see how it might be affected by that

Collaborator:

It's just that on the DeepSeek-V3 thread I got the impression that training influences which path you should take!

Collaborator:

Also, same question: are there any checkpoints where q_lora_rank is None?

Comment on lines +153 to +160
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
"layers.*.self_attn.q_a_proj": "colwise",
"layers.*.self_attn.q_b_proj": "colwise",
"layers.*.self_attn.kv_b_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
Collaborator:

can you confirm that this works? (TP is quite important for such a big model!)

Contributor (Author):

Unfortunately, I can't confirm that, as I don't have an environment to test it. As a reference I used the tp_plan implementations of other models as well as the PyTorch distributed docs.
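For reference, a minimal sketch of how the TP plan could be exercised once a multi-GPU machine is available, assuming the tp_plan="auto" loading path available in recent Transformers releases (the flag, launch command, and prompt are assumptions, not something verified in this PR):

```python
# Launch with, e.g.: torchrun --nproc-per-node 4 tp_smoke_test.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# tp_plan="auto" asks Transformers to shard weights across ranks
# according to base_model_tp_plan (the colwise/rowwise splits above).
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, tp_plan="auto"
)

inputs = tokenizer("Tensor parallel smoke test:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```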

Collaborator:

I'll try to check that on my side!

Contributor (Author):

Thank you so much!

@VladOS95-cyber (Contributor, Author):

Hi @ArthurZucker @Rocketknight1! I have resolved all the comments I could and replied to the rest.

@ArthurZucker (Collaborator):

Thanks! Before merging the PR we need to make sure the modeling code works (both the big and the small checkpoints). Do you have access to enough compute to do that, or should we do it?

@VladOS95-cyber (Contributor, Author):

Thanks! Before merging the PR we need to make sure the modeling code works (both the big and the small checkpoints). Do you have access to enough compute to do that, or should we do it?

Hi @ArthurZucker! Unfortunately I don't; I've already spent everything on testing during development.

@VladOS95-cyber force-pushed the add-deepseekv2 branch 2 times, most recently from 7ae4489 to 04522ee on March 28, 2025 13:45
@Cyrilvallez (Member):

Looks like fp8 will not run on older hardware; my bad, I forgot. Let's use bitsandbytes instead then. Sorry about that.

@Cyrilvallez (Member):

run-slow: deepseek_v2

github-actions bot commented Jul 8, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/deepseek_v2']
quantizations: [] ...

@VladOS95-cyber (Contributor, Author) commented Jul 8, 2025

Looks like fp8 will not run on older hardware; my bad, I forgot. Let's use bitsandbytes instead then. Sorry about that.

OK, sure, I will remove fp8 quantization then and try a different quantization approach.

@Cyrilvallez (Member):

Just use another quantization method instead: https://huggingface.co/docs/transformers/en/quantization/bitsandbytes

@VladOS95-cyber (Contributor, Author) commented Jul 8, 2025

@Cyrilvallez I just tried BitsAndBytesConfig(load_in_8bit=True) for quantization, and the model produces very different output compared with normal usage. However, I ran it on CPU, so I suppose it should be much better on GPU; unfortunately I cannot test that. We should not expect a big difference between the quantized and non-quantized versions, right? With the non-quantized model all tests are green. Should I push it anyway so we can run the tests again?
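For clarity, here is a rough sketch of the 8-bit loading path being discussed, assuming a CUDA GPU is available (bitsandbytes 8-bit kernels target GPUs, which may explain the divergent CPU output); the prompt and generation settings are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the quantized weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config, device_map="auto"
)

inputs = tokenizer(
    "DeepSeek-V2 is a Mixture-of-Experts model that", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```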

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2


@Cyrilvallez (Member):

run-slow: deepseek_v2

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2

github-actions bot commented Jul 9, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/deepseek_v2']
quantizations: [] ...

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2

@Cyrilvallez (Member):

run-slow: deepseek_v2

github-actions bot commented Jul 9, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/deepseek_v2']
quantizations: [] ...

github-actions bot commented Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, deepseek_v2


@Cyrilvallez (Member) left a comment

Alright, I fixed the remaining issues with the tests etc! This is now ready to be merged!
Thanks a lot for your work! Great contribution! And thanks for bearing with us! 🤗🚀

@Cyrilvallez merged commit c980904 into huggingface:main Jul 9, 2025
23 checks passed
@VladOS95-cyber (Contributor, Author):

Hi @Cyrilvallez! Thank you so much for taking care of the tests, I really appreciate that, and thank you for your support! I am very glad that it is complete!

rjgleaton pushed a commit to rjgleaton/transformers that referenced this pull request Jul 17, 2025
* add initial structure

* doc fixes, add model base logic

* update init files

* some fixes to config and modular

* some improvements for attention

* format

* remove unused attn

* some fixes for moe layer and for decoder

* adapt _compute_yarn_parameters for deepseek

* format

* small fix

* fix for decoder forward

* add tests, small refactoring

* fix dummies

* fix init

* fix doc

* fix config docs

* add sequce doc, fix init for gate

* fix issues in tests

* fix config doc

* remove unused args

* some fixes and refactoring after review

* fix doc for config

* small fixes for config args

* revert config refactoring

* small refactoring

* minor fixes after rebase

* small fix after merge

* fix modular

* remove rotaryembd from public init

* small test fix

* some rotary pos calculation improvement

* fix format

* some improvements and fixes

* fix config

* some refactoring

* adjust some unit tests

* skip test

* small fixes and tests adjustment

* reapply modular

* fix all tests except Integration

* fix integration testzs

* cleanup BC stuff

* rope

* fix integrations tests based on a10

* style

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
@geetu040 (Contributor):

@VladOS95-cyber really nice work!
I just wanted to ask: where can I find the HF-compatible weights? Or is it supposed to work with the original weights from this official collection: collections/deepseek-ai/deepseek-v2

@VladOS95-cyber (Contributor, Author):

@VladOS95-cyber really nice work! I just wanted to ask: where can I find the HF-compatible weights? Or is it supposed to work with the original weights from this official collection: collections/deepseek-ai/deepseek-v2

Hi, thank you! It is supposed to work with the original weights from the official DeepSeek HF repo.
