Add DeepSeek V2 Model into Transformers #36400
Conversation
cc @ArthurZucker let me know if you want me to take the first review here!
Hi @Rocketknight1! The main structure and most of the model code are done, except for the unit tests, which I will add later. One thing that really bothers me is that the model always produces the same token. I have checked everything against the original implementation many times and, as far as I can tell, the logic is identical; I also compared the hidden_states of every layer and they matched. But the model still produces gibberish. I keep investigating, of course, but I am really confused. Is there anything I might have missed?
hi @VladOS95-cyber, that's a very odd bug! Try comparing the output logits for an input sentence rather than just the hidden states - if those are identical, then I don't understand how this model could generate gibberish but the original model could generate correct text.
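For anyone debugging a similar mismatch, here is a minimal sketch of the logit comparison suggested above. The checkpoint path is a placeholder, and it assumes the checkpoint ships the original remote-code modeling files so the ported and reference implementations can be loaded side by side:

```python
# Hypothetical sketch: compare output logits of the ported model against the
# reference implementation for the same input. The checkpoint path is a
# placeholder, not part of this PR.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/local/deepseek-v2-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("The capital of France is", return_tensors="pt")

# Ported (native Transformers) implementation
ported = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
# Original remote-code implementation, assuming the repo ships it
reference = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True
)

with torch.no_grad():
    logits_ported = ported(**inputs).logits
    logits_reference = reference(**inputs).logits

# If the hidden states match but these differ, the bug is after the last
# compared layer (e.g. the final norm or the LM head).
print(torch.max(torch.abs(logits_ported - logits_reference)))
print(torch.allclose(logits_ported, logits_reference, atol=1e-3, rtol=1e-3))
```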
Hey @Rocketknight1! I found the issue: a single line had accidentally been removed from the decoder layer's forward pass. I think it happened between commits, and I did not notice because it was removed after I had compared the states. I fixed that and the model now works as expected! So I am going to work on unit tests and some small refactoring of names, types, and so on.
Hi @Rocketknight1! I added unit and integration tests and did some small refactoring. So this PR is completely ready for review.
hi @VladOS95-cyber, take a look at the CI! There are still some test failures. You may also need to add some classes to the documentation - look at the specific errors in "check_repository_consistency" and check out other model addition PRs to see where the class autodoc lines should go.
Hi @Rocketknight1! All tests are green.
Great work / highly related to #35926 🤗 let's adopt a bit of those conventions here! Mostly removing the extra code paths.
```python
query_shape = (batch_size, seq_length, -1, self.qk_head_dim)
key_shape = (batch_size, seq_length, -1, self.qk_nope_head_dim + self.v_head_dim)

if self.q_lora_rank is None:
```
Is this not training-dependent?
Well, I don't see how it might be affected by that.
It's just that on the DeepSeek-V3 thread I got the impression that training influences which path you should take!
Also, same question: are there any checkpoints where q_lora_rank is None?
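For readers following this thread, here is a rough sketch (not the PR's exact code) of the two query-projection paths that `q_lora_rank` toggles in the MLA attention; the config attribute names are assumptions based on the excerpt above:

```python
import torch
import torch.nn as nn

# Rough illustration of the branch under discussion; attribute names mirror the
# excerpt above, but this is NOT the PR's exact implementation.
class QueryProjection(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.q_lora_rank = config.q_lora_rank
        hidden_size = config.hidden_size
        # Assumed: per-head query dim is the no-position part plus the RoPE part.
        q_out = config.num_attention_heads * (config.qk_nope_head_dim + config.qk_rope_head_dim)
        if self.q_lora_rank is None:
            # Plain dense projection, used when the config leaves q_lora_rank unset.
            self.q_proj = nn.Linear(hidden_size, q_out, bias=False)
        else:
            # Low-rank factorization of the query projection: the q_a/q_b pair
            # that also appears in the tp_plan excerpt below.
            self.q_a_proj = nn.Linear(hidden_size, self.q_lora_rank, bias=False)
            # The original implementation uses RMSNorm here; nn.RMSNorm needs torch >= 2.4.
            self.q_a_layernorm = nn.RMSNorm(self.q_lora_rank)
            self.q_b_proj = nn.Linear(self.q_lora_rank, q_out, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.q_lora_rank is None:
            return self.q_proj(hidden_states)
        return self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
```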
```python
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.q_a_proj": "colwise",
    "layers.*.self_attn.q_b_proj": "colwise",
    "layers.*.self_attn.kv_b_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
```
can you confirm that this works? (TP is quite important for such a big model!)
Unfortunately, I can't confirm it, as I don't have an environment to test that. As a reference, I used the tp_plan implementations of other models as well as the PyTorch distributed docs.
I'll try to check that on my side!
Thank you so much!
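If it helps with verification, this is roughly how the plan could be smoke-tested on a multi-GPU node; the checkpoint id and launch command are illustrative, not taken from the PR:

```python
# Hypothetical multi-GPU smoke test for the tp_plan above.
# Launch with something like: torchrun --nproc-per-node 4 tp_smoke_test.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shards the model according to base_model_tp_plan
)

inputs = tokenizer("The capital of France is", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(inputs).logits

# Check the sharded model produces a sensible next token.
next_token = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token]))
```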
Hi @ArthurZucker @Rocketknight1! I resolved all the comments I could and replied to the rest.
Thanks! Before merging the PR we need to make sure the modeling code works (both the big and the small checkpoints). Do you have access to enough compute to do that, or should we do it?
Hi @ArthurZucker! Unfortunately I don't; I've already spent everything on testing during development.
Looks like fp8 will not run on older hardware, my bad, I forgot. Let's use bitsandbytes instead then. Sorry about that.
run-slow: deepseek_v2
This comment contains run-slow, running the specified jobs: models: ['models/deepseek_v2']
Ok, sure, I will remove fp8 quantization then and try a different quantization approach.
Just use another quantization method instead: https://huggingface.co/docs/transformers/en/quantization/bitsandbytes
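For reference, a minimal sketch of 4-bit loading with bitsandbytes as suggested; the checkpoint id is illustrative and the bitsandbytes package must be installed:

```python
# Minimal bitsandbytes 4-bit loading sketch for the integration tests;
# the checkpoint id is illustrative, not necessarily the one used in the PR.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite"  # illustrative checkpoint
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Hello, DeepSeek!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```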
@Cyrilvallez I just tried to use
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, deepseek_v2
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Alright, I fixed the remaining issues with the tests etc! This is now ready to be merged!
Thanks a lot for your work! Great contribution! And thanks for bearing with us! 🤗🚀
Hi @Cyrilvallez! Thank you so much for taking care of the tests, I really appreciate it, and thank you for your support! I am very glad that it is completed!
* add initial structure
* doc fixes, add model base logic
* update init files
* some fixes to config and modular
* some improvements for attention
* format
* remove unused attn
* some fixes for moe layer and for decoder
* adapt _compute_yarn_parameters for deepseek
* format
* small fix
* fix for decoder forward
* add tests, small refactoring
* fix dummies
* fix init
* fix doc
* fix config docs
* add sequce doc, fix init for gate
* fix issues in tests
* fix config doc
* remove unused args
* some fixes and refactoring after review
* fix doc for config
* small fixes for config args
* revert config refactoring
* small refactoring
* minor fixes after rebase
* small fix after merge
* fix modular
* remove rotaryembd from public init
* small test fix
* some rotary pos calculation improvement
* fix format
* some improvements and fixes
* fix config
* some refactoring
* adjust some unit tests
* skip test
* small fixes and tests adjustment
* reapply modular
* fix all tests except Integration
* fix integration testzs
* cleanup BC stuff
* rope
* fix integrations tests based on a10
* style

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
@VladOS95-cyber really nice work!
Hi, thank you! It is supposed to work with the original weights from the official DeepSeek HF repo.
What does this PR do?
DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times. DeepSeek-V2 is pretrained on a high-quality, multi-source corpus of 8.1T tokens, then further tuned with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
This PR adds DeepSeek-V2 to the Transformers library. Transformers recently introduced the modular approach, where a new model is defined in a modular_modelname.py file that reuses components from existing models. Since DeepSeek-V2 can reuse parts of the Llama architecture, it is an ideal use case for this approach.
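A quick usage sketch of the new integration (the checkpoint id and generation settings are illustrative, not prescribed by this PR):

```python
# Illustrative usage of the new DeepSeek-V2 integration; the checkpoint id and
# generation settings are examples, not part of the PR itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain Multi-head Latent Attention in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```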
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Link: Deepseek v2 #35317
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @Rocketknight1