
Commit 958a242

Merge branch 'main' into update-electra-model-card
2 parents 2792fa3 + 9fd9476

134 files changed (+4194, -1028 lines)

.github/ISSUE_TEMPLATE/bug-report.yml

Lines changed: 2 additions & 2 deletions
@@ -48,11 +48,11 @@ body:
  - pipelines: @Rocketknight1
  - tensorflow: @gante and @Rocketknight1
  - tokenizers: @ArthurZucker and @itazap
- - trainer: @muellerzr @SunMarc
+ - trainer: @zach-huggingface @SunMarc

  Integrations:

- - deepspeed: HF Trainer/Accelerate: @muellerzr
+ - deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
  - ray/raytune: @richardliaw, @amogkam
  - Big Model Inference: @SunMarc
  - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 2 additions & 2 deletions
@@ -51,12 +51,12 @@ Library:
  - pipelines: @Rocketknight1
  - tensorflow: @gante and @Rocketknight1
  - tokenizers: @ArthurZucker
- - trainer: @muellerzr and @SunMarc
+ - trainer: @zach-huggingface and @SunMarc
  - chat templates: @Rocketknight1

  Integrations:

- - deepspeed: HF Trainer/Accelerate: @muellerzr
+ - deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
  - ray/raytune: @richardliaw, @amogkam
  - Big Model Inference: @SunMarc
  - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber

.github/scripts/codeowners_for_review_action

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ docs/ @stevhliu
  # Owners of subsections of the library
  /src/transformers/generation/ @gante
  /src/transformers/pipeline/ @Rocketknight1 @yonigozlan
- /src/transformers/integrations/ @SunMarc @MekkCyber @muellerzr
+ /src/transformers/integrations/ @SunMarc @MekkCyber @zach-huggingface
  /src/transformers/quantizers/ @SunMarc @MekkCyber
  tests/ @ydshieh
  tests/generation/ @gante
@@ -27,8 +27,8 @@ tests/generation/ @gante
  # Specific files come after the sections/globs, so they take priority
  /.circleci/config.yml @ArthurZucker @ydshieh
  /utils/tests_fetcher.py @ydshieh
- trainer.py @muellerzr @SunMarc
- trainer_utils.py @muellerzr @SunMarc
+ trainer.py @zach-huggingface @SunMarc
+ trainer_utils.py @zach-huggingface @SunMarc
  /utils/modular_model_converter.py @Cyrilvallez @ArthurZucker

  # Owners of individual models are specific / high priority, and so they come last

benchmark/README.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ def run_benchmark(logger: Logger, branch: str, commit_id: str, commit_msg: str,

  ## Writing metrics to the database

- `MetricRecorder` is thread-safe, in the sense of the python [`Thread`](https://docs.python.org/3/library/threading.html#threading.Thread). This means you can start a background thread to do the readings on the device measurements while not blocking the main thread to execute the model measurements.
+ `MetricsRecorder` is thread-safe, in the sense of the python [`Thread`](https://docs.python.org/3/library/threading.html#threading.Thread). This means you can start a background thread to do the readings on the device measurements while not blocking the main thread to execute the model measurements.

  cf [`llama.py`](./llama.py) to see an example of this in practice.
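The changed line above is about using `MetricsRecorder` from a background thread. As a rough, self-contained sketch of that pattern (the helper names below are placeholders, not the actual `MetricsRecorder` API; see `llama.py` for the real usage):

```python
import threading
import time

# Placeholder stand-ins for the real benchmark pieces; the actual MetricsRecorder API may differ.
measurements = []
stop_event = threading.Event()

def read_device_measurement():
    return 0.0  # e.g. a GPU utilization/memory reading

def collect_device_measurements(interval_s=0.1):
    # Background reader: poll device metrics until the main thread signals stop.
    while not stop_event.is_set():
        measurements.append(read_device_measurement())
        time.sleep(interval_s)

reader = threading.Thread(target=collect_device_measurements, daemon=True)
reader.start()

# Main thread: run the model measurements without being blocked by the reader.
time.sleep(1.0)  # stands in for the actual model benchmark loop

stop_event.set()
reader.join()
print(f"collected {len(measurements)} device readings")
```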
benchmark/benchmarks_entrypoint.py

Lines changed: 0 additions & 1 deletion
@@ -3,7 +3,6 @@
  import logging
  import os
  from typing import Dict
- import psycopg2
  import sys

  from psycopg2.extras import Json
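For context on the import that stays (`from psycopg2.extras import Json`): `Json` adapts a Python dict so it can be written to a PostgreSQL `json`/`jsonb` column. A minimal, hypothetical sketch of that usage — the connection settings and table name are placeholders, not the benchmark's actual schema:

```python
import psycopg2
from psycopg2.extras import Json

# Placeholder connection settings; the real entrypoint gets these from its own config.
conn = psycopg2.connect(host="localhost", dbname="benchmarks", user="bench", password="bench")
with conn, conn.cursor() as cur:
    # Json(...) wraps a Python dict so psycopg2 can store it in a json/jsonb column.
    cur.execute(
        "INSERT INTO model_measurements (metrics) VALUES (%s)",
        [Json({"time_to_first_token": 0.42, "tokens_per_second": 58.0})],
    )
conn.close()
```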

benchmark/llama.py

Lines changed: 4 additions & 4 deletions
@@ -215,7 +215,7 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
  torch.cuda.synchronize()
  end = perf_counter()
  time_to_second_token = end - start
- logger.info(f"completed second compile generation in: {time_to_first_token}s")
+ logger.info(f"completed second compile generation in: {time_to_second_token}s")
  cache_position += 1
  all_generated_tokens += next_token.clone().detach().cpu().tolist()

@@ -227,7 +227,7 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
  torch.cuda.synchronize()
  end = perf_counter()
  time_to_third_token = end - start
- logger.info(f"completed third compile forward in: {time_to_first_token}s")
+ logger.info(f"completed third compile forward in: {time_to_third_token}s")
  cache_position += 1
  all_generated_tokens += next_token.clone().detach().cpu().tolist()

@@ -298,7 +298,7 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
  output = model.generate(**inputs, past_key_values=past_key_values)
  end = perf_counter()
  third_compile_generate_time = end - start
- logger.info(f"completed second compile generation in: {third_compile_generate_time}s")
+ logger.info(f"completed third compile generation in: {third_compile_generate_time}s")
  logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")

  past_key_values = StaticCache(
@@ -313,7 +313,7 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
  output = model.generate(**inputs, past_key_values=past_key_values)
  end = perf_counter()
  fourth_compile_generate_time = end - start
- logger.info(f"completed second compile generation in: {fourth_compile_generate_time}s")
+ logger.info(f"completed fourth compile generation in: {fourth_compile_generate_time}s")
  logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")

  metrics_recorder.collect_model_measurements(
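The corrected log lines above all follow the same timing pattern: call `torch.cuda.synchronize()` so queued GPU work finishes before reading `perf_counter()`, then log the elapsed interval under the matching variable name. A minimal, standalone sketch of that pattern (a toy matmul stands in for the benchmark's generate/forward calls):

```python
import torch
from time import perf_counter

# Synchronize before reading the clock so the measured interval covers the queued
# GPU work, not just the kernel launch.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

start = perf_counter()
y = x @ x  # stands in for a generate()/forward() call
if device == "cuda":
    torch.cuda.synchronize()
end = perf_counter()
time_to_result = end - start
print(f"completed matmul in: {time_to_result}s")
```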

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -415,6 +415,8 @@
    title: DeBERTa
  - local: model_doc/deberta-v2
    title: DeBERTa-v2
+ - local: model_doc/deepseek_v3
+   title: DeepSeek-V3
  - local: model_doc/dialogpt
    title: DialoGPT
  - local: model_doc/diffllama

docs/source/en/attention_interface.md

Lines changed: 30 additions & 8 deletions
@@ -23,13 +23,13 @@ supported models.
  Most recent models can now switch from one attention function used in the Attention layer to the other, thanks to a simple mapping.
  By default, we provide the implementation for [`sdpa`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html),
  [`flash_attention_2`](https://github.com/Dao-AILab/flash-attention) and [`flex_attention`](https://pytorch.org/docs/stable/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention)
- as well as `eager`, which is simple matrix multiplication without any optimization on top.
+ as well as `eager`, which is a simple matrix multiplication without any optimization on top.
  This is the setting you can usually choose when instantiating a model:

  ```python
  from transformers import AutoModelForCausalLM

- model_id = "meta-llama/Llama-3.2-1B
+ model_id = "meta-llama/Llama-3.2-1B"

  # Here, using flash attention as an example
  model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
@@ -43,7 +43,7 @@ from transformers import AutoModelForCausalLM, AttentionInterface
  from transformers.integrations.sdpa_attention import sdpa_attention_forward
  import torch

- model_id = "meta-llama/Llama-3.2-1B
+ model_id = "meta-llama/Llama-3.2-1B"

  def my_new_sdpa(*args, **kwargs):
      print("I just entered the attention computation")
@@ -56,7 +56,7 @@ model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="my_n
  model(torch.ones(1, 5, dtype=int))
  ```

- You will see it prints "I just entered the attention computation" as many times as there are layers in the model (with this example, 16 times.
+ You will see it prints "I just entered the attention computation" as many times as there are layers in the model (with this example, 16 times).

  ## Dynamically switching attention function

@@ -70,12 +70,12 @@ model(torch.ones(1, 5, dtype=int))
  ```

  and it will stop printing the statements, as it now uses the `sdpa` attention.
- This allows to quickly change attention function, without needing to reload the model!
+ This allows to quickly change an attention function, without needing to reload the model!

- ## What about new args needed in my custom function?
+ ## What about new args needed in my custom attention function?

  But indeed, what if the new function requires a new arg to be properly used? It's no issue! Models supporting the
- `AttentionInterface` propagates kwargs all the way to the Attention layers, and to the attention function used. That way,
+ `AttentionInterface` propagate kwargs all the way to the Attention layers, and to the used attention function. That way,
  you can simply pass the arg (as a kwargs, i.e. you need to qualify the name of the arg) in the model's forward, and it will be correctly used in the attention. However, custom attention functions have some limitations. In particular, it must follow the signature and return format of other attention functions, i.e.

  ```python
@@ -103,4 +103,26 @@ model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="cust
  model(torch.ones(1, 5, dtype=int), a_new_kwargs=..., another_new_kwargs=...)
  ```

- If in doubt about what args/kwargs a given model sends to the attention function, simply check that model's modeling code on [GitHub](https://github.com/huggingface/transformers/tree/main/src/transformers/models)!
+ If in doubt about what args/kwargs a given model sends to the attention function, simply check that model's modeling code on [GitHub](https://github.com/huggingface/transformers/tree/main/src/transformers/models)!
+
+ ## Accessing current available implementations
+
+ Most of the time, you will simply need to `register` a new function. If, however, you need to access an existing one,
+ and/or perform a few checks, the prefered way is to use the global `ALL_ATTENTION_FUNCTIONS`. It behaves the same way you
+ would expect from a usual Python dictionary:
+
+ ```python
+ >>> from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
+
+ >>> list(ALL_ATTENTION_FUNCTIONS.keys())
+ >>> ['flash_attention_2', 'flex_attention', 'sdpa']
+
+ >>> ALL_ATTENTION_FUNCTIONS["sdpa"]
+ >>> <function transformers.integrations.sdpa_attention.sdpa_attention_forward>
+
+ >>> ALL_ATTENTION_FUNCTIONS.get("sdpa", None)
+ >>> <function transformers.integrations.sdpa_attention.sdpa_attention_forward>
+
+ # You can also globally `register` a new function directly on it
+ >>> ALL_ATTENTION_FUNCTIONS.register("new_func", new_func)
+ ```
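Tying the snippets in this diff together, here is a small sketch that wraps the stock SDPA kernel, registers it globally via `ALL_ATTENTION_FUNCTIONS.register`, and then selects it by name. It only uses calls shown in the doc above, but `logged_sdpa` and the gated `meta-llama/Llama-3.2-1B` checkpoint are illustrative choices rather than anything this commit prescribes:

```python
import torch
from transformers import AutoModelForCausalLM
from transformers.integrations.sdpa_attention import sdpa_attention_forward
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

# Wrap the stock SDPA kernel so every attention call is visible.
def logged_sdpa(*args, **kwargs):
    print("I just entered the attention computation")
    return sdpa_attention_forward(*args, **kwargs)

# Register the wrapper globally (as in the new section above), then select it
# like any built-in implementation.
ALL_ATTENTION_FUNCTIONS.register("logged_sdpa", logged_sdpa)

model_id = "meta-llama/Llama-3.2-1B"  # any model supporting the interface should work
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="logged_sdpa")
model(torch.ones(1, 5, dtype=int))
```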
docs/source/en/model_doc/deepseek_v3.md

Lines changed: 184 additions & 0 deletions (new file)
@@ -0,0 +1,184 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# DeepSeek-V3

## Overview

The DeepSeek-V3 model was proposed in [DeepSeek-V3 Technical Report](https://arxiv.org/abs/2412.19437) by the DeepSeek-AI Team.

The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

## Limitations and call for contribution!

We are super happy to make this code community-powered, and would love to see how you can best optimize the following:

- current implementation uses the "naive" attention computation (so not really MLA)
- current implementation loops through the experts. This should be replaced. Pointers to use `get_packed_weights` from `integrations/tensor_parallel`.
- current implementation uses the EleutherAI formula for RoPE, using the original one would be more efficient! (should still follow our API)
- static cache is not supported (this should be just a generation config issue / config shape issues)

### Usage tips
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.

You can run the model in `FP8` automatically; using 2 nodes of 8 H100s should be more than enough!

```python
# `run_deepseek_v1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)

tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]


model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
import time
start = time.time()
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
print(time.time()-start)
```
This generated:

``````
<|Assistant|><think>
Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.

First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.

They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.

In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.

I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.

Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.

Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.

Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.

Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.

I think that's a solid approach. Let me structure it step by step to make it clear.
</think>

Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!

---

### **Step 1: Raw Conversation History**
Suppose we have this conversation:
- **User**: "Hello, how are you?"
- **Assistant**: "I'm doing great. How can I help you today?"
- **User**: "I'd like to show off how chat templating works!"

---

### **Step 2: Structured Messages**
In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with `role` and `content`:
```python
messages = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
{"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```

---

### **Step 3: Apply a Chat Template**
A **chat template** converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):

```jinja
{% for message in messages %}
{% if message['role'] == 'user' %}
<|user|>{{ message['content'] }}<|end|>
{% elif message['role'] == 'assistant' %}
<|assistant|>{{ message['content'] }}<|end|>
{% endif %}
{% endfor %}
<|assistant|>
```

---

### **Step 4: Final Templated Output**
Applying the template to our `messages` list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```

This tells the model:
1. The conversation history (user/assistant turns).
2. The model’s turn to generate a response (`<|assistant|>` at the end).

---

### **Key Notes**:
- **Role Separation**: Tags like `<|user|>` and `<|assistant|>` help the model distinguish speakers.
- **Special Tokens**: Models often use unique tokens (e.g., `<|end|>`) to mark message boundaries.
- **Flexibility**: Templates vary by model (e.g., OpenAI uses `{"role": "user", "content": "..."}` instead of tags).

---

### **Why This Matters**:
- **Consistency**: Ensures the model understands dialogue structure.
- **Context Preservation**: Maintains the flow of multi-turn conversations.
- **Alignment**: Matches the format the model was trained on for better performance.

Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<|end▁of▁sentence|>
``````
Use the following to run it
```bash
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
```

If you have:
```bash
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found
```
error, it means NCCL was probably not loaded.
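A hedged first step for that error (`Bootstrap : no socket interface found` typically means NCCL could not pick a usable network interface) is to set NCCL's standard environment variables before launching; these are generic NCCL settings rather than anything the model card above specifies, and `eth0` is a placeholder for your actual NIC:

```python
import os

# Generic NCCL debugging defaults (assumed useful here; adjust for your cluster).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # which network interface NCCL bootstraps on
os.environ.setdefault("NCCL_DEBUG", "INFO")          # make NCCL print its own diagnostics
```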
## DeepseekV3Config

[[autodoc]] DeepseekV3Config

## DeepseekV3Model

[[autodoc]] DeepseekV3Model
    - forward

## DeepseekV3ForCausalLM

[[autodoc]] DeepseekV3ForCausalLM
    - forward
