
Commit edd68f4

🚨 No more default chat templates (huggingface#31733)
* No more default chat templates
* Add the template to the GPT-SW3 tests since it's not available by default now
* Fix GPT2 test
* Fix Bloom test
* Fix Bloom test
* Remove default templates again
1 parent 1c122a4 commit edd68f4

29 files changed (+28, −747 lines)

docs/source/en/chat_templating.md (+1, −18)
@@ -580,7 +580,7 @@ default template for that model class is used instead. Let's take a look at the
 >>> from transformers import AutoTokenizer
 >>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

->>> tokenizer.default_chat_template
+>>> tokenizer.chat_template
 "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
 ```
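For context, a minimal sketch (not part of this diff) of what the Blenderbot template above renders; the output shown assumes Blenderbot's `</s>` EOS token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great!"},
]

# The template joins messages with single spaces (prefixing user turns with
# a space) and appends the EOS token at the very end.
print(tokenizer.apply_chat_template(messages, tokenize=False))
# " Hello, how are you? I'm doing great!</s>"
```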

@@ -704,23 +704,6 @@ with other names, pass the name of the template you want to the `chat_template`
 We find that this can be a bit confusing for users, though - so if you're writing a template yourself, we recommend
 trying to put it all in a single template where possible!

-### What are "default" templates?
-
-Before the introduction of chat templates, chat handling was hardcoded at the model class level. For backwards
-compatibility, we have retained this class-specific handling as default templates, also set at the class level. If a
-model does not have a chat template set, but there is a default template for its model class, the `TextGenerationPipeline`
-class and methods like `apply_chat_template` will use the class template instead. You can find out what the default
-template for your tokenizer is by checking the `tokenizer.default_chat_template` attribute.
-
-This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when
-the class template is appropriate for your model, we strongly recommend overriding the default template by
-setting the `chat_template` attribute explicitly to make it clear to users that your model has been correctly configured
-for chat.
-
-Now that actual chat templates have been adopted more widely, default templates have been deprecated and will be
-removed in a future release. We strongly recommend setting the `chat_template` attribute for any tokenizers that
-still depend on them!
-
 ### What template should I use?

 When setting the template for a model that's already been trained for chat, you should ensure that the template
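The migration path the removed section pointed to is simply pinning the template on the tokenizer itself. A minimal sketch, reusing the Blenderbot template string from the hunk above (the output directory name is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

# Set the template explicitly instead of relying on a (now removed)
# class-level default.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
    "{{ message['content'] }}"
    "{% if not loop.last %}{{ ' ' }}{% endif %}"
    "{% endfor %}"
    "{{ eos_token }}"
)

# save_pretrained() persists chat_template in tokenizer_config.json.
tokenizer.save_pretrained("blenderbot-with-explicit-template")  # illustrative path
```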

docs/source/es/chat_templating.md (+1, −7)
@@ -220,7 +220,7 @@ The chat template for a model is stored in the `tokenizer.chat_t
 >>> from transformers import AutoTokenizer
 >>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

->>> tokenizer.default_chat_template
+>>> tokenizer.chat_template
 "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
 ```

@@ -307,12 +307,6 @@ If you are fine-tuning a model for chat, in addition to setting a chat templa

 </Tip>

-### What are "default" templates?
-
-Before the introduction of chat templates, chat handling was hardcoded at the model class level. For backwards compatibility, we have kept this class-specific handling as default templates, also set at the class level. If a model does not have a chat template set, but there is a default template for its model class, the `TextGenerationPipeline` class and methods like `apply_chat_template` will use the class template instead. You can find out what the default template for your tokenizer is by checking the `tokenizer.default_chat_template` attribute.
-
-We do this purely for backwards-compatibility reasons, to avoid breaking any existing workflows. Even when the class template is appropriate for your model, we strongly recommend overriding the default template by explicitly setting the `chat_template` attribute, to make it clear to users that your model has been configured correctly for chat, and to be future-proof in case the default templates are ever altered or removed.
-
 ### What template should I use?

 When you set the template for a model that has already been trained for chat, you should make sure the template exactly matches the message format the model saw during training; otherwise you will likely see performance degradation. This holds even if you are training the model further: you will probably get the best performance if you keep the chat tokens constant. This is closely analogous to tokenization: you generally get the best performance for inference or fine-tuning when you precisely match the tokenization used during training.

docs/source/ja/chat_templating.md (+1, −1)
@@ -85,7 +85,7 @@ One increasingly common use case for LLMs (Language Models) is "ch
 >>> from transformers import AutoTokenizer
 >>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

->>> tokenizer.default_chat_template
+>>> tokenizer.chat_template
 "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
 ```

docs/source/zh/chat_templating.md (+1, −1)
@@ -228,7 +228,7 @@ The sun.</s>
 >>> from transformers import AutoTokenizer
 >>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

->>> tokenizer.default_chat_template
+>>> tokenizer.chat_template
 "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
 ```

src/transformers/models/blenderbot/tokenization_blenderbot.py (−14)
@@ -405,17 +405,3 @@ def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1:
             `List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
         """
         return token_ids_0 + [self.eos_token_id]
-
-    @property
-    def default_chat_template(self):
-        """
-        A very simple chat template that just adds whitespace between messages.
-        """
-        return (
-            "{% for message in messages %}"
-            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
-            "{{ message['content'] }}"
-            "{% if not loop.last %}{{ ' ' }}{% endif %}"
-            "{% endfor %}"
-            "{{ eos_token }}"
-        )
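Callers that still need the removed template can also pass it per call via `apply_chat_template`'s `chat_template` argument rather than storing it on the tokenizer. A sketch, reusing the template string deleted above:

```python
from transformers import AutoTokenizer

# The Blenderbot whitespace-joining template removed in this commit.
BLENDERBOT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
    "{{ message['content'] }}"
    "{% if not loop.last %}{{ ' ' }}{% endif %}"
    "{% endfor %}"
    "{{ eos_token }}"
)

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
messages = [{"role": "user", "content": "Hi!"}]

# chat_template= overrides whatever template the tokenizer carries,
# for this one call only.
ids = tokenizer.apply_chat_template(messages, chat_template=BLENDERBOT_TEMPLATE)
```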

src/transformers/models/blenderbot/tokenization_blenderbot_fast.py (−15)
@@ -287,18 +287,3 @@ def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1:
             `List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
         """
         return token_ids_0 + [self.eos_token_id]
-
-    @property
-    # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template
-    def default_chat_template(self):
-        """
-        A very simple chat template that just adds whitespace between messages.
-        """
-        return (
-            "{% for message in messages %}"
-            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
-            "{{ message['content'] }}"
-            "{% if not loop.last %}{{ ' ' }}{% endif %}"
-            "{% endfor %}"
-            "{{ eos_token }}"
-        )

src/transformers/models/blenderbot_small/tokenization_blenderbot_small.py (−15)
@@ -217,18 +217,3 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] =
                 index += 1

         return vocab_file, merge_file
-
-    @property
-    # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template
-    def default_chat_template(self):
-        """
-        A very simple chat template that just adds whitespace between messages.
-        """
-        return (
-            "{% for message in messages %}"
-            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
-            "{{ message['content'] }}"
-            "{% if not loop.last %}{{ ' ' }}{% endif %}"
-            "{% endfor %}"
-            "{{ eos_token }}"
-        )

src/transformers/models/blenderbot_small/tokenization_blenderbot_small_fast.py (−15)
@@ -98,18 +98,3 @@ def create_token_type_ids_from_sequences(
         if token_ids_1 is None:
             return len(cls + token_ids_0 + sep) * [0]
         return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
-
-    @property
-    # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template
-    def default_chat_template(self):
-        """
-        A very simple chat template that just adds whitespace between messages.
-        """
-        return (
-            "{% for message in messages %}"
-            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
-            "{{ message['content'] }}"
-            "{% if not loop.last %}{{ ' ' }}{% endif %}"
-            "{% endfor %}"
-            "{{ eos_token }}"
-        )

src/transformers/models/bloom/tokenization_bloom_fast.py (−8)
@@ -147,11 +147,3 @@ def _encode_plus(self, *args, **kwargs) -> BatchEncoding:
     def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
         files = self._tokenizer.model.save(save_directory, name=filename_prefix)
         return tuple(files)
-
-    @property
-    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.default_chat_template
-    def default_chat_template(self):
-        """
-        A simple chat template that ignores role information and just concatenates messages with EOS tokens.
-        """
-        return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"

src/transformers/models/code_llama/tokenization_code_llama.py (−55)
@@ -437,61 +437,6 @@ def create_token_type_ids_from_sequences(

         return output

-    @property
-    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template
-    def default_chat_template(self):
-        """
-        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
-        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
-        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
-        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
-        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
-        to fine-tune a model with more flexible role ordering!
-
-        The output should look something like:
-
-        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
-        <bos>[INST] Prompt [/INST]
-
-        The reference for this chat template is [this code
-        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
-        in the original repository.
-        """
-        template = (
-            "{% if messages[0]['role'] == 'system' %}"
-            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
-            "{% set system_message = messages[0]['content'] %}"
-            "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
-            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
-            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
-            "{% else %}"
-            "{% set loop_messages = messages %}"
-            "{% set system_message = false %}"
-            "{% endif %}"
-            "{% for message in loop_messages %}"  # Loop over all non-system messages
-            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
-            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
-            "{% endif %}"
-            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
-            "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
-            "{% else %}"
-            "{% set content = message['content'] %}"
-            "{% endif %}"
-            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
-            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
-            "{% elif message['role'] == 'system' %}"
-            "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
-            "{% elif message['role'] == 'assistant' %}"
-            "{{ ' ' + content.strip() + ' ' + eos_token }}"
-            "{% endif %}"
-            "{% endfor %}"
-        )
-        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
-        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
-        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
-
-        return template
-
     def __getstate__(self):
         state = self.__dict__.copy()
         state["sp_model"] = None

src/transformers/models/code_llama/tokenization_code_llama_fast.py (−55)
@@ -349,61 +349,6 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] =

         return (out_vocab_file,)

-    @property
-    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template
-    def default_chat_template(self):
-        """
-        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
-        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
-        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
-        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
-        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
-        to fine-tune a model with more flexible role ordering!
-
-        The output should look something like:
-
-        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
-        <bos>[INST] Prompt [/INST]
-
-        The reference for this chat template is [this code
-        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
-        in the original repository.
-        """
-        template = (
-            "{% if messages[0]['role'] == 'system' %}"
-            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
-            "{% set system_message = messages[0]['content'] %}"
-            "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
-            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
-            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
-            "{% else %}"
-            "{% set loop_messages = messages %}"
-            "{% set system_message = false %}"
-            "{% endif %}"
-            "{% for message in loop_messages %}"  # Loop over all non-system messages
-            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
-            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
-            "{% endif %}"
-            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
-            "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
-            "{% else %}"
-            "{% set content = message['content'] %}"
-            "{% endif %}"
-            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
-            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
-            "{% elif message['role'] == 'system' %}"
-            "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
-            "{% elif message['role'] == 'assistant' %}"
-            "{{ ' ' + content.strip() + ' ' + eos_token }}"
-            "{% endif %}"
-            "{% endfor %}"
-        )
-        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
-        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
-        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
-
-        return template
-
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
