
Commit a4ed16f

Merge branch 'main' of https://github.com/huggingface/transformers into finish-fix-spm-models
2 parents: 4b5315b + c385de2

446 files changed: +8563 additions, -5993 deletions


.circleci/create_circleci_config.py

Lines changed: 1 addition & 0 deletions
@@ -285,6 +285,7 @@ def job_name(self):
 "pip install -U --upgrade-strategy eager git+https://github.com/huggingface/accelerate",
 ],
 parallelism=1,
+pytest_num_workers=8,
 )

Makefile

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@ fix-copies:
 python utils/check_copies.py --fix_and_overwrite
 python utils/check_table.py --fix_and_overwrite
 python utils/check_dummies.py --fix_and_overwrite
+python utils/check_doctest_list.py --fix_and_overwrite
 python utils/check_task_guides.py --fix_and_overwrite
 
 # Run tests for the library

awesome-transformers.md

Lines changed: 6 additions & 0 deletions
@@ -601,3 +601,9 @@ All Hugging Face models and pipelines can be seamlessly integrated into BentoML
 
 Keywords: BentoML, Framework, Deployment, AI Applications
 
+## [LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)
+
+[LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) offers a user-friendly fine-tuning framework that incorporates PEFT. The repository includes training (fine-tuning) and inference examples for LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, and other LLMs. A ChatGLM version is also available in [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning).
+
+Keywords: PEFT, fine-tuning, LLaMA-2, ChatGLM, Qwen

docker/transformers-all-latest-gpu/Dockerfile

Lines changed: 6 additions & 3 deletions
@@ -31,7 +31,7 @@ RUN echo torch=$VERSION
 # TODO: We might need to specify proper versions that work with a specific torch version (especially for past CI).
 RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
 
-RUN python3 -m pip install --no-cache-dir -U tensorflow==2.12 protobuf==3.20.3 tensorflow_text tensorflow_probability
+RUN python3 -m pip install --no-cache-dir -U tensorflow==2.13 protobuf==3.20.3 tensorflow_text tensorflow_probability
 
 RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime]
 
@@ -47,8 +47,11 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/acc
 # Add bitsandbytes for mixed int8 testing
 RUN python3 -m pip install --no-cache-dir bitsandbytes
 
-# For bettertransformer
-RUN python3 -m pip install --no-cache-dir optimum
+# Add auto-gptq for gptq quantization testing
+RUN python3 -m pip install --no-cache-dir auto-gptq
+
+# For bettertransformer + gptq
+RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
 
 # For video model testing
 RUN python3 -m pip install --no-cache-dir decord av==9.2.0

docker/transformers-tensorflow-gpu/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ RUN git clone https://github.com/huggingface/transformers && cd transformers &&
 RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-tensorflow,testing]
 
 # If set to nothing, will install the latest version
-ARG TENSORFLOW='2.12'
+ARG TENSORFLOW='2.13'
 
 RUN [ ${#TENSORFLOW} -gt 0 ] && VERSION='tensorflow=='$TENSORFLOW'.*' || VERSION='tensorflow'; python3 -m pip install --no-cache-dir -U $VERSION
 RUN python3 -m pip uninstall -y torch flax

docs/source/de/quicktour.md

Lines changed: 6 additions & 0 deletions
@@ -68,11 +68,13 @@ Installieren Sie die folgenden Abhängigkeiten, falls Sie dies nicht bereits get
 
 <frameworkcontent>
 <pt>
+
 ```bash
 pip install torch
 ```
 </pt>
 <tf>
+
 ```bash
 pip install tensorflow
 ```
@@ -226,6 +228,7 @@ Genau wie die [`pipeline`] akzeptiert der Tokenizer eine Liste von Eingaben. Dar
 
 <frameworkcontent>
 <pt>
+
 ```py
 >>> pt_batch = tokenizer(
 ...     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
@@ -237,6 +240,7 @@ Genau wie die [`pipeline`] akzeptiert der Tokenizer eine Liste von Eingaben. Dar
 ```
 </pt>
 <tf>
+
 ```py
 >>> tf_batch = tokenizer(
 ...     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
@@ -375,6 +379,7 @@ Ein besonders cooles 🤗 Transformers-Feature ist die Möglichkeit, ein Modell
 
 <frameworkcontent>
 <pt>
+
 ```py
 >>> from transformers import AutoModel
 
@@ -383,6 +388,7 @@ Ein besonders cooles 🤗 Transformers-Feature ist die Möglichkeit, ein Modell
 ```
 </pt>
 <tf>
+
 ```py
 >>> from transformers import TFAutoModel
 
docs/source/en/_toctree.yml

Lines changed: 11 additions & 2 deletions
@@ -23,6 +23,8 @@
 title: Share your model
 - local: transformers_agents
 title: Agents
+- local: llm_tutorial
+title: Generation with LLMs
 title: Tutorials
 - sections:
 - sections:
@@ -73,18 +75,23 @@
 title: Image captioning
 - local: tasks/document_question_answering
 title: Document Question Answering
+- local: tasks/visual_question_answering
+title: Visual Question Answering
 - local: tasks/text-to-speech
 title: Text to speech
 title: Multimodal
 isExpanded: false
+- sections:
+- local: generation_strategies
+title: Customize the generation strategy
+title: Generation
+isExpanded: false
 title: Task Guides
 - sections:
 - local: fast_tokenizers
 title: Use fast tokenizers from 🤗 Tokenizers
 - local: multilingual
 title: Run inference with multilingual models
-- local: generation_strategies
-title: Customize text generation strategy
 - local: create_a_model
 title: Use model-specific APIs
 - local: custom_models
@@ -147,6 +154,8 @@
 title: Troubleshooting
 - local: tf_xla
 title: XLA Integration for TensorFlow Models
+- local: perf_torch_compile
+title: Optimize inference using `torch.compile()`
 title: Performance and scalability
 - sections:
 - local: contributing

docs/source/en/add_new_model.md

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ own regarding how code should be written :-)
 1. The forward pass of your model should be fully written in the modeling file while being fully independent of other
    models in the library. If you want to reuse a block from another model, copy the code and paste it with a
    `# Copied from` comment on top (see [here](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/models/roberta/modeling_roberta.py#L160)
-   for a good example).
+   for a good example and [there](pr_checks#check-copies) for more documentation on Copied from).
 2. The code should be fully understandable, even by a non-native English speaker. This means you should pick
    descriptive variable names and avoid abbreviations. As an example, `activation` is preferred to `act`.
    One-letter variable names are strongly discouraged unless it's an index in a for loop.

docs/source/en/llm_tutorial.md

Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Generation with LLMs

[[open-in-colab]]

LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate than a single model call to generate new sentences -- you need to do autoregressive generation.

Autoregressive generation is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs. In 🤗 Transformers, this is handled by the [`~generation.GenerationMixin.generate`] method, which is available to all models with generative capabilities.

This tutorial will show you how to:

* Generate text with an LLM
* Avoid common pitfalls
* Take the next steps to get the most out of your LLM

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers bitsandbytes>=0.39.0 -q
```

## Generate text

A language model trained for [causal language modeling](tasks/language_modeling) takes a sequence of text tokens as input and returns the probability distribution for the next token.

<!-- [GIF 1 -- FWD PASS] -->
<figure class="image table text-center m-0 w-full">
    <video
        style="max-width: 90%; margin: auto;"
        autoplay loop muted playsinline
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_1_1080p.mov"
    ></video>
    <figcaption>"Forward pass of an LLM"</figcaption>
</figure>
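
As an illustrative aside (not part of the original tutorial), the sketch below runs a single forward pass and inspects the resulting next-token distribution. The small `gpt2` checkpoint is an arbitrary choice to keep the example light; any causal LM behaves the same way.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small checkpoint chosen only for illustration -- swap in any causal LM you like.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A list of colors: red, blue", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length, vocab_size)

# The last position holds the prediction for the token that comes next.
next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
print(tokenizer.convert_ids_to_tokens(top_ids[0].tolist()))  # the five most likely continuations
```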

A critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution. Anything goes in this step as long as you end up with a token for the next iteration. This means it can be as simple as selecting the most likely token from the probability distribution or as complex as applying a dozen transformations before sampling from the resulting distribution.

<!-- [GIF 2 -- TEXT GENERATION] -->
<figure class="image table text-center m-0 w-full">
    <video
        style="max-width: 90%; margin: auto;"
        autoplay loop muted playsinline
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_2_1080p.mov"
    ></video>
    <figcaption>"Autoregressive generation iteratively selects the next token from a probability distribution to generate text"</figcaption>
</figure>
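
Continuing the illustrative sketch above, the two simplest selection rules -- picking the most likely token versus sampling -- and the append-and-repeat loop look roughly like this:

```py
# Greedy selection takes the most likely token; sampling draws one at random in
# proportion to its probability. Both reuse `next_token_probs` from the sketch above.
greedy_id = torch.argmax(next_token_probs, dim=-1, keepdim=True)
sampled_id = torch.multinomial(next_token_probs, num_samples=1)

# Autoregressive loop: append the chosen token and call the model again.
# A fixed 10-token budget stands in for a proper stopping condition here.
generated = inputs["input_ids"]
for _ in range(10):
    with torch.no_grad():
        logits = model(generated).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice at each step
    generated = torch.cat([generated, next_id], dim=-1)
print(tokenizer.decode(generated[0]))
```

In practice you rarely write this loop yourself: `generate` implements it for you, together with stopping conditions, caching, and the full range of token selection strategies.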

The process depicted above is repeated iteratively until some stopping condition is reached. Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (`EOS`) token. If this is not the case, generation stops when some predefined maximum length is reached.

Properly setting up the token selection step and the stopping condition is essential to make your model behave as you'd expect on your task. That is why we have a [`~generation.GenerationConfig`] file associated with each model, which contains a good default generative parameterization and is loaded alongside your model.

Let's talk code!

<Tip>

If you're interested in basic LLM usage, our high-level [`Pipeline`](pipeline_tutorial) interface is a great starting point. However, LLMs often require advanced features like quantization and fine control of the token selection step, which is best done through [`~generation.GenerationMixin.generate`]. Autoregressive generation with LLMs is also resource-intensive and should be executed on a GPU for adequate throughput.

</Tip>

<!-- TODO: update example to llama 2 (or a newer popular baseline) when it becomes ungated -->
First, you need to load the model.

```py
>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained(
...     "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True
... )
```

You'll notice two flags in the `from_pretrained` call:

- `device_map` ensures the model is moved to your GPU(s)
- `load_in_4bit` applies [4-bit dynamic quantization](main_classes/quantization) to massively reduce the resource requirements

There are other ways to initialize a model, but this is a good baseline to begin with an LLM.

Next, you need to preprocess your text input with a [tokenizer](tokenizer_summary).

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
```

The `model_inputs` variable holds the tokenized text input, as well as the attention mask. While [`~generation.GenerationMixin.generate`] does its best to infer the attention mask when it is not passed, we recommend passing it whenever possible for optimal results.
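
As an aside (not part of the original walkthrough), unpacking `model_inputs` is equivalent to passing the tensors by name; spelling out the attention mask makes the recommendation above explicit:

```py
>>> # Same call as model.generate(**model_inputs), with the attention mask passed explicitly.
>>> generated_ids = model.generate(
...     input_ids=model_inputs["input_ids"], attention_mask=model_inputs["attention_mask"]
... )
```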

Finally, call the [`~generation.GenerationMixin.generate`] method to return the generated tokens, which should be converted to text before printing.

```py
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'A list of colors: red, blue, green, yellow, black, white, and brown'
```

And that's it! In a few lines of code, you can harness the power of an LLM.


## Common pitfalls

There are many [generation strategies](generation_strategies), and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them.

```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
>>> tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
>>> model = AutoModelForCausalLM.from_pretrained(
...     "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True
... )
```

### Generated output is too short/long

If not specified in the [`~generation.GenerationConfig`] file, `generate` returns up to 20 tokens by default. We highly recommend manually setting `max_new_tokens` in your `generate` call to control the maximum number of new tokens it can return. Keep in mind LLMs (more precisely, [decoder-only models](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)) also return the input prompt as part of the output.

```py
>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda")

>>> # By default, the output will contain up to 20 tokens
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'A sequence of numbers: 1, 2, 3, 4, 5'

>>> # Setting `max_new_tokens` allows you to control the maximum length
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=50)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,'
```
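
To see where defaults like the 20-token limit come from, you can inspect the generation config attached to the loaded model. A brief sketch, reusing the `model` from the setup block above (the exact values depend on the checkpoint):

```py
>>> # The defaults that `generate` falls back to are stored on the model's generation config.
>>> model.generation_config
>>> model.generation_config.max_length  # library default is 20 unless the checkpoint overrides it
```

Arguments passed directly to `generate`, such as `max_new_tokens` above, take precedence over these defaults.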

### Incorrect generation mode

By default, and unless specified in the [`~generation.GenerationConfig`] file, `generate` selects the most likely token at each iteration (greedy decoding). Depending on your task, this may be undesirable; creative tasks like chatbots or writing an essay benefit from sampling. On the other hand, input-grounded tasks like audio transcription or translation benefit from greedy decoding. Enable sampling with `do_sample=True`, and you can learn more about this topic in this [blog post](https://huggingface.co/blog/how-to-generate).

```py
>>> # Set seed for reproducibility -- you don't need this unless you want full reproducibility
>>> from transformers import set_seed
>>> set_seed(0)

>>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda")

>>> # LLM + greedy decoding = repetitive, boring output
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'I am a cat. I am a cat. I am a cat. I am a cat'

>>> # With sampling, the output becomes more creative!
>>> generated_ids = model.generate(**model_inputs, do_sample=True)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'I am a cat.\nI just need to be. I am always.\nEvery time'
```
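
If you would rather keep decoding choices in one reusable object than scatter keyword arguments across calls, `generate` also accepts a `generation_config`. A short sketch with illustrative parameter values:

```py
>>> from transformers import GenerationConfig

>>> # Illustrative values only -- bundle the sampling setup in a single object.
>>> sampling_config = GenerationConfig(do_sample=True, top_k=50, max_new_tokens=30)
>>> generated_ids = model.generate(**model_inputs, generation_config=sampling_config)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```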

### Wrong padding side

LLMs are [decoder-only](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt) architectures, meaning they continue to iterate on your input prompt. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded. Make sure you also don't forget to pass the attention mask to generate!

```py
>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence,
>>> # which is shorter, has padding on the right side. Generation fails.
>>> model_inputs = tokenizer(
...     ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
... ).to("cuda")
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)[0]
''

>>> # With left-padding, it works as expected!
>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", padding_side="left")
>>> tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
>>> model_inputs = tokenizer(
...     ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
... ).to("cuda")
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'1, 2, 3, 4, 5, 6,'
```

<!-- TODO: when the prompting guide is ready, mention the importance of setting the right prompt in this section -->

## Further resources

While the autoregressive generation process is relatively straightforward, making the most out of your LLM can be a challenging endeavor because there are many moving parts. Here are some next steps to help you dive deeper into LLM usage and understanding:

<!-- TODO: complete with new guides -->
### Advanced generate usage

1. [Guide](generation_strategies) on how to control different generation methods, how to set up the generation configuration file, and how to stream the output;
2. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils).

### LLM leaderboards

1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which focuses on the quality of the open-source models;
2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard), which focuses on LLM throughput.

### Latency and throughput

1. [Guide](main_classes/quantization) on dynamic quantization, which shows you how to drastically reduce your memory requirements.

### Related libraries

1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs;
2. [`optimum`](https://github.com/huggingface/optimum), an extension of 🤗 Transformers that optimizes for specific hardware devices.
