
Commit a57efa4

stevhliu authored and zucchini-nlp committed
[docs] Model docs (huggingface#36469)
* initial
* fix
* fix
* update
* fix
* fixes
* quantization
* attention mask visualizer
* multimodal
* small changes
* fix code samples
1 parent 03d94ea commit a57efa4

File tree

7 files changed: +625 −693 lines


docs/source/en/model_doc/bert.md

Lines changed: 70 additions & 168 deletions
Large diffs are not rendered by default.

docs/source/en/model_doc/gemma3.md

Lines changed: 142 additions & 84 deletions
@@ -15,36 +15,63 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# Gemma3
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
 
-## Overview
+# Gemma 3
 
-The Gemma 3 model was proposed in the [Gemma 3 Techncial Report](https://goo.gle/Gemma3Report) by Google. It is a vision-language model composed by a [SigLIP](siglip) vision encoder and a [Gemma 2](gemma_2) language decoder, linked by a multimodal linear projection. It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding. One particularity is that the model uses bidirectional attention on all the image tokens. In addition, the model interleaves sliding window local attention with full causal attention in the language backbone, where each sixth layer is a full causal attention layer.
+[Gemma 3](https://goo.gle/Gemma3Report) is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 4B, 12B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer, support for a longer context length of 128K tokens, and a [SigLIP](./siglip) encoder that can "pan and scan" high-resolution images so that information isn't lost in very high-resolution images or images with non-square aspect ratios.
 
-This model was contributed by [Ryan Mullins](https://huggingface.co/RyanMullins), [Raushan Turganbay](https://huggingface.co/RaushanTurganbay) [Arthur Zucker](https://huggingface.co/ArthurZ), and [Pedro Cuenca](https://huggingface.co/pcuenq).
+The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.
 
+You can find all the original Gemma 3 checkpoints under the [Gemma 3](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) release.
 
-## Usage tips
+> [!TIP]
+> Click on the Gemma 3 models in the right sidebar for more examples of how to apply Gemma to different vision and language tasks.
 
+The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
 
-- For image+text and image-only inputs use `Gemma3ForConditionalGeneration`.
-- For text-only inputs use `Gemma3ForCausalLM` for generation to avoid loading the vision tower.
-- Each sample can contain multiple images, and the number of images can vary between samples. However, make sure to pass correctly batched images to the processor, where each batch is a list of one or more images.
-- The text passed to the processor should have a `<start_of_image>` token wherever an image should be inserted.
-- The processor has its own `apply_chat_template` method to convert chat messages to model inputs. See the examples below for more details on how to use it.
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
+```py
+import torch
+from transformers import pipeline
 
-### Image cropping for high resolution images
-
-The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set `do_pan_and_scan=True` to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher resolution images.
+pipeline = pipeline(
+    task="image-text-to-text",
+    model="google/gemma-3-4b-pt",
+    device=0,
+    torch_dtype=torch.bfloat16
+)
+pipeline(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+    text="<start_of_image> What is shown in this image?"
+)
+```
 
-Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.
+</hfoption>
+<hfoption id="AutoModel">
 
-```python
+```py
+import torch
+from transformers import AutoProcessor, Gemma3ForConditionalGeneration
 
-processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it", padding_side="left")
+model = Gemma3ForConditionalGeneration.from_pretrained(
+    "google/gemma-3-4b-it",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+processor = AutoProcessor.from_pretrained(
+    "google/gemma-3-4b-it",
+    padding_side="left"
+)
 
-url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
 messages = [
     {
         "role": "system",
@@ -54,7 +81,7 @@ messages = [
     },
     {
         "role": "user", "content": [
-            {"type": "image", "url": url},
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
             {"type": "text", "text": "What is shown in this image?"},
         ]
     },
@@ -65,59 +92,43 @@ inputs = processor.apply_chat_template(
     return_dict=True,
     return_tensors="pt",
     add_generation_prompt=True,
-    do_pan_and_scan=True,
-).to(model.device)
+).to("cuda")
 
+output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
+print(processor.decode(output[0], skip_special_tokens=True))
 ```
 
+</hfoption>
+<hfoption id="transformers-cli">
 
-## Usage Example
-
-### Single-image Inference
-
-```python
-from transformers import AutoProcessor, Gemma3ForConditionalGeneration
+```bash
+echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model google/gemma-3-1b-pt --device 0
+```
 
-model_id = "google/gemma-3-4b-it"
-model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
-processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
+</hfoption>
+</hfoptions>
 
-url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
-messages = [
-    {
-        "role": "system",
-        "content": [
-            {"type": "text", "text": "You are a helpful assistant."}
-        ]
-    },
-    {
-        "role": "user", "content": [
-            {"type": "image", "url": url},
-            {"type": "text", "text": "What is shown in this image?"},
-        ]
-    },
-]
-inputs = processor.apply_chat_template(
-    messages,
-    tokenize=True,
-    return_dict=True,
-    return_tensors="pt",
-    add_generation_prompt=True,
-).to(model.device)
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
 
-output = model.generate(**inputs, max_new_tokens=50)
-print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])
-```
+The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
 
-### Multi-image Inference
+```py
+# pip install torchao
+import torch
+from transformers import TorchAoConfig, Gemma3ForConditionalGeneration, AutoProcessor
 
-```python
-model_id = "google/gemma-3-4b-it"
-model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
-processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+model = Gemma3ForConditionalGeneration.from_pretrained(
+    "google/gemma-3-27b-it",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config
+)
+processor = AutoProcessor.from_pretrained(
+    "google/gemma-3-27b-it",
+    padding_side="left"
+)
 
-url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
-url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
 messages = [
     {
         "role": "system",
@@ -127,9 +138,8 @@ messages = [
     },
     {
        "role": "user", "content": [
-            {"type": "image", "url": url_cow},
-            {"type": "image", "url": url_stop},
-            {"type": "text", "text": "Are these two images identical?"},
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
+            {"type": "text", "text": "What is shown in this image?"},
         ]
     },
 ]
@@ -139,33 +149,81 @@ inputs = processor.apply_chat_template(
     return_dict=True,
     return_tensors="pt",
     add_generation_prompt=True,
-).to(model.device)
-
-output = model.generate(**inputs, max_new_tokens=50)
-print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])
+).to("cuda")
 
+output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
+print(processor.decode(output[0], skip_special_tokens=True))
 ```
 
-### Text-only inference
-
-You can use the VLMs for text-only generation by omitting images in your input. However, you can also load the models in text-only mode as shown below. This will skip loading the vision tower and will save resources when you just need the LLM capabilities.
-```python
-from transformers import AutoTokenizer, Gemma3ForCausalLM
-
-model_id = "google/gemma-3-1b-it"
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")
-
-input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)
-
-outputs = model.generate(**input_ids, max_new_tokens=100)
-text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
 
-print(text)
+```py
+from transformers.utils.attention_visualizer import AttentionMaskVisualizer
 
+visualizer = AttentionMaskVisualizer("google/gemma-3-4b-it")
+visualizer("<img>What is shown in this image?")
 ```
 
+## Notes
+
+- Use [`Gemma3ForConditionalGeneration`] for image-and-text and image-only inputs.
+- Gemma 3 supports multiple input images, but make sure the images are correctly batched before passing them to the processor. Each batch should be a list of one or more images.
+
+    ```py
+    url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
+    url_cat = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+
+    messages = [
+        {
+            "role": "system",
+            "content": [
+                {"type": "text", "text": "You are a helpful assistant."}
+            ]
+        },
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": url_cow},
+                {"type": "image", "url": url_cat},
+                {"type": "text", "text": "Which image is cuter?"},
+            ]
+        },
+    ]
+    ```
+- Text passed to the processor should have a `<start_of_image>` token wherever an image should be inserted.
+- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.
+- By default, images aren't cropped and only the base image is forwarded to the model. In high resolution images or images with non-square aspect ratios, artifacts can result because the vision encoder uses a fixed resolution of 896x896. To prevent these artifacts and improve performance during inference, set `do_pan_and_scan=True` to crop the image into multiple smaller patches and concatenate them with the base image embedding. You can disable pan and scan for faster inference.
+
+    ```diff
+    inputs = processor.apply_chat_template(
+        messages,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        add_generation_prompt=True,
+    +   do_pan_and_scan=True,
+    ).to("cuda")
+    ```
+- For text-only inputs, use [`AutoModelForCausalLM`] instead to skip loading the vision components and save resources.
+
+    ```py
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        "google/gemma-3-1b-pt",
+    )
+    model = AutoModelForCausalLM.from_pretrained(
+        "google/gemma-3-1b-pt",
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        attn_implementation="sdpa"
+    )
+    input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
+
+    output = model.generate(**input_ids, cache_implementation="static")
+    print(tokenizer.decode(output[0], skip_special_tokens=True))
+    ```
 
 ## Gemma3ImageProcessor
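
The new overview describes the attention layout (5 local sliding-window self-attention layers for every global self-attention layer) and the 128K-token context. A minimal sketch for checking these values on a released checkpoint is shown below; the attribute names `sliding_window`, `sliding_window_pattern`, and `max_position_embeddings` are assumptions about the published config and may differ across `transformers` versions, so they are read defensively with `getattr`.

```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-3-4b-it")
# Multimodal Gemma 3 configs nest the language model settings under text_config.
text_config = getattr(config, "text_config", config)

print(getattr(text_config, "max_position_embeddings", None))  # context length (expected to be on the order of 128K)
print(getattr(text_config, "sliding_window", None))            # local attention window size
print(getattr(text_config, "sliding_window_pattern", None))    # e.g. 6 -> 5 local layers per global layer
```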
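
The Notes added in this commit also point out that text passed to the processor needs a `<start_of_image>` token wherever an image goes. A minimal sketch of calling the processor directly, outside of `apply_chat_template`, is shown below; it assumes the Gemma 3 processor accepts `images` and `text` keyword arguments like other vision-language processors in `transformers`.

```py
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it", padding_side="left")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# <start_of_image> marks where the image embeddings are inserted into the prompt.
inputs = processor(
    images=image,
    text="<start_of_image> What is shown in this image?",
    return_tensors="pt",
)
print(inputs.keys())
```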