docs/source/en/model_doc/bert.md (1 addition, 1 deletion)
@@ -28,7 +28,7 @@ rendered properly in your Markdown viewer.
[BERT](https://huggingface.co/papers/1810.04805) is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding. BERT is also very versatile because its learned language representations can be adapted for other NLP tasks by fine-tuning an additional layer or head.
-You can find all the original BERT checkpoints under the BERT [collection](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc).
+You can find all the original BERT checkpoints under the [BERT](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc) collection.
> [!TIP]
> Click on the BERT models in the right sidebar for more examples of how to apply BERT to different language tasks.
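As a hedged illustration of the masked-token objective described in this hunk, here is a minimal `fill-mask` pipeline call; the checkpoint name `google-bert/bert-base-uncased` and the example sentence are assumptions for illustration, not taken from the diff.

```py
from transformers import pipeline

# BERT predicts the [MASK] token from both left and right context
fill_mask = pipeline("fill-mask", model="google-bert/bert-base-uncased")
print(fill_mask("Plants create [MASK] through a process known as photosynthesis.")[0]["sequence"])
```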
docs/source/en/model_doc/gemma3.md (11 additions, 4 deletions)
@@ -24,9 +24,9 @@ rendered properly in your Markdown viewer.
# Gemma 3
-[Gemma 3](https://goo.gle/Gemma3Report) is a multimodal model, available in pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer, support for a longer context length of 128K tokens, and a [SigLip](./siglip) encoder that can "pan & scan" high-resolution images to prevent information in images from disappearing.
+[Gemma 3](https://goo.gle/Gemma3Report) is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer, support for a longer context length of 128K tokens, and a [SigLip](./siglip) encoder that can "pan & scan" high-resolution images to prevent information from disappearing in high resolution images or images with non-square aspect ratios.
-The instruction-tuned Gemma 3 model was post-trained with knowledge distillation and reinforcement learning.
+The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.
You can find all the original Gemma 3 checkpoints under the [Gemma 3](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) release.
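As a hedged sketch of running the instruction-tuned multimodal variant this hunk describes, the snippet below uses the `image-text-to-text` pipeline; the checkpoint name `google/gemma-3-4b-it` and the image URL are assumptions for illustration.

```py
from transformers import pipeline

# multimodal chat-style generation; checkpoint name assumed
pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
print(pipe(text=messages, max_new_tokens=32))
```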
-Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
```
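The diff context above only shows the import; a minimal completion of that snippet, assuming the same `google/gemma-3-4b-it` checkpoint and reusing the prompt visible in the hunk header below.

```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

# render which tokens each position can and cannot attend to (checkpoint name assumed)
visualizer = AttentionMaskVisualizer("google/gemma-3-4b-it")
visualizer("<img>What is shown in this image?")
```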
@@ -185,7 +192,7 @@ visualizer("<img>What is shown in this image?")
- Text passed to the processor should have a `<start_of_image>` token wherever an image should be inserted.
- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.
-- By default, the images aren't cropped and only the base image is forwarded to the model. In high resolution images or images with non-square aspect ratios, artifacts can result because the vision encoder uses a fixed resolution of 896x896. To prevent these artifacts and improve performance during inference, set `do_pan_and_scan=True` to crop the image into multiple smaller patches and concatenate them with the base image embedding. You can disable pan and scan for faster inference.
+- By default, images aren't cropped and only the base image is forwarded to the model. In high resolution images or images with non-square aspect ratios, artifacts can result because the vision encoder uses a fixed resolution of 896x896. To prevent these artifacts and improve performance during inference, set `do_pan_and_scan=True` to crop the image into multiple smaller patches and concatenate them with the base image embedding. You can disable pan and scan for faster inference.
-Use the [`~transformers.utils.AttentionMaskVisualizer`] utility to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
```
-Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
```
docs/source/en/model_doc/paligemma.md (3 additions, 3 deletions)
@@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.
# PaliGemma
-[PaliGemma](https://huggingface.co/papers/2407.07726) is a family of vision-language models (VLMs), combining [SigLIP](./siglip) with [Gemma 2](./gemma2), that is available in 3B, 10B, and 28B parameters. The main purpose of PaliGemma is to provide an adaptable base VLM that is easy to transfer to other tasks. The SigLIP vision encoder is a "shape optimized" contrastively pretrained [ViT](./vit) that converts an image into a sequence of tokens and prepended to an optional prompt. The Gemma 2B model is used as the decoder. PaliGemma uses full attention on all image and text tokens to maximize its capacity.
+[PaliGemma](https://huggingface.co/papers/2407.07726) is a family of vision-language models (VLMs), combining [SigLIP](./siglip) with the [Gemma](./gemma) 2B model. PaliGemma is available in 3B, 10B, and 28B parameters. The main purpose of PaliGemma is to provide an adaptable base VLM that is easy to transfer to other tasks. The SigLIP vision encoder is a "shape optimized" contrastively pretrained [ViT](./vit) that converts an image into a sequence of tokens and prepended to an optional prompt. The Gemma 2B model is used as the decoder. PaliGemma uses full attention on all image and text tokens to maximize its capacity.
[PaliGemma 2](https://huggingface.co/papers/2412.03555) improves on the first model by using Gemma 2 (2B, 9B, and 27B parameter variants) as the decoder. These are available as **pt** or **mix** variants. The **pt** checkpoints are intended for further fine-tuning and the **mix** checkpoints are ready for use out of the box.
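As a hedged sketch of the transfer tasks mentioned in this section, here is a minimal captioning call; the checkpoint name, image URL, and task-prefix prompt are assumptions for illustration.

```py
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-mix-224"  # a "mix" checkpoint is assumed here
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# PaliGemma expects short task-style prompts, e.g. "caption en"
inputs = processor(images=image, text="caption en", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))
```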
-Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
```
@@ -128,7 +128,7 @@ visualizer("<img> What is in this image?")
## Notes
- PaliGemma is not a conversational model and works best when fine-tuned for specific downstream tasks such as image captioning, visual question answering (VQA), object detection, and document understanding.
-- [`PaliGemmaProcessor`] can prepare images, text, and optional labels for the model. When fine-tuning PaliGemma, pass the `suffix` parameter to the processor to create labels for the model.
+- [`PaliGemmaProcessor`] can prepare images, text, and optional labels for the model. Pass the `suffix` parameter to the processor to create labels for the model during fine-tuning.
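A hedged sketch of the `suffix` behaviour described in the note above: passing `suffix` makes the processor return `labels` alongside the usual inputs. The checkpoint name, image URL, and caption text are assumptions.

```py
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")  # checkpoint assumed

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# `suffix` is tokenized as the target text, so the batch also contains `labels`
inputs = processor(
    images=image,
    text="caption en",
    suffix="A snow-covered cat sitting outdoors.",
    return_tensors="pt",
)
print(inputs.keys())
```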
docs/source/en/model_doc/vit.md (2 additions, 2 deletions)
@@ -26,7 +26,7 @@ rendered properly in your Markdown viewer.
# Vision Transformer (ViT)
-[Vision Transformer (ViT)](https://huggingface.co/papers/2010.11929) is a transformer adapted for computer vision tasks unlike traditional convolutional architectures. An image is split into smaller fixed-sized patches which are treated as a sequence of tokens, similar to words for NLP tasks. ViT requires less resources to pretrain compared to convolutional architectures and its performance on large datasets can be transferred to smaller downstream tasks.
+[Vision Transformer (ViT)](https://huggingface.co/papers/2010.11929) is a transformer adapted for computer vision tasks. An image is split into smaller fixed-sized patches which are treated as a sequence of tokens, similar to words for NLP tasks. ViT requires less resources to pretrain compared to convolutional architectures and its performance on large datasets can be transferred to smaller downstream tasks.
You can find all the original ViT checkpoints under the [Google](https://huggingface.co/google?search_models=vit) organization.
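As a hedged sketch of the patch-based classification workflow described above, using the `image-classification` pipeline; the checkpoint name is the one mentioned in the notes further down, while the image URL is an assumption.

```py
from transformers import pipeline

# patchify-and-classify in one call; checkpoint named in the notes below
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
preds = classifier("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
print(preds[0])
```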
@@ -88,7 +88,7 @@ print(f"The predicted class label is: {predicted_class_label}")
## Notes
-- The best results are obtained with supervised pretraining, and during fine-tuning, it can be better to use images with a resolution higher than 224x224.
+- The best results are obtained with supervised pretraining, and during fine-tuning, it may be better to use images with a resolution higher than 224x224.
- Use [`ViTImageProcessorFast`] to resize (or rescale) and normalize images to the expected size.
- The patch and image resolution are reflected in the checkpoint name. For example, google/vit-base-patch16-224, is the **base-sized** architecture with a patch resolution of 16x16 and fine-tuning resolution of 224x224.
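A hedged sketch of the `ViTImageProcessorFast` note above: the processor resizes (or rescales) and normalizes the image before classification. The checkpoint is the one named in the last note; the image URL is an assumption.

```py
import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessorFast

processor = ViTImageProcessorFast.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# resize/rescale and normalize to the 224x224 resolution the checkpoint expects
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```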