
Commit 7c93f33

small changes
1 parent e578a2e commit 7c93f33

6 files changed: +19 -12 lines changed


docs/source/en/model_doc/bert.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ rendered properly in your Markdown viewer.
 
 [BERT](https://huggingface.co/papers/1810.04805) is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding. BERT is also very versatile because its learned language representations can be adapted for other NLP tasks by fine-tuning an additional layer or head.
 
-You can find all the original BERT checkpoints under the BERT [collection](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc).
+You can find all the original BERT checkpoints under the [BERT](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc) collection.
 
 > [!TIP]
 > Click on the BERT models in the right sidebar for more examples of how to apply BERT to different language tasks.
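
For quick orientation, here is a minimal fill-mask sketch of the usage this page points readers to; the checkpoint ID and prompt are illustrative assumptions, not part of this diff:

```py
from transformers import pipeline

# Checkpoint assumed from the BERT collection linked above.
fill_mask = pipeline("fill-mask", model="google-bert/bert-base-uncased")
print(fill_mask("Plants create [MASK] through a process known as photosynthesis."))
```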

docs/source/en/model_doc/gemma3.md

Lines changed: 11 additions & 4 deletions
@@ -24,9 +24,9 @@ rendered properly in your Markdown viewer.
 
 # Gemma 3
 
-[Gemma 3](https://goo.gle/Gemma3Report) is a multimodal model, available in pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer, support for a longer context length of 128K tokens, and a [SigLip](./siglip) encoder that can "pan & scan" high-resolution images to prevent information in images from disappearing.
+[Gemma 3](https://goo.gle/Gemma3Report) is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer, support for a longer context length of 128K tokens, and a [SigLip](./siglip) encoder that can "pan & scan" high-resolution images to prevent information from disappearing in high resolution images or images with non-square aspect ratios.
 
-The instruction-tuned Gemma 3 model was post-trained with knowledge distillation and reinforcement learning.
+The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.
 
 You can find all the original Gemma 3 checkpoints under the [Gemma 3](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) release.
 
@@ -98,6 +98,13 @@ output = model.generate(**inputs, max_new_tokens=50, cache_implementation="stati
 print(processor.decode(output[0], skip_special_tokens=True))
 ```
 
+</hfoption>
+<hfoption id="transformers-cli">
+
+```bash
+echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model google/gemma-3-1b-pt --device 0
+```
+
 </hfoption>
 </hfoptions>
 
@@ -148,7 +155,7 @@ output = model.generate(**inputs, max_new_tokens=50, cache_implementation="stati
 print(processor.decode(output[0], skip_special_tokens=True))
 ```
 
-Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
 
 ```py
 from transformers.utils.attention_visualizer import AttentionMaskVisualizer
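
For context, the visualizer referenced in this hunk (and in the llama, llama2, and paligemma pages below) is typically driven as in the sketch below; the checkpoint ID is an illustrative assumption, while the prompt comes from the next hunk header:

```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

# Instantiate with a model ID, then call with a prompt to print the attention mask layout.
visualizer = AttentionMaskVisualizer("google/gemma-3-4b-it")
visualizer("<img>What is shown in this image?")
```
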
@@ -185,7 +192,7 @@ visualizer("<img>What is shown in this image?")
 ```
 - Text passed to the processor should have a `<start_of_image>` token wherever an image should be inserted.
 - The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.
-- By default, the images aren't cropped and only the base image is forwarded to the model. In high resolution images or images with non-square aspect ratios, artifacts can result because the vision encoder uses a fixed resolution of 896x896. To prevent these artifacts and improve performance during inference, set `do_pan_and_scan=True` to crop the image into multiple smaller patches and concatenate them with the base image embedding. You can disable pan and scan for faster inference.
+- By default, images aren't cropped and only the base image is forwarded to the model. In high resolution images or images with non-square aspect ratios, artifacts can result because the vision encoder uses a fixed resolution of 896x896. To prevent these artifacts and improve performance during inference, set `do_pan_and_scan=True` to crop the image into multiple smaller patches and concatenate them with the base image embedding. You can disable pan and scan for faster inference.
 
 ```diff
 inputs = processor.apply_chat_template(
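
The `do_pan_and_scan` note above, fleshed out as a minimal sketch; the checkpoint, message contents, image URL, and the exact `apply_chat_template` keyword arguments are assumptions rather than part of this diff:

```py
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")  # illustrative checkpoint
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image URL
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    do_pan_and_scan=True,  # crop into patches and concatenate with the base image embedding
)
```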

docs/source/en/model_doc/llama.md

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ output = model.generate(**input_ids, cache_implementation="static")
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-Use the [`~transformers.utils.AttentionMaskVisualizer`] utility to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
 
 ```py
 from transformers.utils.attention_visualizer import AttentionMaskVisualizer
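
The hunk header above references generation with a static KV cache; spelled out as a minimal sketch, with the checkpoint ID, dtype, and device placement as illustrative assumptions:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
# cache_implementation="static" pre-allocates the KV cache, as in the hunk header above.
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```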

docs/source/en/model_doc/llama2.md

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ output = model.generate(**input_ids, cache_implementation="static")
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
 
 ```py
 from transformers.utils.attention_visualizer import AttentionMaskVisualizer

docs/source/en/model_doc/paligemma.md

Lines changed: 3 additions & 3 deletions
@@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.
 
 # PaliGemma
 
-[PaliGemma](https://huggingface.co/papers/2407.07726) is a family of vision-language models (VLMs), combining [SigLIP](./siglip) with [Gemma 2](./gemma2), that is available in 3B, 10B, and 28B parameters. The main purpose of PaliGemma is to provide an adaptable base VLM that is easy to transfer to other tasks. The SigLIP vision encoder is a "shape optimized" contrastively pretrained [ViT](./vit) that converts an image into a sequence of tokens and prepended to an optional prompt. The Gemma 2B model is used as the decoder. PaliGemma uses full attention on all image and text tokens to maximize its capacity.
+[PaliGemma](https://huggingface.co/papers/2407.07726) is a family of vision-language models (VLMs), combining [SigLIP](./siglip) with the [Gemma](./gemma) 2B model. PaliGemma is available in 3B, 10B, and 28B parameters. The main purpose of PaliGemma is to provide an adaptable base VLM that is easy to transfer to other tasks. The SigLIP vision encoder is a "shape optimized" contrastively pretrained [ViT](./vit) that converts an image into a sequence of tokens and prepended to an optional prompt. The Gemma 2B model is used as the decoder. PaliGemma uses full attention on all image and text tokens to maximize its capacity.
 
 [PaliGemma 2](https://huggingface.co/papers/2412.03555) improves on the first model by using Gemma 2 (2B, 9B, and 27B parameter variants) as the decoder. These are available as **pt** or **mix** variants. The **pt** checkpoints are intended for further fine-tuning and the **mix** checkpoints are ready for use out of the box.
 
@@ -116,7 +116,7 @@ output = model.generate(**inputs, max_new_tokens=50, cache_implementation="stati
 print(processor.decode(output[0], skip_special_tokens=True))
 ```
 
-Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
 
 ```py
 from transformers.utils.attention_visualizer import AttentionMaskVisualizer
@@ -128,7 +128,7 @@ visualizer("<img> What is in this image?")
 ## Notes
 
 - PaliGemma is not a conversational model and works best when fine-tuned for specific downstream tasks such as image captioning, visual question answering (VQA), object detection, and document understanding.
-- [`PaliGemmaProcessor`] can prepare images, text, and optional labels for the model. When fine-tuning PaliGemma, pass the `suffix` parameter to the processor to create labels for the model.
+- [`PaliGemmaProcessor`] can prepare images, text, and optional labels for the model. Pass the `suffix` parameter to the processor to create labels for the model during fine-tuning.
 
 ```py
 prompt = "What is in this image?"
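
The `suffix` behaviour noted above, as a minimal sketch; the checkpoint ID, image URL, and caption text are illustrative assumptions, and the processor turns the suffix into training labels:

```py
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")  # illustrative checkpoint
image = Image.open(requests.get("https://example.com/cow.jpg", stream=True).raw)  # placeholder image URL

prompt = "What is in this image?"
inputs = processor(text=prompt, images=image, suffix="A cow standing on the beach", return_tensors="pt")
print(inputs["labels"])  # suffix tokens become the labels used during fine-tuning
```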

docs/source/en/model_doc/vit.md

Lines changed: 2 additions & 2 deletions
@@ -26,7 +26,7 @@ rendered properly in your Markdown viewer.
 
 # Vision Transformer (ViT)
 
-[Vision Transformer (ViT)](https://huggingface.co/papers/2010.11929) is a transformer adapted for computer vision tasks unlike traditional convolutional architectures. An image is split into smaller fixed-sized patches which are treated as a sequence of tokens, similar to words for NLP tasks. ViT requires less resources to pretrain compared to convolutional architectures and its performance on large datasets can be transferred to smaller downstream tasks.
+[Vision Transformer (ViT)](https://huggingface.co/papers/2010.11929) is a transformer adapted for computer vision tasks. An image is split into smaller fixed-sized patches which are treated as a sequence of tokens, similar to words for NLP tasks. ViT requires less resources to pretrain compared to convolutional architectures and its performance on large datasets can be transferred to smaller downstream tasks.
 
 You can find all the original ViT checkpoints under the [Google](https://huggingface.co/google?search_models=vit) organization.
 
@@ -88,7 +88,7 @@ print(f"The predicted class label is: {predicted_class_label}")
 
 ## Notes
 
-- The best results are obtained with supervised pretraining, and during fine-tuning, it can be better to use images with a resolution higher than 224x224.
+- The best results are obtained with supervised pretraining, and during fine-tuning, it may be better to use images with a resolution higher than 224x224.
 - Use [`ViTImageProcessorFast`] to resize (or rescale) and normalize images to the expected size.
 - The patch and image resolution are reflected in the checkpoint name. For example, google/vit-base-patch16-224, is the **base-sized** architecture with a patch resolution of 16x16 and fine-tuning resolution of 224x224.
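
The last two notes, put together as a minimal classification sketch; the image URL is a placeholder, while the checkpoint is the one named in the note above:

```py
import torch
import requests
from PIL import Image
from transformers import ViTImageProcessorFast, ViTForImageClassification

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder image URL

processor = ViTImageProcessorFast.from_pretrained("google/vit-base-patch16-224")  # resizes and normalizes
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_label = model.config.id2label[logits.argmax(-1).item()]
print(f"The predicted class label is: {predicted_class_label}")
```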
