
Commit 5fc3e60

NielsRogge authored and amyeroberts committed
[SigLIP] Don't pad by default (#28578)
First draft
1 parent: 5ee9fcb · commit 5fc3e60

3 files changed: 8 additions, 6 deletions


docs/source/en/model_doc/siglip.md

Lines changed: 3 additions & 2 deletions
@@ -28,7 +28,7 @@ The abstract from the paper is the following:

 - Usage of SigLIP is similar to [CLIP](clip). The main difference is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
 - Training is not yet supported. If you want to fine-tune SigLIP or train from scratch, refer to the loss function from [OpenCLIP](https://github.com/mlfoundations/open_clip/blob/73ad04ae7fb93ede1c02dc9040a828634cb1edf1/src/open_clip/loss.py#L307), which leverages various `torch.distributed` utilities.
-- When using the standalone [`SiglipTokenizer`], make sure to pass `padding="max_length"` as that's how the model was trained. The multimodal [`SiglipProcessor`] takes care of this behind the scenes.
+- When using the standalone [`SiglipTokenizer`] or [`SiglipProcessor`], make sure to pass `padding="max_length"` as that's how the model was trained.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip_table.jpeg"
 alt="drawing" width="600"/>
@@ -82,7 +82,8 @@ If you want to do the pre- and postprocessing yourself, here's how to do that:
 >>> image = Image.open(requests.get(url, stream=True).raw)

 >>> texts = ["a photo of 2 cats", "a photo of 2 dogs"]
->>> inputs = processor(text=texts, images=image, return_tensors="pt")
+>>> # important: we pass `padding=max_length` since the model was trained with this
+>>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

 >>> with torch.no_grad():
 ...     outputs = model(**inputs)
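The change above makes padding an explicit choice rather than an implicit default. As a minimal sketch of why batched text inputs need some padding strategy before tensor conversion (plain Python lists stand in for tensors here; `stack` is a hypothetical helper, not a transformers function):

```python
def stack(batch):
    """Stack sequences into a rectangular batch; fail if lengths differ,
    mimicking what tensor conversion does with ragged input."""
    lengths = {len(seq) for seq in batch}
    if len(lengths) != 1:
        raise ValueError(f"cannot stack ragged sequences with lengths {sorted(lengths)}")
    return batch

# With padding=False, texts of different token lengths stay ragged, so
# stack([[5, 6], [7, 8, 9]]) raises ValueError.
# After padding (here to length 4, with 0 as a stand-in pad id) all rows
# agree and stacking succeeds:
# stack([[5, 6, 0, 0], [7, 8, 9, 0]]) -> [[5, 6, 0, 0], [7, 8, 9, 0]]
```

This is why the doc example now passes `padding="max_length"` explicitly: it is both what makes batching possible and what matches how the model was trained.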

src/transformers/models/siglip/modeling_siglip.py

Lines changed: 2 additions & 1 deletion
@@ -1123,7 +1123,8 @@ def forward(
 >>> image = Image.open(requests.get(url, stream=True).raw)

 >>> texts = ["a photo of 2 cats", "a photo of 2 dogs"]
->>> inputs = processor(text=texts, images=image, return_tensors="pt")
+>>> # important: we pass `padding=max_length` since the model was trained with this
+>>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

 >>> with torch.no_grad():
 ...     outputs = model(**inputs)

src/transformers/models/siglip/processing_siglip.py

Lines changed: 3 additions & 3 deletions
@@ -50,9 +50,9 @@ def __call__(
         self,
         text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
         images: ImageInput = None,
-        padding: Union[bool, str, PaddingStrategy] = "max_length",
+        padding: Union[bool, str, PaddingStrategy] = False,
         truncation: Union[bool, str, TruncationStrategy] = None,
-        max_length=None,
+        max_length: int = None,
         return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
     ) -> BatchFeature:
         """
@@ -71,7 +71,7 @@ def __call__(
                 The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                 tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
                 number of channels, H and W are image height and width.
-            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `max_length`):
+            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
                 Select a strategy to pad the returned sequences (according to the model's padding side and padding
                 index) among:
                 - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
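The strategies the docstring enumerates can be sketched with a toy helper (`pad_batch` and `PAD_ID` are hypothetical illustrations, not the transformers implementation, which lives in the tokenizer base class):

```python
from typing import List, Optional, Union

PAD_ID = 0  # hypothetical padding token id

def pad_batch(
    batch: List[List[int]],
    padding: Union[bool, str],
    max_length: Optional[int] = None,
) -> List[List[int]]:
    """Right-pad tokenized sequences per a `padding` strategy:
    False -> no padding (the new default), True/'longest' -> pad to the
    longest sequence in the batch, 'max_length' -> pad to `max_length`."""
    if padding is False:
        return batch
    if padding is True or padding == "longest":
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    return [seq + [PAD_ID] * (target - len(seq)) for seq in batch]
```

With the old default, every call behaved like `padding="max_length"`; after this change, callers who want that training-time behavior must request it explicitly, and callers who pass nothing get their sequences back unpadded.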
