add initial design for uniform processors + align model #31197

Merged
merged 49 commits into main from uniform_processors_1 on Jun 13, 2024
Changes from all commits
Commits
49 commits
b85036f
add initial design for uniform processors + align model
molbap Jun 3, 2024
bb8ac70
fix mutable default :eyes:
molbap Jun 3, 2024
cd8c601
add configuration test
molbap Jun 3, 2024
f00c852
handle structured kwargs w defaults + add test
molbap Jun 3, 2024
693036f
protect torch-specific test
molbap Jun 3, 2024
766da3a
fix style
molbap Jun 3, 2024
844394d
fix
molbap Jun 3, 2024
c19bbc6
fix assertEqual
molbap Jun 4, 2024
3c38119
move kwargs merging to processing common
molbap Jun 4, 2024
81ae819
rework kwargs for type hinting
molbap Jun 5, 2024
ce4abcd
just get Unpack from extensions
molbap Jun 7, 2024
3acdf28
run-slow[align]
molbap Jun 7, 2024
404239f
handle kwargs passed as nested dict
molbap Jun 7, 2024
603be40
add from_pretrained test for nested kwargs handling
molbap Jun 7, 2024
71c9d6c
[run-slow]align
molbap Jun 7, 2024
26383c5
update documentation + imports
molbap Jun 7, 2024
4521f4f
update audio inputs
molbap Jun 7, 2024
b96eb64
protect audio types, silly
molbap Jun 7, 2024
9c5c01c
try removing imports
molbap Jun 7, 2024
3ccb505
make things simpler
molbap Jun 7, 2024
142acf3
simplerer
molbap Jun 7, 2024
60a5730
move out kwargs test to common mixin
molbap Jun 10, 2024
be6c141
[run-slow]align
molbap Jun 10, 2024
84135d7
skip tests for old processors
molbap Jun 10, 2024
ce967ac
[run-slow]align, clip
molbap Jun 10, 2024
f78ec52
!$#@!! protect imports, darn it
molbap Jun 10, 2024
52fd5ad
[run-slow]align, clip
molbap Jun 10, 2024
8f21abe
Merge branch 'main' into uniform_processors_1
molbap Jun 10, 2024
d510030
[run-slow]align, clip
molbap Jun 10, 2024
fd43bcd
update doc
molbap Jun 11, 2024
b2cd7c9
improve documentation for default values
molbap Jun 11, 2024
bcbd646
add model_max_length testing
molbap Jun 11, 2024
39c1587
Raise if kwargs are specified in two places
molbap Jun 11, 2024
1f73bdf
fix
molbap Jun 11, 2024
b3f98ba
Merge branch 'main' into uniform_processors_1
molbap Jun 11, 2024
e4d6d12
expand VideoInput
molbap Jun 12, 2024
1e09e4a
fix
molbap Jun 12, 2024
d4232f0
fix style
molbap Jun 12, 2024
162b1a7
remove defaults values
molbap Jun 12, 2024
0da1dc3
add comment to indicate documentation on adding kwargs
molbap Jun 12, 2024
f955510
Merge branch 'main' into uniform_processors_1
molbap Jun 12, 2024
f6f1dac
protect imports
molbap Jun 12, 2024
c4b7e84
[run-slow]align
molbap Jun 12, 2024
3ce3608
fix
molbap Jun 12, 2024
6b83e39
remove set() that breaks ordering
molbap Jun 13, 2024
3818b86
test more
molbap Jun 13, 2024
31b7a60
removed unused func
molbap Jun 13, 2024
4072336
[run-slow]align
molbap Jun 13, 2024
bcce007
Merge branch 'main' into uniform_processors_1
molbap Jun 13, 2024
11 changes: 10 additions & 1 deletion src/transformers/image_utils.py
@@ -81,7 +81,16 @@
] # noqa


VideoInput = Union[np.ndarray, "torch.Tensor", List[np.ndarray], List["torch.Tensor"]] # noqa
VideoInput = Union[
List["PIL.Image.Image"],
"np.ndarray",
"torch.Tensor",
List["np.ndarray"],
List["torch.Tensor"],
List[List["PIL.Image.Image"]],
List[List["np.ndarrray"]],
List[List["torch.Tensor"]],
] # noqa


class ChannelDimension(ExplicitEnum):
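As an aside on the expanded `VideoInput` union above: it now also covers a video given as a list of frame lists. A minimal, hypothetical sketch of constructing such inputs (the frame count and size are arbitrary illustrative values, not anything the PR prescribes):

```python
import numpy as np
from PIL import Image

# One "video" as a list of PIL frames; any other member of the VideoInput union
# (a single np.ndarray, a torch.Tensor, etc.) would be equally valid.
frames = [Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8)) for _ in range(8)]

single_video = frames            # List[PIL.Image.Image]
video_batch = [frames, frames]   # List[List[PIL.Image.Image]] -- newly covered by the union
```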
90 changes: 67 additions & 23 deletions src/transformers/models/align/processing_align.py
@@ -16,8 +16,30 @@
Image/Text processor class for ALIGN
"""

from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding
from typing import List, Union


try:
from typing import Unpack
except ImportError:
from typing_extensions import Unpack

from ...image_utils import ImageInput
from ...processing_utils import (
ProcessingKwargs,
ProcessorMixin,
)
from ...tokenization_utils_base import BatchEncoding, PreTokenizedInput, TextInput


class AlignProcessorKwargs(ProcessingKwargs, total=False):
# see processing_utils.ProcessingKwargs documentation for usage.
_defaults = {
Collaborator:

Suggested change:

    _defaults = {
        padding: "max_length"
        max_lenght: 64

should work, no? Or does it not update the default for type hints?

Contributor Author:

Yes, it works for sure; this was to have a structured dict for defaults. Can change :)

Contributor Author:

Ah, now I remember, it actually can't work like that, since TypedDicts don't support default values; they are made to hold the layout. They can have any attributes, but that won't pass a value as a default (a dataclass would, but then we'd lose the typing), hence the manual operation.

Collaborator:

Ok, got it, thanks! Let's maybe comment about this!

Collaborator:

Do we have a comment for future code inspectors? I'm assuming here isn't the best place (we don't want it for all models), but I didn't find a corresponding one elsewhere on a quick skim.

Contributor Author:

On that: there's documentation in processing_utils.ProcessingKwargs, and I added a comment nudging users to check there!

"text_kwargs": {
"padding": "max_length",
"max_length": 64,
},
}
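A minimal sketch of the point made in the review thread above, using simplified stand-in names (this is not the library's `_merge_kwargs`): a TypedDict only declares a layout, so values written in its class body never become runtime defaults the way dataclass fields do, which is why the defaults sit in a separate `_defaults` mapping and are merged in manually.

```python
from typing import TypedDict


class TextKwargs(TypedDict, total=False):
    padding: str
    max_length: int


# "Instantiating" a TypedDict just builds a plain dict: keys you don't pass are simply
# absent, so there is nowhere for a per-key default to come from.
tk = TextKwargs(padding="longest")
print(tk)                  # {'padding': 'longest'}
print("max_length" in tk)  # False

# Hence a separate defaults mapping, merged manually (caller kwargs win).
_text_defaults = {"padding": "max_length", "max_length": 64}


def merge_text_kwargs(**caller_kwargs) -> dict:
    return {**_text_defaults, **caller_kwargs}


print(merge_text_kwargs(padding="do_not_pad"))
# {'padding': 'do_not_pad', 'max_length': 64}
```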


class AlignProcessor(ProcessorMixin):
@@ -26,12 +48,28 @@ class AlignProcessor(ProcessorMixin):
[`BertTokenizer`]/[`BertTokenizerFast`] into a single processor that inherits both the image processor and
tokenizer functionalities. See the [`~AlignProcessor.__call__`] and [`~AlignProcessor.decode`] for more
information.
The preferred way to pass kwargs is as a dictionary per modality; see the usage example below.
```python
from transformers import AlignProcessor
from PIL import Image
model_id = "kakaobrain/align-base"
processor = AlignProcessor.from_pretrained(model_id)

processor(
images=your_pil_image,
text=["What is that?"],
images_kwargs = {"crop_size": {"height": 224, "width": 224}},
text_kwargs = {"padding": "do_not_pad"},
common_kwargs = {"return_tensors": "pt"},
)
```

Args:
image_processor ([`EfficientNetImageProcessor`]):
The image processor is a required input.
tokenizer ([`BertTokenizer`, `BertTokenizerFast`]):
The tokenizer is a required input.

"""

attributes = ["image_processor", "tokenizer"]
@@ -41,11 +79,18 @@ class AlignProcessor(ProcessorMixin):
def __init__(self, image_processor, tokenizer):
super().__init__(image_processor, tokenizer)

def __call__(self, text=None, images=None, padding="max_length", max_length=64, return_tensors=None, **kwargs):
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
images: ImageInput = None,
audio=None,
videos=None,
**kwargs: Unpack[AlignProcessorKwargs],
) -> BatchEncoding:
"""
Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text`
and `kwargs` arguments to BertTokenizerFast's [`~BertTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
arguments to BertTokenizerFast's [`~BertTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` arguments to
EfficientNetImageProcessor's [`~EfficientNetImageProcessor.__call__`] if `images` is not `None`. Please refer
to the docstring of the above two methods for more information.

@@ -57,20 +102,12 @@ def __call__(self, text=None, images=None, padding="max_length", max_length=64,
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `max_length`):
Activates and controls padding for tokenization of input text. Choose between [`True` or `'longest'`,
`'max_length'`, `False` or `'do_not_pad'`]
max_length (`int`, *optional*, defaults to `max_length`):
Maximum padding value to use to pad the input text during tokenization.

return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:

- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.

- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

@@ -81,15 +118,22 @@ def __call__(self, text=None, images=None, padding="max_length", max_length=64,
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
if text is None and images is None:
raise ValueError("You have to specify either text or images. Both cannot be none.")

raise ValueError("You must specify either text or images.")
output_kwargs = self._merge_kwargs(
AlignProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
# then, we can pass correct kwargs to each processor
if text is not None:
encoding = self.tokenizer(
text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
)
encoding = self.tokenizer(text, **output_kwargs["text_kwargs"])

if images is not None:
image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)
image_features = self.image_processor(images, **output_kwargs["images_kwargs"])

# BC for explicit return_tensors
if "return_tensors" in output_kwargs["common_kwargs"]:
return_tensors = output_kwargs["common_kwargs"].pop("return_tensors", None)

if text is not None and images is not None:
encoding["pixel_values"] = image_features.pixel_values
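To make the `_merge_kwargs` call in the diff above easier to follow, here is a simplified, hypothetical sketch of the merge order it implements: class-level `_defaults`, then tokenizer init kwargs, then kwargs passed at call time. This is an illustration only, not the actual `ProcessorMixin._merge_kwargs`; per the "Raise if kwargs are specified in two places" commit, the real method also errors out when a kwarg is given both at the top level and inside a modality dict.

```python
from typing import Any, Dict


def merge_kwargs_sketch(
    defaults: Dict[str, Dict[str, Any]],
    tokenizer_init_kwargs: Dict[str, Any],
    **kwargs,
) -> Dict[str, Dict[str, Any]]:
    # Hypothetical stand-in for ProcessorMixin._merge_kwargs, for illustration only.
    # Priority, lowest to highest: _defaults -> tokenizer init kwargs -> call-time kwargs.
    output = {k: dict(v) for k, v in defaults.items()}
    output.setdefault("common_kwargs", {})

    # Tokenizer init kwargs can refine the text defaults (illustrative subset of keys).
    for key in ("model_max_length", "padding"):
        if key in tokenizer_init_kwargs:
            output["text_kwargs"][key] = tokenizer_init_kwargs[key]

    # Structured, per-modality dicts passed by the caller win over everything above...
    for modality in ("text_kwargs", "images_kwargs", "common_kwargs"):
        output[modality].update(kwargs.pop(modality, {}))
    # ...and any remaining flat kwargs are routed to the text kwargs in this sketch.
    output["text_kwargs"].update(kwargs)
    return output


merged = merge_kwargs_sketch(
    {"text_kwargs": {"padding": "max_length", "max_length": 64}, "images_kwargs": {}},
    tokenizer_init_kwargs={},
    text_kwargs={"padding": "do_not_pad"},
    common_kwargs={"return_tensors": "pt"},
)
print(merged["text_kwargs"])    # {'padding': 'do_not_pad', 'max_length': 64}
print(merged["common_kwargs"])  # {'return_tensors': 'pt'}
```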