### Feature request
This issue will track general plans for VLMs and composite models so that we can align with work in TGI and other libraries. I already have a few trackers, so here I'll lay out the bigger picture with links to the respective discussions/topics.
### Motivation
We already have pretty well-established standards when it comes to language models, and when adding a new model a few "Copied from" statements usually do the work. We also cover most cases for LMs in our test suite. But for the wave of multimodal models we still lack any form of standardization and a uniform API. Each new model added to the library introduces something new that forces us to accept it as-is until we figure out how to handle it later.
So we need to standardize these models, starting with VLMs. VLMs are currently the most commonly added models, but we may have more audio+text or purely multimodal ones in the future. For now we start by working on VLMs and see how things fit into the general API.
### Your contribution
The major changes we are working on or planning to work on are:
- Standardization for Processors:
  - We have ongoing work on uniform processor kwargs, which will let us enable pipelines for VLMs and thus have the correct automodel tag on the Hub. The work is in progress, led by @yonigozlan and @molbap (a rough sketch of the idea follows below this block).
  - In parallel, I will work on separating video processing out into a new class (`VideoProcessor`) and handling the rather long deprecation cycle for the processing config files. At the end we should have a separate file and a separate class for video processing, with its params saved in their own config file. This is tracked in Video Processor as a separate class #33504, with discussions with Amy in the issue linked there.
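To illustrate the uniform-kwargs idea, here is a minimal sketch assuming a `TypedDict`-based grouping of per-modality kwargs; the names (`TextKwargs`, `ImagesKwargs`, `ProcessingKwargs`) are illustrative, not necessarily the final API:

```python
# Minimal sketch of uniform processor kwargs (illustrative names, not the final API).
# The goal: every multimodal processor accepts the same kwarg groups, so pipelines
# can call any of them identically.
from typing import TypedDict


class TextKwargs(TypedDict, total=False):
    padding: bool
    truncation: bool
    max_length: int


class ImagesKwargs(TypedDict, total=False):
    do_resize: bool
    size: dict


class ProcessingKwargs(TypedDict, total=False):
    text_kwargs: TextKwargs
    images_kwargs: ImagesKwargs
    return_tensors: str


# With this in place, a pipeline can call any VLM processor the same way, e.g.:
#     processor(text=prompt, images=image, padding=True, return_tensors="pt")
```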
- Standardization in terms of modeling code:
  - One major task was to get rid of the buggy `merge_embeds` method and cover VLMs with more generation-related tests, as we were getting many issues after even small changes. Slow tests unfortunately don't cover everything and are not run every time a PR is merged. That is being tracked in Track progress for VLMs refactoring #33374.
  - Another major topic is setting the attention implementation for composite models (not only VLMs), which will fix the red CI and add uniformity to how we work with composite models in general. After that PR, each composite model should be required to have a separate `PretrainedConfig` for each backbone in its architecture, and each sub-config should be part of one main model config, which may hold attributes specific to the composite model only (not its sub-backbones). See Attn implementation for composite models #32238; a rough sketch of the intended config layout follows below.
  - Separate out a `get_image_features` method for all VLMs so we have more modularity and probably much cleaner code. This was proposed by a community contributor, and I'll handle propagating the change to all models. See Refactor image features selection in LlaVa #33696; a sketch is shown below as well.
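As a rough illustration of the composite-config idea (the names `MyVlmConfig`, `MyVisionConfig` and `MyTextConfig` are hypothetical, and the exact mechanics of `attn_implementation` propagation are still being designed in #32238):

```python
# Illustrative sketch only: one sub-config per backbone, nested in the composite config.
from transformers import PretrainedConfig


class MyVisionConfig(PretrainedConfig):
    model_type = "my_vlm_vision"

    def __init__(self, hidden_size=1024, num_hidden_layers=24, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers


class MyTextConfig(PretrainedConfig):
    model_type = "my_vlm_text"

    def __init__(self, hidden_size=4096, vocab_size=32000, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size


class MyVlmConfig(PretrainedConfig):
    model_type = "my_vlm"

    def __init__(self, vision_config=None, text_config=None, image_token_index=32001, **kwargs):
        super().__init__(**kwargs)
        # Each backbone gets its own PretrainedConfig; the top-level config only holds
        # attributes that belong to the composite model itself (e.g. the image token id).
        self.vision_config = MyVisionConfig(**(vision_config or {}))
        self.text_config = MyTextConfig(**(text_config or {}))
        self.image_token_index = image_token_index


config = MyVlmConfig()
# The goal of #32238 is that setting the attention implementation once on the composite
# model propagates consistently to every sub-backbone (vision tower and language model).
```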
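And a minimal sketch of what a factored-out `get_image_features` could look like (the class and attribute names below are hypothetical; the actual refactor is discussed in #33696):

```python
# Hypothetical VLM skeleton, for illustration only: image-feature extraction lives in
# its own method instead of being inlined in forward().
import torch
from torch import nn


class MyVlmModel(nn.Module):
    def __init__(self, vision_tower: nn.Module, projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower          # e.g. a CLIP/SigLIP vision encoder
        self.multi_modal_projector = projector    # maps vision hidden size -> LM hidden size
        self.language_model = language_model

    def get_image_features(self, pixel_values: torch.Tensor, vision_feature_layer: int = -2) -> torch.Tensor:
        # Run the vision backbone, select one hidden-state layer, then project the
        # patch embeddings into the language model's embedding space.
        vision_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
        selected_features = vision_outputs.hidden_states[vision_feature_layer]
        return self.multi_modal_projector(selected_features)

    def forward(self, input_ids, pixel_values=None, **kwargs):
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        if pixel_values is not None:
            image_features = self.get_image_features(pixel_values)
            # ...merge image_features into inputs_embeds at the image-token positions...
        return self.language_model(inputs_embeds=inputs_embeds, **kwargs)
```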
- Standardization for chat templates:
  - We can support `tokenize=True, return_tensors="pt"` kwargs in the processor's `apply_chat_template`, so that the method returns already-vectorized outputs. Similar to tokenizers, the main point is to feed in a chat history and get tensor inputs ready for generation/training. The only difference is that users will have to explicitly add an image file/URL or `ImageInput` so we can process it internally and turn it into `pixel_values`. Below is the general design, followed by a rough usage sketch. No work has started yet; I am planning to make a PR some time in October.
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"url": "https://...."}},
            {"type": "text", "text": "What do you see here?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Stop sign [...]"},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"path": "my_image.png"}},
            {"type": "text", "text": "What color is the cat?"},
        ],
    },
]
```
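And a rough usage sketch of the proposed behavior, using the `messages` defined above (not implemented yet; the checkpoint name is just an example and the exact returned keys may differ):

```python
# Hypothetical usage of the proposed API: apply_chat_template renders the template,
# tokenizes the text and processes the images in a single call.
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)
# `inputs` would already hold input_ids, attention_mask and pixel_values
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```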
- Standardization for tokenizers:
  - We can have new special tokens added to tokenizers if they are loaded from a VLM model repo. Currently I plan to add at least 3 new special tokens (image, boi and eoi), but given the wave of new models I might expand that list. I had a PR previously, but that was a very basic design (Make special image tokens attribute of tokenizer #31967). I am currently working on making `SpecialTokensMixin` more flexible so that we can simply change the class attribute `SPECIAL_TOKENS_ATTRIBUTES` and everything else works out-of-the-box. That seems to me the easiest way to expand special tokens for multimodal cases without flooding plain language-model tokenizers. A small illustration is given below.
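For illustration, this is roughly what the user-facing effect could look like (the token strings `<image>`, `<boi>`, `<eoi>` and the attribute names are placeholders; today this relies on `additional_special_tokens` rather than dedicated attributes):

```python
# Sketch of the intended user-facing behavior, not the final design.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Today, multimodal tokens are usually registered ad hoc like this:
tokenizer.add_special_tokens({"additional_special_tokens": ["<image>", "<boi>", "<eoi>"]})
print(tokenizer.convert_tokens_to_ids("<image>"))

# The proposal is for them to become first-class attributes (similar to bos_token /
# eos_token) once SPECIAL_TOKENS_ATTRIBUTES can be extended per modality, e.g.:
#     tokenizer.image_token, tokenizer.boi_token, tokenizer.eoi_token
```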