Description
This issue tracks progress on improving the handling and testing of Vision-Language Models (VLMs). The main goals are to enhance/enable generation tests, handle other generation techniques such as assisted decoding, and ensure all models pass CI checks.
I have already started working on this and merged/opened some PRs. This issue should help us track how much is left until VLMs are standardized from the modeling-code perspective.
- Enable Generation Tests for VLMs
  - Merged a PR to calculate and expand text with "image" tokens during processing. VLMs currently add only one placeholder token per visual, and the inputs are expanded to the actual length of the image embeddings only in the modeling phase. That approach limits the functionality of `generate()`, especially for enabling other cache formats and `torch.compile`, and introduces hidden bugs (Expand inputs in processors for VLMs #30962). A rough sketch of processing-time expansion follows this list.
  - Verify that the addition of `processor_config.json` on the Hub does not break existing functionality. Related discussion on Slack: https://huggingface.slack.com/archives/C01N44FJDHT/p171957701917237. TL;DR: we can't avoid breaking BC, but we still want the feature as it has many benefits, so we'll try again and hope that users no longer rely on the old version.
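
For context, here is a minimal sketch of what processing-time expansion does; the function name, placeholder string, and numbers are illustrative assumptions, not the actual transformers implementation:

```python
# Illustrative sketch: expand a single image placeholder into as many copies as the
# vision backbone will produce embeddings for, so `generate()` already sees the
# final sequence length at processing time (names and numbers are assumptions).
def expand_image_tokens(text: str, num_image_tokens: int, image_token: str = "<image>") -> str:
    return text.replace(image_token, image_token * num_image_tokens)

# Example: a ViT backbone with 336px inputs and 14px patches yields (336 // 14) ** 2 = 576 tokens.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
expanded_prompt = expand_image_tokens(prompt, num_image_tokens=576)
```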
- Fix Failing Edge Cases in Current VLMs
  - Identified edge cases involving multi-image inputs and cache position preparation after merging the above PR (VLM: fixes after refactor #32907)
  - Introduce a `num_image_tokens` attribute for specifying the image sequence length. It ensures the text is expanded to the correct length for the given image backbone; otherwise we currently can't use the same processing class for different image backbones (VLMs: `patch_size` -> `num_image_tokens` in processing #33424). See the toy sketch after this list.
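
As a hedged illustration of why the attribute helps (the class and argument names below are hypothetical, not the real processing API):

```python
# Hypothetical toy processor: storing `num_image_tokens` directly decouples text
# expansion from any single backbone's patch size, so one processing class can
# serve different image backbones.
class ToyVLMProcessor:
    def __init__(self, num_image_tokens: int = 576, image_token: str = "<image>"):
        self.num_image_tokens = num_image_tokens
        self.image_token = image_token

    def expand(self, text: str) -> str:
        return text.replace(self.image_token, self.image_token * self.num_image_tokens)

# A backbone with a different resolution or patch size just passes its own value,
# e.g. ToyVLMProcessor(num_image_tokens=729), without touching the class itself.
```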
- Add Generation Tests to VLM Classes
  - Already added in LLaVA-Onevision and Qwen2-VL (Llava Onevision: add model #32673, Qwen2-VL: clean-up and add more tests #33354)
  - Implement `GenerationTesterMixin` tests that take both image and text inputs; the current tests accept only text. Enable for all models except BLIP (draft available locally). A sketch of such a test helper follows this list.
  - Add tests for Idefics models and fix Mllama tests, which are a bit different from the llava style (Idefics: enable generation tests #34062)
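
A rough sketch of a helper such a test could call; the argument names assume a llava-style `generate()` signature and this is not the final mixin code:

```python
import torch

# Illustrative helper (not the actual GenerationTesterMixin code): run a short greedy
# generation with both text and image inputs and sanity-check the output shape.
def check_multimodal_greedy_generate(model, input_ids, pixel_values, attention_mask=None):
    model.eval()
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=input_ids,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
            do_sample=False,
            max_new_tokens=5,
        )
    # The batch dimension is preserved and the sequence grows past the prompt.
    assert output_ids.shape[0] == input_ids.shape[0]
    assert output_ids.shape[1] > input_ids.shape[1]
    return output_ids
```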
- Special Case for BLIP
  - Create a PR to adapt the testing suite to BLIP's `main_input_name`, which is not `input_ids` as in other models but `pixel_values`. Check that we don't cause a red CI if we rely on the model's `main_input_name` for tests (related to or fixed by Generate tests: modality-agnostic input preparation #33685). A sketch of input preparation keyed on `main_input_name` follows this list.
  - Optionally remove BLIP's custom generation logic and enable generation tests; that should also help us get rid of the extra hacks for handling maximum length or the `BOS` token in modeling code (BLIP: enable generation tests #34174)
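
A minimal sketch of modality-agnostic input preparation keyed on the model's `main_input_name` attribute (the helper name is made up for illustration):

```python
# Pick the generation "driver" input from the model's declared main_input_name instead
# of hard-coding input_ids: this stays `input_ids` for most VLMs but becomes
# `pixel_values` for BLIP-style models.
def split_main_input(model, processed_inputs: dict):
    main_input_name = getattr(model, "main_input_name", "input_ids")
    main_input = processed_inputs[main_input_name]
    other_inputs = {k: v for k, v in processed_inputs.items() if k != main_input_name}
    return main_input_name, main_input, other_inputs
```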
- Finalizing CI for VLMs
  - Resolve `attn_implementation`-related failures to make CI fully happy for VLMs (Attn implementation for composite models #32238). A small local-repro sketch follows this list.
  - Ensure all VLMs pass all CI checks, including slow tests. If there are failures, identify the reason and fix them (most probably a failure is related to the torch version, but this needs double-checking)
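
A hedged sketch for reproducing the attention-implementation setup locally; the string form of `attn_implementation` is standard, while per-submodule selection is what the linked PR works toward, so treat anything beyond the plain string as an assumption:

```python
from transformers import LlavaForConditionalGeneration

# Load a composite VLM with an explicit attention implementation; "sdpa" is a
# standard value. The linked PR aims to let composite models resolve this per
# sub-model (e.g. different backends for the vision tower and the language model).
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    attn_implementation="sdpa",
)
print(model.config._attn_implementation)  # resolved implementation stored on the config
```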
Motivation
Your contribution