Correct check for SDPA in Vision Language Models #30565

@zucchini-nlp

System Info

In the current implementation of VLMs, the `_supports_sdpa` attribute checks and activates SDPA attention only for the language model, for example in Llava.

It should also check for SDPA support in the vision tower and, if available, use it there as well.
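
For reference, this is roughly how a user requests SDPA at load time (the checkpoint name is only an illustrative example); under the current behavior described above, only the language model's `_supports_sdpa` flag is consulted for Llava-style models:

```python
from transformers import LlavaForConditionalGeneration

# Request SDPA attention at load time. For Llava-style composite models, the
# current check only consults the language model's _supports_sdpa flag,
# not the vision tower's.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # example checkpoint
    attn_implementation="sdpa",
)
```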

For composite models where one part supports SDPA but the other does not, we can raise a warning and activate SDPA only for the supported part. That way the user knows what is happening in the background; see the sketch below.
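
A minimal sketch of what such a per-part check could look like. The helper name `resolve_attn_implementation`, the sub-module names, and the eager fallback are assumptions for illustration, not the actual Transformers implementation:

```python
import logging

logger = logging.getLogger(__name__)


def resolve_attn_implementation(model, requested="sdpa"):
    """Decide the attention implementation per sub-model of a composite VLM,
    instead of relying only on the language model's _supports_sdpa flag."""
    # Llava-style models expose `language_model` and `vision_tower`;
    # other composite models may name their parts differently.
    parts = {
        "language_model": getattr(model, "language_model", None),
        "vision_tower": getattr(model, "vision_tower", None),
    }
    resolved = {}
    for name, part in parts.items():
        if part is None:
            continue
        supports_sdpa = getattr(part, "_supports_sdpa", False)
        if requested == "sdpa" and not supports_sdpa:
            # Warn so the user knows only part of the model runs with SDPA.
            logger.warning(
                "%s does not support SDPA; falling back to eager attention for "
                "this part while keeping SDPA for the parts that support it.",
                name,
            )
            resolved[name] = "eager"
        else:
            resolved[name] = requested
    return resolved
```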

Verified models

  • BLIP-2
  • InstructBLIP
  • InstructBLIPVideo
  • KOSMOS-2
  • LLaVa
  • LLaVa-NeXT
  • LLaVa-NeXT-Video
  • VipLLaVa
  • Video-LLaVa
  • Idefics
  • Idefics2
  • PaliGemma

Labels

Should Fix (identified as a bug and should be fixed), Vision, WIP (work in progress)