System Info
Problem
The HF-converted model `LanguageBind/Video-LLaVA-7B-hf` has two critical problems in its video tower:
- Missing `temporal_attn` layers: The original `LanguageBind/Video-LLaVA-7B` video tower contains per-layer temporal attention for cross-frame reasoning. These layers are completely absent in the `-hf` version.
- `video_tower` and `image_tower` have nearly identical weights: Only 3 out of ~300 parameter tensors differ between the two towers. This should not be the case: the original model uses separately pretrained LanguageBind-Video and LanguageBind-Image encoders with distinct weights.
Evidence
1. Original model has temporal attention in the video tower
In `LanguageBind/Video-LLaVA-7B` (`model.safetensors.index.json`), the video tower contains `temporal_attn` layers in every encoder block:
```
model.video_tower.video_tower.encoder.layers.X.temporal_attn.k_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.v_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.q_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.out_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_layer_norm.weight
model.video_tower.video_tower.encoder.layers.X.temporal_layer_norm.bias
```
The `-hf` version uses `CLIPVisionModel` for both towers, so no temporal attention exists.
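For reference, the presence of these keys can be confirmed directly from the original checkpoint's index file (a minimal sketch using `huggingface_hub`; repo and filename as cited above):
```python
import json
from huggingface_hub import hf_hub_download

# Download only the safetensors index of the original (non-hf) checkpoint.
index_path = hf_hub_download(
    repo_id="LanguageBind/Video-LLaVA-7B",
    filename="model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Count the per-layer temporal attention tensors in the original video tower.
temporal_keys = [k for k in weight_map if "temporal_attn" in k or "temporal_layer_norm" in k]
print(f"{len(temporal_keys)} temporal parameters in the original video tower")
```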
2. Weight comparison: `video_tower` ≈ `image_tower`
```python
import torch

video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())

same, diff = [], []
for name in video_params:
    if name in image_params:
        if torch.equal(video_params[name], image_params[name]):
            same.append(name)
        else:
            diff.append(name)

print(f"Same: {len(same)}, Different: {len(diff)}")
```
Result:
- Different (only 3): `vision_model.embeddings.class_embedding`, `vision_model.post_layernorm.weight`, `vision_model.post_layernorm.bias`
- Same: all remaining ~300 parameters
This means the video tower and image tower are effectively the same model. In the original Video-LLaVA-7B, these towers should have substantially different weights because they were pretrained separately (LanguageBind-Video on video-text pairs, LanguageBind-Image on image-text pairs).
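As an additional diagnostic (a minimal sketch, not part of the original comparison; `model` is the `-hf` checkpoint loaded as in the Reproduction section below), one can inspect how far apart the three non-identical tensors actually are:
```python
import torch

video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())

# Inspect the magnitude of the differences in the few non-identical tensors.
for name in video_params:
    if name in image_params and not torch.equal(video_params[name], image_params[name]):
        delta = (video_params[name].float() - image_params[name].float()).abs()
        print(f"{name}: max |diff| = {delta.max().item():.4e}, mean |diff| = {delta.mean().item():.4e}")
```
If these three tensors differ only marginally as well, that would further suggest both towers were derived from the same set of weights.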
Impact
- The `-hf` model is not a faithful conversion of the original Video-LLaVA.
- Users loading the model via `VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")` are running inference with what is essentially two copies of the same image encoder and no temporal modeling.
- Benchmark results from the `-hf` version do not reflect the actual Video-LLaVA architecture or the performance described in the paper.
- Follow-up research using this model as a baseline is comparing against a degraded, incorrectly converted model.
cc @zucchini-nlp
Could you clarify how `Video-LLaVA-7B-hf` was converted? Specifically:
- How were the video tower weights handled during conversion, given that the original model contains `temporal_attn` layers that don't exist in `CLIPVisionModel`?
- Were the LanguageBind-Video weights intentionally mapped to a standard CLIP architecture, or were the image tower weights duplicated into the video tower?
- Were the temporal attention weights simply discarded, or was there a merging/distillation step?
The weight comparison suggests the video tower may have received the same (or nearly the same) weights as the image tower, which would mean the conversion lost both the architectural differences and the distinct pretrained representations.
Environment
- transformers version: 4.46
- Model: `LanguageBind/Video-LLaVA-7B-hf`
Who can help?
@zucchini-nlp
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```python
import torch
from transformers import VideoLlavaForConditionalGeneration

model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf",
    torch_dtype=torch.float16,
)

# Architecture check: both are plain CLIPVisionModel
print(model.video_tower)
print(model.image_tower)

# Weight check
video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())
diff = [name for name in video_params
        if name in image_params and not torch.equal(video_params[name], image_params[name])]

print(f"Only {len(diff)} parameters differ:")
for n in diff:
    print(f"  {n}")

# Expected: most weights should differ
# Actual: only 3 differ
```
Expected Behavior
The `-hf` conversion should:
- Include temporal attention layers in the video tower, matching the original architecture
- Load the correct LanguageBind-Video pretrained weights (distinct from LanguageBind-Image)
Or, at minimum, the model card should clearly state that this is not equivalent to the original model.
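If the conversion is fixed, a check along these lines could verify both points (a hypothetical sketch; the exact module names depend on how the temporal attention ends up being implemented in transformers):
```python
# Hypothetical post-fix verification (module names are assumptions):
# 1) the video tower should expose temporal attention submodules,
# 2) its weights should diverge from the image tower.
temporal_modules = [n for n, _ in model.video_tower.named_modules() if "temporal" in n]
assert temporal_modules, "video tower exposes no temporal attention layers"

video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())
identical = [n for n in video_params
             if n in image_params and torch.equal(video_params[n], image_params[n])]
# Separately pretrained LanguageBind-Video / LanguageBind-Image encoders
# should not share the vast majority of their weights.
assert len(identical) < len(video_params) // 2, "towers still look like copies"
```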