System Info
Problem
The HF-converted model `LanguageBind/Video-LLaVA-7B-hf` has two critical problems in its video tower:
- Missing `temporal_attn` layers: The original `LanguageBind/Video-LLaVA-7B` video tower contains per-layer temporal attention for cross-frame reasoning. These layers are completely absent in the `-hf` version.
- `video_tower` and `image_tower` have nearly identical weights: Only 3 out of ~300 parameter tensors differ between the two towers. This should not be the case: the original model uses separately pretrained LanguageBind-Video and LanguageBind-Image encoders with distinct weights.
Evidence
1. Original model has temporal attention in the video tower
In `LanguageBind/Video-LLaVA-7B` (`model.safetensors.index.json`), the video tower contains `temporal_attn` layers in every encoder block:
```
model.video_tower.video_tower.encoder.layers.X.temporal_attn.k_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.v_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.q_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.out_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_layer_norm.weight
model.video_tower.video_tower.encoder.layers.X.temporal_layer_norm.bias
```
The `-hf` version uses `CLIPVisionModel` for both towers, so no temporal attention exists.
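For reference, the presence of these keys can be confirmed directly from the original checkpoint's index file (a minimal sketch using `huggingface_hub`; repo and filename as cited above):
```python
import json
from huggingface_hub import hf_hub_download

# Download only the safetensors index of the original (non-hf) checkpoint.
index_path = hf_hub_download(
    repo_id="LanguageBind/Video-LLaVA-7B",
    filename="model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Count the per-layer temporal attention tensors in the original video tower.
temporal_keys = [k for k in weight_map if "temporal_attn" in k or "temporal_layer_norm" in k]
print(f"{len(temporal_keys)} temporal parameters in the original video tower")
```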
2. Weight comparison: `video_tower` ≈ `image_tower`
```python
import torch

video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())

same, diff = [], []
for name in video_params:
    if name in image_params:
        if torch.equal(video_params[name], image_params[name]):
            same.append(name)
        else:
            diff.append(name)

print(f"Same: {len(same)}, Different: {len(diff)}")
```
Result:
- Different (only 3): `vision_model.embeddings.class_embedding`, `vision_model.post_layernorm.weight`, `vision_model.post_layernorm.bias`
- Same: all remaining ~300 parameters
This means the video tower and image tower are effectively the same model. In the original Video-LLaVA-7B, these towers should have substantially different weights because they were pretrained separately (LanguageBind-Video on video-text pairs, LanguageBind-Image on image-text pairs).
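As an additional diagnostic (a minimal sketch, not part of the original comparison; `model` is the `-hf` checkpoint loaded as in the Reproduction section below), one can inspect how far apart the three non-identical tensors actually are:
```python
import torch

video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())

# Inspect the magnitude of the differences in the few non-identical tensors.
for name in video_params:
    if name in image_params and not torch.equal(video_params[name], image_params[name]):
        delta = (video_params[name].float() - image_params[name].float()).abs()
        print(f"{name}: max |diff| = {delta.max().item():.4e}, mean |diff| = {delta.mean().item():.4e}")
```
If these three tensors differ only marginally as well, that would further suggest both towers were derived from the same set of weights.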
Impact
- The `-hf` model is not a faithful conversion of the original Video-LLaVA.
- Users loading the model via `VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")` are running inference with what is essentially two copies of the same image encoder and no temporal modeling.
- Benchmark results from the `-hf` version do not reflect the actual Video-LLaVA architecture or the performance described in the paper.
- Follow-up research using this model as a baseline is comparing against a degraded, incorrectly converted model.
cc @zucchini-nlp
Could you clarify how `Video-LLaVA-7B-hf` was converted? Specifically:
- How were the video tower weights handled during conversion, given that the original model contains `temporal_attn` layers that don't exist in `CLIPVisionModel`?
- Were the LanguageBind-Video weights intentionally mapped to a standard CLIP architecture, or were the image tower weights duplicated into the video tower?
- Were the temporal attention weights simply discarded, or was there a merging/distillation step?
The weight comparison suggests the video tower may have received the same (or nearly the same) weights as the image tower, which would mean the conversion lost both the architectural differences and the distinct pretrained representations.
Environment
- transformers version: 4.46
- Model: `LanguageBind/Video-LLaVA-7B-hf`
Who can help?
@zucchini-nlp
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```python
import torch
from transformers import VideoLlavaForConditionalGeneration

model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf",
    torch_dtype=torch.float16,
)

# Architecture check: both are plain CLIPVisionModel
print(model.video_tower)
print(model.image_tower)

# Weight check
video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())
diff = [name for name in video_params
        if name in image_params and not torch.equal(video_params[name], image_params[name])]

print(f"Only {len(diff)} parameters differ:")
for n in diff:
    print(f"  {n}")

# Expected: most weights should differ
# Actual: only 3 differ
```
Expected Behavior
The `-hf` conversion should:
- Include temporal attention layers in the video tower, matching the original architecture
- Load the correct LanguageBind-Video pretrained weights (distinct from LanguageBind-Image)
Or, at minimum, the model card should clearly state that this is not equivalent to the original model.
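If the conversion is fixed, a check along these lines could verify both points (a hypothetical sketch; the exact module names depend on how the temporal attention ends up being implemented in transformers):
```python
# Hypothetical post-fix verification (module names are assumptions):
# 1) the video tower should expose temporal attention submodules,
# 2) its weights should diverge from the image tower.
temporal_modules = [n for n, _ in model.video_tower.named_modules() if "temporal" in n]
assert temporal_modules, "video tower exposes no temporal attention layers"

video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())
identical = [n for n in video_params
             if n in image_params and torch.equal(video_params[n], image_params[n])]
# Separately pretrained LanguageBind-Video / LanguageBind-Image encoders
# should not share the vast majority of their weights.
assert len(identical) < len(video_params) // 2, "towers still look like copies"
```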