[Video-LLaVA] Video-LLaVA-7B-hf video_tower is missing temporal attention AND shares nearly identical weights with image_tower #43818

@jong980812

Description

System Info

Problem

The HF-converted model LanguageBind/Video-LLaVA-7B-hf has two critical problems in its video tower:

  1. Missing temporal_attn layers: The original LanguageBind/Video-LLaVA-7B video tower contains per-layer temporal attention for cross-frame reasoning. These are completely absent in the -hf version.
  2. video_tower and image_tower have nearly identical weights: Only 3 out of ~300 parameter tensors differ between the two towers. This should not be the case — the original model uses separately pretrained LanguageBind-Video and LanguageBind-Image encoders with distinct weights.

Evidence

1. Original model has temporal attention in the video tower

In LanguageBind/Video-LLaVA-7B (model.safetensors.index.json), the video tower contains temporal_attn layers per encoder block:

model.video_tower.video_tower.encoder.layers.X.temporal_attn.k_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.v_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.q_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_attn.out_proj.weight
model.video_tower.video_tower.encoder.layers.X.temporal_layer_norm.weight
model.video_tower.video_tower.encoder.layers.X.temporal_layer_norm.bias

The -hf version uses CLIPVisionModel for both towers — no temporal attention exists.
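As a quick cross-check, the two checkpoint indices can be compared directly without loading either model. This is a minimal sketch; it assumes both repos ship a sharded checkpoint with a model.safetensors.index.json (the original one does, per the listing above).

import json
from huggingface_hub import hf_hub_download

def checkpoint_keys(repo_id):
    # Read the sharded-checkpoint index and return every parameter name it lists
    index_path = hf_hub_download(repo_id, "model.safetensors.index.json")
    with open(index_path) as f:
        return set(json.load(f)["weight_map"])

original = checkpoint_keys("LanguageBind/Video-LLaVA-7B")
converted = checkpoint_keys("LanguageBind/Video-LLaVA-7B-hf")

print(sum("temporal_attn" in k for k in original))   # non-zero in the original checkpoint
print(sum("temporal_attn" in k for k in converted))  # 0 in the -hf conversion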

2. Weight comparison: video_tower ≈ image_tower

import torch

# "model" is the -hf checkpoint loaded as in the Reproduction section below
video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())

same, diff = [], []
for name in video_params:
    if name in image_params:
        if torch.equal(video_params[name], image_params[name]):
            same.append(name)
        else:
            diff.append(name)

print(f"Same: {len(same)}, Different: {len(diff)}")

Result:

  • Different (only 3):
    • vision_model.embeddings.class_embedding
    • vision_model.post_layernorm.weight
    • vision_model.post_layernorm.bias
  • Same: all remaining ~300 parameters

This means the video tower and image tower are effectively the same model. In the original Video-LLaVA-7B, these towers should have substantially different weights because they were pretrained separately (LanguageBind-Video on video-text pairs, LanguageBind-Image on image-text pairs).

Impact

  • The -hf model is not a faithful conversion of the original Video-LLaVA.
  • Users loading the model via VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf") are running inference with what is essentially two copies of the same image encoder with no temporal modeling.
  • Benchmark results from the -hf version do not reflect the actual Video-LLaVA architecture or performance described in the paper.
  • Follow-up research using this model as a baseline is comparing against a degraded, incorrectly converted model.

cc @zucchini-nlp

Could you clarify how Video-LLaVA-7B-hf was converted? Specifically:

  • How were the video tower weights handled during conversion, given that the original model contains temporal_attn layers that don't exist in CLIPVisionModel?
  • Were the LanguageBind-Video weights intentionally mapped to a standard CLIP architecture, or were the image tower weights duplicated into the video tower?
  • Were the temporal attention weights simply discarded, or was there a merging/distillation step?

The weight comparison suggests the video tower may have received the same (or nearly the same) weights as the image tower, which would mean the conversion lost both the architectural differences and the distinct pretrained representations.
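One way to test this hypothesis directly, without instantiating the full model, is to pull a single attention weight from each checkpoint and compare it. This is only an illustrative sketch: the key names are assumptions (the -hf names follow the CLIPVisionModel convention, the original name follows the pattern of the keys listed above) and should be confirmed against each repo's index file.

import json
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

def load_tensor(repo_id, key):
    # Download only the shard that the safetensors index maps this key to
    index_path = hf_hub_download(repo_id, "model.safetensors.index.json")
    with open(index_path) as f:
        shard = json.load(f)["weight_map"][key]
    return load_file(hf_hub_download(repo_id, shard))[key]

# Key names are assumptions, see note above
hf_video = load_tensor("LanguageBind/Video-LLaVA-7B-hf",
                       "video_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight")
hf_image = load_tensor("LanguageBind/Video-LLaVA-7B-hf",
                       "image_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight")
original = load_tensor("LanguageBind/Video-LLaVA-7B",
                       "model.video_tower.video_tower.encoder.layers.0.self_attn.q_proj.weight")

print(torch.equal(hf_video, hf_image))                     # True -> towers were duplicated
print(torch.allclose(hf_video.float(), original.float()))  # True -> video weights were carried over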

Environment

  • transformers version: 4.46
  • Model: LanguageBind/Video-LLaVA-7B-hf


Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import VideoLlavaForConditionalGeneration

model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf",
    torch_dtype=torch.float16
)

# Architecture check: both are plain CLIPVisionModel
print(model.video_tower)
print(model.image_tower)

# Weight check
video_params = dict(model.video_tower.named_parameters())
image_params = dict(model.image_tower.named_parameters())

diff = [name for name in video_params
        if name in image_params and not torch.equal(video_params[name], image_params[name])]

print(f"Only {len(diff)} parameters differ:")
for n in diff:
    print(f"  {n}")
# Expected: most weights should differ
# Actual: only 3 differ

Expected behavior

The -hf conversion should:

  1. Include temporal attention layers in the video tower matching the original architecture
  2. Load the correct LanguageBind-Video pretrained weights (distinct from LanguageBind-Image)

Or at minimum, the model card should clearly state that this is not equivalent to the original model.
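For reference, the key names listed under Evidence imply roughly the following per-block structure for the original video tower. This is only an illustrative sketch of where a temporal branch would sit (ViT-L/14 sizes assumed); the exact placement, reshaping, and residual wiring are assumptions, not a verified reimplementation of LanguageBind-Video.

import torch
import torch.nn as nn

class TemporalCLIPEncoderLayer(nn.Module):
    # CLIP-style encoder block with an extra temporal self-attention branch,
    # mirroring the temporal_attn / temporal_layer_norm keys in the original checkpoint.

    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.temporal_layer_norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.layer_norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.layer_norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, num_frames):
        # x: (batch * num_frames, seq_len, dim) -- per-frame token sequences
        bt, s, d = x.shape
        b = bt // num_frames

        # Temporal branch: each spatial position attends across the frame axis
        t = x.view(b, num_frames, s, d).permute(0, 2, 1, 3).reshape(b * s, num_frames, d)
        t = self.temporal_layer_norm(t)
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.view(b, s, num_frames, d).permute(0, 2, 1, 3).reshape(bt, s, d)

        # Standard CLIP spatial attention + MLP
        h = self.layer_norm1(x)
        h, _ = self.self_attn(h, h, h)
        x = x + h
        x = x + self.mlp(self.layer_norm2(x))
        return x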
