I tried to reproduce the model. Below are the steps I followed:
- pretrain
First, I ran scripts/pretrain.sh, which produces the projector. The pretraining data comes from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain. I also added some lines to prepare_inputs_labels_for_multimodal in model/llava_arch.py, because the input dimension was incorrect when using liuhaotian/LLaVA-Pretrain directly: specifically, I unsqueezed the image tensor to match the expected 5-dimensional input, and used a batch size of 1 in case the modification caused other errors.
- finetune
Then, I executed scripts/finetune.sh, using the projector from step 1 and the Qwen-224k LLM from Hugging Face. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data is the dataset I used.
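For reference, the dimension workaround in the pretrain step was roughly the following. This is a minimal sketch: `ensure_5d` is a hypothetical helper name, and the (B, 1, C, H, W) layout is my assumption about what prepare_inputs_labels_for_multimodal expects.

```python
import torch

def ensure_5d(images: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: LLaVA-Pretrain batches come out as 4-D
    (B, C, H, W) tensors, while the model path here expects a 5-D
    (B, num_patches, C, H, W) input. Insert a singleton patch
    dimension when it is missing; pass 5-D tensors through unchanged."""
    if images.ndim == 4:
        images = images.unsqueeze(1)  # (B, C, H, W) -> (B, 1, C, H, W)
    return images
```

With this in place I kept batch size at 1 so that any remaining shape mismatch would surface immediately rather than being hidden by padding across a batch.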
From this, I obtained what I believe is "LongVA-7B" (I didn't run dpo.sh).
However, the test results differ substantially from those in the paper (possibly due to lmms-eval) and from the checkpoints released on Hugging Face.

I noticed that LLaVA-NeXT-Data contains some private data, as mentioned in #10 and in the Hugging Face dataset repo.
Could the private data used during training account for the difference?