I tried to reproduce the model. Below are the steps I followed:
- pretrain
First, I ran scripts/pretrain.sh, which produces the projector. The pretraining data comes from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain. I also added some lines to prepare_inputs_labels_for_multimodal in model/llava_arch.py, because the input dimension was incorrect when using liuhaotian/LLaVA-Pretrain directly: specifically, I unsqueezed the image tensor to match the expected 5-dimensional input, and used a batch size of 1 in case the modification caused other errors.
- finetune
Then, I executed scripts/finetune.sh, using the projector from step 1 and the Qwen-224k LLM from Hugging Face. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data is the dataset I used.
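For reference, the dimension workaround in the pretrain step was roughly the following. This is a minimal sketch: `ensure_5d` is a hypothetical helper name, and the (B, 1, C, H, W) layout is my assumption about what prepare_inputs_labels_for_multimodal expects.

```python
import torch

def ensure_5d(images: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: LLaVA-Pretrain batches come out as 4-D
    (B, C, H, W) tensors, while the model path here expects a 5-D
    (B, num_patches, C, H, W) input. Insert a singleton patch
    dimension when it is missing; pass 5-D tensors through unchanged."""
    if images.ndim == 4:
        images = images.unsqueeze(1)  # (B, C, H, W) -> (B, 1, C, H, W)
    return images
```

With this in place I kept batch size at 1 so that any remaining shape mismatch would surface immediately rather than being hidden by padding across a batch.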
From this, I obtained what I believe is "LongVA-7B" (I didn't run dpo.sh).
However, the test results differ substantially from those in the paper (possibly due to lmms-eval) and from the checkpoints released on Hugging Face.

I noticed that LLaVA-NeXT-Data contains some private data, as mentioned in #10 and in the Hugging Face dataset repo.
Could the private data used during training account for the difference?