- The image pretraining dataset is from LLaVA.
- The image tuning dataset is from LLaVA.
- The video pretraining dataset is from Valley.
- The video tuning dataset is from Video-ChatGPT.
- Download the training annotations. You can download them from Baidu Disk, Google Disk, or Peking University Disk.
We provide the processed data on Hugging Face; you can also download it from Baidu Disk as follows.
After downloading all of them, organize the data as follows in `DATA_ROOT`.

```
DATA_ROOT
├── llava_image
├── llava_image_tune
├── valley
└── videochatgpt_tune
```
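For example, once the archives are downloaded, the layout above can be created along these lines (the archive names are placeholders for whatever files you actually obtained from Hugging Face or Baidu Disk):

```Shell
# Archive names are placeholders -- substitute the files you actually downloaded.
DATA_ROOT=/path/to/DATA_ROOT
mkdir -p "$DATA_ROOT"
for archive in llava_image.zip llava_image_tune.zip valley.zip videochatgpt_tune.zip; do
    unzip -q "$archive" -d "$DATA_ROOT"
done
```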
- For image, follow LLaVA's instructions. You MUST first download eval.zip. It contains custom annotations, scripts, and the prediction files for LLaVA v1.5. Extract it to `eval`. This also provides a general structure for all datasets.
- For video, videos and annotations can be downloaded from Video-ChatGPT. We also provide the processed data as follows.
| Datasets | Baidu Disk | Google Disk | Peking University Disk |
|---|---|---|---|
| Activitynet_Zero_Shot_QA | Link | - | - |
| MSRVTT_Zero_Shot_QA | Link | Link | - |
| MSVD_Zero_Shot_QA | Link | Link | Link |
| TGIF_Zero_Shot_QA | Link | Link | Link |
After downloading all of them, organize the data as follows in `eval`.

```
eval
├── GPT_Zero_Shot_QA
│ ├── Activitynet_Zero_Shot_QA
│ ├── MSRVTT_Zero_Shot_QA
│ ├── MSVD_Zero_Shot_QA
│ └── TGIF_Zero_Shot_QA
├── gqa
│ ├── answers
│ ├── data
│ └── llava_gqa_testdev_balanced.jsonl
├── llava-bench-in-the-wild
│ ├── answers
│ ├── answers_gpt4.jsonl
│ ├── bard_0718.jsonl
│ ├── bing_chat_0629.jsonl
│ ├── context.jsonl
│ ├── images
│ ├── questions.jsonl
│ ├── README.md
│ └── reviews
├── mmbench
│ ├── answers
│ ├── answers_upload
│ ├── mmbench_dev_20230712.tsv
│ └── mmbench_dev_en_20231003.tsv
├── MME
│ ├── answers
│ ├── convert_answer_to_mme.py
│ └── llava_mme.jsonl
├── mm-vet
│ ├── answers
│ ├── bard_set.json
│ ├── convert_answers.py
│ ├── images
│ ├── llava-mm-vet.jsonl
│ ├── mm-vet.json
│ └── results
├── pope
│ ├── answers
│ ├── coco
│ ├── llava_pope_test.jsonl
│ └── val2014
├── scienceqa
│ ├── answers
│ ├── images
│ ├── llava_test_CQM-A.json
│ ├── pid_splits.json
│ └── problems.json
├── seed_bench
│ ├── answers
│ ├── answers_upload
│ ├── extract_video_frames.py
│ └── llava-seed-bench.jsonl
├── textvqa
│ ├── answers
│ ├── llava_textvqa_val_v051_ocr.jsonl
│ ├── TextVQA_0.5.1_val.json
│ └── train_images
├── vizwiz
│ ├── answers
│ ├── answers_upload
│ ├── llava_test.jsonl
│ ├── test
│ ├── test.json
│ ├── train.json
│ └── val.json
└── vqav2
├── answers
├── answers_upload
├── llava_vqav2_mscoco_test2015.jsonl
├── llava_vqav2_mscoco_test-dev2015.jsonl
    └── test2015
```
Specify your `DATA_ROOT` according to the data preparation above; a sketch of how it is wired into the training scripts follows the list below.
- Stage 1 pretraining script: pretrain.sh.
- Stage 2 tuning script: finetune.sh or finetune_lora.sh.
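A minimal sketch of how `DATA_ROOT` and the two stages fit together; the script paths below and the way `DATA_ROOT` is consumed are assumptions, so check the scripts directory and edit the data-path variables inside the scripts if they differ:

```Shell
# Script paths and DATA_ROOT handling are assumptions -- verify against the repo's scripts.
export DATA_ROOT=/path/to/DATA_ROOT
bash scripts/v1_5/pretrain.sh        # Stage 1: pretraining
bash scripts/v1_5/finetune.sh        # Stage 2: full tuning
# or
bash scripts/v1_5/finetune_lora.sh   # Stage 2: LoRA tuning
```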
Our image validation code comes from LLaVA and our video validation code comes from Video-ChatGPT; thanks to both projects for their contributions!
You can refer to the official repositories for validation, but we also provide off-the-shelf scripts.
To load unmerged LoRA weights, you simply need to pass an additional argument `--model-base`, which is the base LLM that was used to train the LoRA weights.
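For illustration, a hypothetical invocation in the style of LLaVA's eval entry points; the module name, checkpoint paths, and file names below are placeholders, and the provided eval scripts already wire up everything except `--model-base`:

```Shell
# Hypothetical example -- module name and paths are placeholders, not the exact
# entry point used by the scripts in this repo.
python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/llava-v1.5-7b-lora \
    --model-base lmsys/vicuna-7b-v1.5 \
    --question-file eval/textvqa/llava_textvqa_val_v051_ocr.jsonl \
    --image-folder eval/textvqa/train_images \
    --answers-file eval/textvqa/answers/llava-v1.5-7b-lora.jsonl
```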
- Run inference on MSRVTT-QA to get the results.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msrvtt.sh
- GPT-assisted evaluation (see the API-key note below).
bash scripts/v1_5/eval/eval_qa_msrvtt.sh
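The GPT-assisted evaluation scripts call the OpenAI API. How the key is supplied is script-specific (an assumption here): export it beforehand as sketched below, or edit the key variable inside the script if it expects one there. The same note applies to the MSVD, TGIF, and ActivityNet evaluation scripts that follow.

```Shell
# Placeholder key -- supply your own; verify how the script actually reads credentials.
export OPENAI_API_KEY="sk-your-key"
bash scripts/v1_5/eval/eval_qa_msrvtt.sh
```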
- Run inference on MSVD-QA to get the results.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh
- GPT-assisted evaluation.
bash scripts/v1_5/eval/eval_qa_msvd.sh
- Run inference on TGIF-QA to get the results.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_tgif.sh
- GPT-assisted evaluation.
bash scripts/v1_5/eval/eval_qa_tgif.sh
- Run inference on ActivityNet-QA to get the results.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_activitynet.sh
- GPT-assisted evaluation.
bash scripts/v1_5/eval/eval_qa_activitynet.sh
- Download `test2015` and put it under `eval/vqav2` (see the download sketch below).
- Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_vqav2.sh
- Submit the results to the evaluation server: `eval/vqav2/answers_upload`.
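If helpful, `test2015` can be fetched directly from the standard COCO image host (the URL is an assumption; verify it against the VQAv2 download page):

```Shell
# URL is an assumption -- verify against the VQAv2 download page before use.
wget http://images.cocodataset.org/zips/test2015.zip
unzip -q test2015.zip -d eval/vqav2/
```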
- Download the data following the official instructions here and put it under `eval/gqa/data` (see the download sketch below).
- Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_gqa.sh
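As a sketch, the GQA images and questions can be fetched from the official Stanford host (URLs are assumptions taken from the GQA download page; verify them there):

```Shell
# URLs are assumptions -- check the GQA download page before use.
mkdir -p eval/gqa/data
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
wget https://downloads.cs.stanford.edu/nlp/data/gqa/questions1.2.zip
unzip -q images.zip -d eval/gqa/data
unzip -q questions1.2.zip -d eval/gqa/data
```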
- Download the VizWiz data and put it under `eval/vizwiz` (see the directory layout above).
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_vizwiz.sh
- Submit the results to the evaluation server: `eval/vizwiz/answers_upload`.
- Under `eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo (see the fetch sketch below).
- Single-GPU inference and evaluation.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_sqa.sh
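A possible way to fetch the annotation files, assuming the upstream ScienceQA repository URL and layout (verify both against the ScienceQA project):

```Shell
# Repository URL and layout are assumptions -- verify against the ScienceQA project.
git clone https://github.com/lupantech/ScienceQA.git
mkdir -p eval/scienceqa
cp ScienceQA/data/scienceqa/pid_splits.json ScienceQA/data/scienceqa/problems.json eval/scienceqa/
# The images are distributed separately; follow the ScienceQA repo's instructions
# to place them under eval/scienceqa/images.
```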
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `eval/textvqa` (see the download sketch below).
- Single-GPU inference and evaluation.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_textvqa.sh
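A download sketch, assuming the usual TextVQA distribution URLs (verify them on the TextVQA website before use):

```Shell
# URLs are assumptions -- verify on the TextVQA website before use.
wget -P eval/textvqa https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip -q train_val_images.zip -d eval/textvqa   # expected to yield train_images/
```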
- Download `coco` from POPE and put it under `eval/pope` (see the fetch sketch below).
- Single-GPU inference and evaluation.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_pope.sh
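A possible way to obtain the `coco` annotation files, assuming the POPE repository URL and layout below (verify both against the POPE project):

```Shell
# Repository URL and layout are assumptions -- verify against the POPE project.
git clone https://github.com/RUCAIBox/POPE.git
mkdir -p eval/pope/coco
cp POPE/output/coco/*.json eval/pope/coco/
# The COCO val2014 images are also needed under eval/pope/val2014 (see the layout above).
```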
- Download `mmbench_dev_20230712.tsv` and put it under `eval/mmbench` (see the download sketch below).
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmbench.sh
- Submit the results to the evaluation server: `eval/mmbench/answers_upload/mmbench_dev_20230712`.
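For reference, the dev TSV is commonly fetched from the OpenMMLab host (the URL is an assumption; verify it is still current):

```Shell
# URL is an assumption -- verify before use.
wget -P eval/mmbench https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv
```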
- Extract the contents of `llava-bench-in-the-wild` to `eval/llava-bench-in-the-wild` (see the download sketch below).
- Single-GPU inference and evaluation.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_llavabench.sh
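One way to obtain the benchmark files is from the Hugging Face dataset hub (the repo id below is an assumption; check the LLaVA project page if it differs):

```Shell
# Dataset repo id is an assumption -- verify before use.
huggingface-cli download liuhaotian/llava-bench-in-the-wild \
    --repo-type dataset --local-dir eval/llava-bench-in-the-wild
```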
- Extract `mm-vet.zip` to `eval/mm-vet` (see the download sketch below).
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmvet.sh
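The archive is commonly fetched from the MM-Vet GitHub release (the URL is an assumption; verify it against the MM-Vet repository and check that the extracted layout matches the `eval/mm-vet` entry in the directory tree above):

```Shell
# URL is an assumption -- verify against the MM-Vet repository before use.
wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip
unzip -q mm-vet.zip -d eval/mm-vet
```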