# Tiny LVLM Evaluation

## Environments

```shell
conda create -n lvlm_eval python=3.8 -y
conda activate lvlm_eval
pip install -r requirements.txt
```

## Model checkpoints

Most weights and checkpoint files are downloaded automatically when the corresponding testers are initialized. However, a few files must be downloaded manually and placed in a single directory. Then set the variable DATA_DIR in models/__init__.py to that directory. The downloaded files should be organized as follows:

```
/path/to/DATA_DIR
├── llama_checkpoints
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   └── tokenizer.model
├── MiniGPT-4
│   ├── alignment.txt
│   └── pretrained_minigpt4_7b.pth
├── VPGTrans_Vicuna
├── otter-9b-hf
└── PandaGPT
    ├── imagebind_ckpt
    ├── vicuna_ckpt
    └── pandagpt_ckpt
```

- For LLaMA-Adapter-v2, please obtain the LLaMA backbone weights using this form.
- For MiniGPT-4, please download alignment.txt and pretrained_minigpt4_7b.pth.
- For VPGTrans, please download VPGTrans_Vicuna.
- For Otter, you can download the version we used in our evaluation from this repo. Note, however, that the authors of Otter have since updated their model, which outperforms the version used in our evaluation; please check their GitHub repo for the latest version.
- For PandaGPT, please follow the instructions here to prepare the weights of ImageBind, Vicuna, and PandaGPT.
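If you prefer to lay out the directory before downloading, the expected skeleton can be created up front. This is a minimal sketch of our own (the helper below is not part of the repo), matching the tree shown above:

```python
# Hypothetical helper (not part of this repo): create the empty DATA_DIR
# skeleton matching the tree shown above. Checkpoint files must still be
# downloaded separately and dropped into these folders.
from pathlib import Path

def make_data_dir_skeleton(root: str) -> None:
    subdirs = [
        "llama_checkpoints/7B",
        "MiniGPT-4",
        "VPGTrans_Vicuna",
        "otter-9b-hf",
        "PandaGPT/imagebind_ckpt",
        "PandaGPT/vicuna_ckpt",
        "PandaGPT/pandagpt_ckpt",
    ]
    for sub in subdirs:
        # parents=True creates intermediate dirs; exist_ok makes reruns safe
        Path(root, sub).mkdir(parents=True, exist_ok=True)

make_data_dir_skeleton("/tmp/DATA_DIR_demo")
```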

## Prompt Engineering

The table below lists the prompt used for each dataset; the same prompts are applied to all multimodal models under study.

| Prompt | Dataset |
| :--- | :--- |
| Classify the main object in the image. | ImageNet1K, CIFAR10 |
| What breed is the flower in the image? | Flowers102 |
| What breed is the pet in the image? | OxfordIIITPet |
| What is written in the image? | All 12 OCR datasets |
| Question: {question}\nChoose the best answer from the following choices:\n- option#1\n- option#2\n- option#3\n | IconQA |
| Context:\n{context}\n\nQuestion: {question}\nChoose the best answer from the following choices:\n- option#1\n- option#2\n- option#3 | ScienceQA |
| Question: {question}\n\nChoose the single most likely answer from the following choices <choice>:\n- Yes\n- No\n\nThe output format follows exactly as below:\nAnswer: <choice> | MSCOCO_MCI, VCR_MCI |
| Question: Is the caption "{caption}" correctly describing the image?\n\nChoose the single most likely answer from the following choices <choice>:\n- Yes\n- No\n\nThe output format follows exactly as below:\nAnswer: <choice> | VSR |
| use original questions | other datasets |
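The placeholders in these templates ({question}, {context}, {caption}, option#k) are filled in per sample. As an illustration only (this is not the repo's actual code, and the caption is made up), the VSR template above could be instantiated with Python's str.format:

```python
# Illustrative only: fill the VSR prompt template from the table above.
VSR_TEMPLATE = (
    'Question: Is the caption "{caption}" correctly describing the image?\n\n'
    "Choose the single most likely answer from the following choices <choice>:\n"
    "- Yes\n"
    "- No\n\n"
    "The output format follows exactly as below:\n"
    "Answer: <choice>"
)

# The caption here is a made-up example, not from any dataset.
prompt = VSR_TEMPLATE.format(caption="The cat is under the table")
print(prompt)
```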

## Datasets and Evaluation

```shell
# Valid model and dataset names are listed in models/__init__.py and
# task_datasets/__init__.py.
# Pass --use-sampled when you have already prepared all the datasets from
# LVLM Evaluation and do not want to download the sampled data.
python eval_tiny.py \
    --model_name $MODEL \
    --device $CUDA_DEVICE_INDEX \
    --batch-size $EVAL_BATCH_SIZE \
    --dataset-names $DATASET_NAMES \
    --sampled-root $SAMPLED_DATASET_DIR \
    --answer_path $SAVE_DIR \
    --use-sampled
```

The datasets used in Tiny LVLM Evaluation are subsets of the datasets used in LVLM Evaluation. Therefore, you can download the sampled subset here and use it directly. The script sample_dataset.py samples the subsets used in Tiny LVLM Evaluation and saves them.

In addition, the inference results on all 42 datasets for the 12 multimodal models studied in Tiny LVLM-eHub, including Bard, can be downloaded from Google Drive.

## Ability-level Benchmark

Beyond including part of the data from the Tiny LVLM Evaluation, we present an improved dataset split that categorizes the datasets featured in the Tiny LVLM Evaluation by the specific ability each one targets. For each ability, we curate a subset of datasets that match the evaluation criteria and aggregate these subsets into an ability-level subset, excluding those related to embodied intelligence. The benchmark also covers recently released models for broader comprehensiveness.

You can download the ability-level subset here, and the inference results of all 20 multimodal models included in our benchmark can be found here.

Here is an example command for using this benchmark:

```shell
python updated_eval_tiny.py \
    --model-name $MODEL \
    --device $CUDA_DEVICE_INDEX \
    --batch-size $EVAL_BATCH_SIZE \
    --sampled-root $ROOT_DIR_OF_SAMPLED_SUBSETS \
    --answer_path $SAVE_DIR
```

For detailed performance metrics, please refer to the following tables.

### Overall Score

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | InternVL | InternVL-Chat | 327.61 |
| 🥈 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 322.51 |
| 🥉 | Bard | Bard | 319.59 |
| 4 | Qwen-VL-Chat | Qwen-VL-Chat | 316.81 |
| 5 | LLaVA-1.5 | Vicuna-7B | 307.17 |
| 6 | InstructBLIP | Vicuna-7B | 300.64 |
| 7 | InternLM-XComposer | InternLM-XComposer-7B | 288.89 |
| 8 | BLIP2 | FlanT5xl | 284.72 |
| 9 | BLIVA | Vicuna-7B | 284.17 |
| 10 | Lynx | Vicuna-7B | 279.24 |
| 11 | Cheetah | Vicuna-7B | 258.91 |
| 12 | LLaMA-Adapter-v2 | LLaMA-7B | 229.16 |
| 13 | VPGTrans | Vicuna-7B | 218.91 |
| 14 | Otter-Image | Otter-9B-LA-InContext | 216.43 |
| 15 | VisualGLM-6B | VisualGLM-6B | 211.98 |
| 16 | mPLUG-Owl | LLaMA-7B | 209.40 |
| 17 | LLaVA | Vicuna-7B | 200.93 |
| 18 | MiniGPT-4 | Vicuna-7B | 192.62 |
| 19 | Otter | Otter-9B | 180.87 |
| 20 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 176.37 |
| 21 | PandaGPT | Vicuna-7B | 174.25 |
| 22 | LaVIN | LLaMA-7B | 97.51 |
| 23 | MIC | FlanT5xl | 94.09 |

### Visual Reasoning

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Bard | Bard | 64.18 |
| 🥈 | Qwen-VL-Chat | Qwen-VL-Chat | 62.36 |
| 🥉 | InternVL | InternVL-Chat | 56.36 |
| 4 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 55.82 |
| 5 | LLaVA-1.5 | Vicuna-7B | 55.64 |
| 6 | Lynx | Vicuna-7B | 52.18 |
| 7 | InternLM-XComposer | InternLM-XComposer-7B | 48.00 |
| 8 | InstructBLIP | Vicuna-7B | 46.73 |
| 9 | BLIP2 | FlanT5xl | 44.91 |
| 10 | LLaVA | Vicuna-7B | 44.36 |
| 11 | LLaMA-Adapter-v2 | LLaMA-7B | 43.45 |
| 12 | Otter-Image | Otter-9B-LA-InContext | 41.64 |
| 13 | mPLUG-Owl | LLaMA-7B | 40.91 |
| 14 | Cheetah | Vicuna-7B | 40.00 |
| 15 | BLIVA | Vicuna-7B | 38.73 |
| 16 | MiniGPT-4 | Vicuna-7B | 37.64 |
| 17 | VisualGLM-6B | VisualGLM-6B | 37.27 |
| 18 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 33.64 |
| 19 | PandaGPT | Vicuna-7B | 33.45 |
| 20 | Otter | Otter-9B | 29.82 |
| 21 | VPGTrans | Vicuna-7B | 27.27 |
| 22 | LaVIN | LLaMA-7B | 20.36 |
| 23 | MIC | FlanT5xl | 11.09 |

### Visual Perception

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Lynx | Vicuna-7B | 65.75 |
| 🥈 | Bard | Bard | 57.00 |
| 🥉 | InternLM-XComposer | InternLM-XComposer-7B | 56.25 |
| 4 | Qwen-VL-Chat | Qwen-VL-Chat | 54.50 |
| 5 | InternVL | InternVL-Chat | 52.25 |
| 6 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 53.75 |
| 7 | BLIP2 | FlanT5xl | 49.00 |
| 8 | LLaVA-1.5 | Vicuna-7B | 49.00 |
| 9 | InstructBLIP | Vicuna-7B | 48.00 |
| 10 | BLIVA | Vicuna-7B | 46.75 |
| 11 | LLaMA-Adapter-v2 | LLaMA-7B | 46.75 |
| 12 | Cheetah | Vicuna-7B | 43.25 |
| 13 | mPLUG-Owl | LLaMA-7B | 40.75 |
| 14 | MiniGPT-4 | Vicuna-7B | 37.75 |
| 15 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 37.25 |
| 16 | Otter | Otter-9B | 37.00 |
| 17 | LLaVA | Vicuna-7B | 36.50 |
| 18 | VisualGLM-6B | VisualGLM-6B | 36.25 |
| 19 | Otter-Image | Otter-9B-LA-InContext | 33.25 |
| 20 | PandaGPT | Vicuna-7B | 33.00 |
| 21 | VPGTrans | Vicuna-7B | 31.25 |
| 22 | LaVIN | LLaMA-7B | 20.00 |
| 23 | MIC | FlanT5xl | 0.75 |

### Visual Knowledge Acquisition

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Bard | Bard | 68.14 |
| 🥈 | InternVL | InternVL-Chat | 68.00 |
| 🥉 | InternLM-XComposer | InternLM-XComposer-7B | 66.57 |
| 4 | BLIP2 | FlanT5xl | 64.14 |
| 5 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 64.14 |
| 6 | BLIVA | Vicuna-7B | 63.43 |
| 7 | InstructBLIP | Vicuna-7B | 61.71 |
| 8 | LLaVA-1.5 | Vicuna-7B | 57.00 |
| 9 | Qwen-VL-Chat | Qwen-VL-Chat | 55.14 |
| 10 | VPGTrans | Vicuna-7B | 49.86 |
| 11 | VisualGLM-6B | VisualGLM-6B | 46.86 |
| 12 | Cheetah | Vicuna-7B | 46.86 |
| 13 | LLaMA-Adapter-v2 | LLaMA-7B | 22.29 |
| 14 | LLaVA | Vicuna-7B | 18.00 |
| 15 | MiniGPT-4 | Vicuna-7B | 17.57 |
| 16 | Lynx | Vicuna-7B | 17.57 |
| 17 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 17.29 |
| 18 | mPLUG-Owl | LLaMA-7B | 16.14 |
| 19 | Otter-Image | Otter-9B-LA-InContext | 15.14 |
| 20 | Otter | Otter-9B | 12.71 |
| 21 | MIC | FlanT5xl | 7.71 |
| 22 | PandaGPT | Vicuna-7B | 3.00 |
| 23 | LaVIN | LLaMA-7B | 2.14 |

### Visual Commonsense

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | InternVL | InternVL-Chat | 62.00 |
| 🥈 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 61.80 |
| 🥉 | Bard | Bard | 59.60 |
| 4 | InstructBLIP | Vicuna-7B | 59.20 |
| 5 | BLIVA | Vicuna-7B | 58.60 |
| 6 | Lynx | Vicuna-7B | 57.40 |
| 7 | LLaVA-1.5 | Vicuna-7B | 57.20 |
| 8 | LLaMA-Adapter-v2 | LLaMA-7B | 56.00 |
| 9 | Qwen-VL-Chat | Qwen-VL-Chat | 54.80 |
| 10 | Otter-Image | Otter-9B-LA-InContext | 52.40 |
| 11 | Cheetah | Vicuna-7B | 51.80 |
| 12 | PandaGPT | Vicuna-7B | 51.80 |
| 13 | mPLUG-Owl | LLaMA-7B | 50.60 |
| 14 | InternLM-XComposer | InternLM-XComposer-7B | 50.40 |
| 15 | MiniGPT-4 | Vicuna-7B | 49.00 |
| 16 | VPGTrans | Vicuna-7B | 48.20 |
| 17 | Otter | Otter-9B | 48.00 |
| 18 | LLaVA | Vicuna-7B | 47.40 |
| 19 | BLIP2 | FlanT5xl | 44.00 |
| 20 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 39.20 |
| 21 | VisualGLM-6B | VisualGLM-6B | 37.60 |
| 22 | LaVIN | LLaMA-7B | 35.00 |
| 23 | MIC | FlanT5xl | 24.20 |

### Object Hallucination

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Qwen-VL-Chat | Qwen-VL-Chat | 90.00 |
| 🥈 | InternVL | InternVL-Chat | 89.00 |
| 🥉 | LLaVA-1.5 | Vicuna-7B | 88.33 |
| 4 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 87.00 |
| 5 | Lynx | Vicuna-7B | 86.33 |
| 6 | InstructBLIP | Vicuna-7B | 85.00 |
| 7 | BLIP2 | FlanT5xl | 82.67 |
| 8 | Cheetah | Vicuna-7B | 77.00 |
| 9 | BLIVA | Vicuna-7B | 76.67 |
| 10 | Otter-Image | Otter-9B-LA-InContext | 74.00 |
| 11 | Bard | Bard | 70.67 |
| 12 | InternLM-XComposer | InternLM-XComposer-7B | 67.67 |
| 13 | VPGTrans | Vicuna-7B | 62.33 |
| 14 | mPLUG-Owl | LLaMA-7B | 61.00 |
| 15 | LLaMA-Adapter-v2 | LLaMA-7B | 60.67 |
| 16 | LLaVA | Vicuna-7B | 54.67 |
| 17 | VisualGLM-6B | VisualGLM-6B | 54.00 |
| 18 | Otter | Otter-9B | 53.33 |
| 19 | PandaGPT | Vicuna-7B | 53.00 |
| 20 | MiniGPT-4 | Vicuna-7B | 50.67 |
| 21 | MIC | FlanT5xl | 50.33 |
| 22 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 49.00 |
| 23 | LaVIN | LLaMA-7B | 20.00 |
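As a sanity check on the numbers above: a model's overall score appears to equal the sum of its five ability-level scores (this is our observation from the tables, not an officially stated formula). For example:

```python
# Observation from the tables above (not an official formula): the overall
# score equals the sum of the five ability-level scores.
# Tuple order: (reasoning, perception, knowledge acquisition,
#               commonsense, hallucination)
ability_scores = {
    "InternVL": (56.36, 52.25, 68.00, 62.00, 89.00),
    "Bard": (64.18, 57.00, 68.14, 59.60, 70.67),
}
overall = {"InternVL": 327.61, "Bard": 319.59}

for model, scores in ability_scores.items():
    total = round(sum(scores), 2)
    print(f"{model}: {total}")  # matches the Overall Score table
    assert total == overall[model]
```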