```shell
conda create -n lvlm_eval python=3.8 -y
pip install -r requirements.txt
```

Most weights and checkpoint files will be downloaded automatically when the corresponding testers are initialized. However, there are some files you should download manually and put in a single directory. Then please replace the variable DATA_DIR in models/__init__.py with the directory where you saved these files. Please note that the downloaded files should be organized as follows:
```
/path/to/DATA_DIR
├── llama_checkpoints
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   └── tokenizer.model
├── MiniGPT-4
│   ├── alignment.txt
│   └── pretrained_minigpt4_7b.pth
├── VPGTrans_Vicuna
├── otter-9b-hf
└── PandaGPT
    ├── imagebind_ckpt
    ├── vicuna_ckpt
    └── pandagpt_ckpt
```
- For LLaMA-Adapter-v2, please obtain the LLaMA backbone weights using this form.
- For MiniGPT-4, please download alignment.txt and pretrained_minigpt4_7b.pth.
- For VPGTrans, please download VPGTrans_Vicuna.
- For Otter, you can download the version we used in our evaluation from this repo. However, please note that the authors of Otter have since updated their model, which performs better than the version used in our evaluation; please check their GitHub repo for the newest version.
- For PandaGPT, please follow the instructions here to prepare the weights of ImageBind, Vicuna, and PandaGPT.
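Before launching an evaluation, it can save time to confirm the checkpoint layout matches the tree above. The following is a minimal sketch of such a check; `DATA_DIR` is a placeholder path, the file list is taken from the tree above, and this helper script is not part of the repo:

```python
# Sketch: verify that the manually downloaded checkpoints follow the
# expected DATA_DIR layout before initializing the testers.
from pathlib import Path

DATA_DIR = Path("/path/to/DATA_DIR")  # must match DATA_DIR in models/__init__.py

# A few of the required files from the tree above (extend as needed).
REQUIRED = [
    "llama_checkpoints/7B/checklist.chk",
    "llama_checkpoints/7B/consolidated.00.pth",
    "llama_checkpoints/7B/params.json",
    "llama_checkpoints/tokenizer.model",
    "MiniGPT-4/alignment.txt",
    "MiniGPT-4/pretrained_minigpt4_7b.pth",
]

# Collect every expected file that is absent from DATA_DIR.
missing = [rel for rel in REQUIRED if not (DATA_DIR / rel).exists()]
for rel in missing:
    print(f"missing: {rel}")
print("layout OK" if not missing else f"{len(missing)} file(s) missing")
```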
The table below lists the prompt used for each dataset; the same prompts are shared across all multimodal models under study.
| Prompt | Dataset |
|---|---|
| Classify the main object in the image. | ImageNet1K, CIFAR10 |
| What breed is the flower in the image? | Flowers102 |
| What breed is the pet in the image? | OxfordIIITPet |
| What is written in the image? | All 12 OCR datasets |
| Question: {question}\nChoose the best answer from the following choices:\n- option#1\n- option#2\n- option#3\n | IconQA |
| Context:\n{context}\n\nQuestion: {question}\nChoose the best answer from the following choices:\n- option#1\n- option#2\n- option#3 | ScienceQA |
| Question: {question}\n\nChoose the single most likely answer from the following choices <choice>:\n- Yes\n- No\n\nThe output format follows exactly as below:\nAnswer: <choice> | MSCOCO_MCI, VCR_MCI |
| Question: Is the caption "{caption}" correctly describing the image?\n\nChoose the single most likely answer from the following choices <choice>:\n- Yes\n- No\n\nThe output format follows exactly as below:\nAnswer: <choice> | VSR |
| Use the original question from the dataset. | All other datasets |
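For the multiple-choice rows, the `{question}`, `{context}`, and option placeholders are filled per sample. The snippet below is a minimal sketch of that substitution using the ScienceQA template from the table; `build_prompt` is a hypothetical helper, not a function from the repo:

```python
# Sketch: filling a multiple-choice prompt template (ScienceQA row above).
from typing import List

TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Choose the best answer from the following choices:\n{choices}"
)

def build_prompt(context: str, question: str, options: List[str]) -> str:
    # Render each option as a "- option" bullet, as in the table.
    choices = "\n".join(f"- {opt}" for opt in options)
    return TEMPLATE.format(context=context, question=question, choices=choices)

print(build_prompt(
    "A magnet attracts some metals.",
    "Which object is magnetic?",
    ["a nail", "a leaf", "a cork"],
))
```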
```shell
python eval_tiny.py \
    --model_name $MODEL \
    --device $CUDA_DEVICE_INDEX \
    --batch-size $EVAL_BATCH_SIZE \
    --dataset-names $DATASET_NAMES \
    --sampled-root $SAMPLED_DATASET_DIR \
    --answer_path $SAVE_DIR \
    --use-sampled  # use this flag when you have prepared all the datasets in LVLM Evaluation and do not download the sampled data
# please check the names of models/datasets in models/__init__.py and task_datasets/__init__.py
```

The datasets used in Tiny LVLM Evaluation are a subset of the datasets used in LVLM Evaluation, so you can download the sampled subset here and use it directly. The script sample_dataset.py is used to sample the subsets used in Tiny LVLM Evaluation and save them.
In addition, the inference results of all 12 multimodal models studied in Tiny LVLM-eHub, including Bard, on 42 datasets are downloadable from Google Drive.
Beyond including a subset of the datasets from the Tiny LVLM Evaluation, we provide an enhanced dataset split. This split systematically categorizes the datasets featured in the Tiny LVLM Evaluation by the specific ability each one targets. We then curate the datasets matching each evaluative criterion of the LVLM models and aggregate them into ability-level subsets, excluding those related to embodied intelligence. Furthermore, this benchmark includes recently released models to bolster its comprehensiveness.
You can download the ability-level subsets here, and the inference results of all 20 multimodal models included in our benchmark can be found here.
Here is an example command for using this benchmark:
```shell
python updated_eval_tiny.py \
    --model-name $MODEL \
    --device $CUDA_DEVICE_INDEX \
    --batch-size $EVAL_BATCH_SIZE \
    --sampled-root $ROOT_DIR_OF_SAMPLED_SUBSETS \
    --answer_path $SAVE_DIR
```

For detailed performance metrics, please refer to the following tables.
| Rank | Model | Version | Score |
|---|---|---|---|
| 🏅️ | InternVL | InternVL-Chat | 327.61 |
| 🥈 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 322.51 |
| 🥉 | Bard | Bard | 319.59 |
| 4 | Qwen-VL-Chat | Qwen-VL-Chat | 316.81 |
| 5 | LLaVA-1.5 | Vicuna-7B | 307.17 |
| 6 | InstructBLIP | Vicuna-7B | 300.64 |
| 7 | InternLM-XComposer | InternLM-XComposer-7B | 288.89 |
| 8 | BLIP2 | FlanT5xl | 284.72 |
| 9 | BLIVA | Vicuna-7B | 284.17 |
| 10 | Lynx | Vicuna-7B | 279.24 |
| 11 | Cheetah | Vicuna-7B | 258.91 |
| 12 | LLaMA-Adapter-v2 | LLaMA-7B | 229.16 |
| 13 | VPGTrans | Vicuna-7B | 218.91 |
| 14 | Otter-Image | Otter-9B-LA-InContext | 216.43 |
| 15 | VisualGLM-6B | VisualGLM-6B | 211.98 |
| 16 | mPLUG-Owl | LLaMA-7B | 209.40 |
| 17 | LLaVA | Vicuna-7B | 200.93 |
| 18 | MiniGPT-4 | Vicuna-7B | 192.62 |
| 19 | Otter | Otter-9B | 180.87 |
| 20 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 176.37 |
| 21 | PandaGPT | Vicuna-7B | 174.25 |
| 22 | LaVIN | LLaMA-7B | 97.51 |
| 23 | MIC | FlanT5xl | 94.09 |
| Rank | Model | Version | Score |
|---|---|---|---|
| 🏅️ | Bard | Bard | 64.18 |
| 🥈 | Qwen-VL-Chat | Qwen-VL-Chat | 62.36 |
| 🥉 | InternVL | InternVL-Chat | 56.36 |
| 4 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 55.82 |
| 5 | LLaVA-1.5 | Vicuna-7B | 55.64 |
| 6 | Lynx | Vicuna-7B | 52.18 |
| 7 | InternLM-XComposer | InternLM-XComposer-7B | 48.00 |
| 8 | InstructBLIP | Vicuna-7B | 46.73 |
| 9 | BLIP2 | FlanT5xl | 44.91 |
| 10 | LLaVA | Vicuna-7B | 44.36 |
| 11 | LLaMA-Adapter-v2 | LLaMA-7B | 43.45 |
| 12 | Otter-Image | Otter-9B-LA-InContext | 41.64 |
| 13 | mPLUG-Owl | LLaMA-7B | 40.91 |
| 14 | Cheetah | Vicuna-7B | 40.00 |
| 15 | BLIVA | Vicuna-7B | 38.73 |
| 16 | MiniGPT-4 | Vicuna-7B | 37.64 |
| 17 | VisualGLM-6B | VisualGLM-6B | 37.27 |
| 18 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 33.64 |
| 19 | PandaGPT | Vicuna-7B | 33.45 |
| 20 | Otter | Otter-9B | 29.82 |
| 21 | VPGTrans | Vicuna-7B | 27.27 |
| 22 | LaVIN | LLaMA-7B | 20.36 |
| 23 | MIC | FlanT5xl | 11.09 |
| Rank | Model | Version | Score |
|---|---|---|---|
| 🏅️ | Lynx | Vicuna-7B | 65.75 |
| 🥈 | Bard | Bard | 57.00 |
| 🥉 | InternLM-XComposer | InternLM-XComposer-7B | 56.25 |
| 4 | Qwen-VL-Chat | Qwen-VL-Chat | 54.50 |
| 5 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 53.75 |
| 6 | InternVL | InternVL-Chat | 52.25 |
| 7 | BLIP2 | FlanT5xl | 49.00 |
| 8 | LLaVA-1.5 | Vicuna-7B | 49.00 |
| 9 | InstructBLIP | Vicuna-7B | 48.00 |
| 10 | BLIVA | Vicuna-7B | 46.75 |
| 11 | LLaMA-Adapter-v2 | LLaMA-7B | 46.75 |
| 12 | Cheetah | Vicuna-7B | 43.25 |
| 13 | mPLUG-Owl | LLaMA-7B | 40.75 |
| 14 | MiniGPT-4 | Vicuna-7B | 37.75 |
| 15 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 37.25 |
| 16 | Otter | Otter-9B | 37.00 |
| 17 | LLaVA | Vicuna-7B | 36.50 |
| 18 | VisualGLM-6B | VisualGLM-6B | 36.25 |
| 19 | Otter-Image | Otter-9B-LA-InContext | 33.25 |
| 20 | PandaGPT | Vicuna-7B | 33.00 |
| 21 | VPGTrans | Vicuna-7B | 31.25 |
| 22 | LaVIN | LLaMA-7B | 20.00 |
| 23 | MIC | FlanT5xl | 0.75 |
| Rank | Model | Version | Score |
|---|---|---|---|
| 🏅️ | Bard | Bard | 68.14 |
| 🥈 | InternVL | InternVL-Chat | 68.00 |
| 🥉 | InternLM-XComposer | InternLM-XComposer-7B | 66.57 |
| 4 | BLIP2 | FlanT5xl | 64.14 |
| 5 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 64.14 |
| 6 | BLIVA | Vicuna-7B | 63.43 |
| 7 | InstructBLIP | Vicuna-7B | 61.71 |
| 8 | LLaVA-1.5 | Vicuna-7B | 57.00 |
| 9 | Qwen-VL-Chat | Qwen-VL-Chat | 55.14 |
| 10 | VPGTrans | Vicuna-7B | 49.86 |
| 11 | VisualGLM-6B | VisualGLM-6B | 46.86 |
| 12 | Cheetah | Vicuna-7B | 46.86 |
| 13 | LLaMA-Adapter-v2 | LLaMA-7B | 22.29 |
| 14 | LLaVA | Vicuna-7B | 18.00 |
| 15 | MiniGPT-4 | Vicuna-7B | 17.57 |
| 16 | Lynx | Vicuna-7B | 17.57 |
| 17 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 17.29 |
| 18 | mPLUG-Owl | LLaMA-7B | 16.14 |
| 19 | Otter-Image | Otter-9B-LA-InContext | 15.14 |
| 20 | Otter | Otter-9B | 12.71 |
| 21 | MIC | FlanT5xl | 7.71 |
| 22 | PandaGPT | Vicuna-7B | 3.00 |
| 23 | LaVIN | LLaMA-7B | 2.14 |
| Rank | Model | Version | Score |
|---|---|---|---|
| 🏅️ | InternVL | InternVL-Chat | 62.00 |
| 🥈 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 61.80 |
| 🥉 | Bard | Bard | 59.60 |
| 4 | InstructBLIP | Vicuna-7B | 59.20 |
| 5 | BLIVA | Vicuna-7B | 58.60 |
| 6 | Lynx | Vicuna-7B | 57.40 |
| 7 | LLaVA-1.5 | Vicuna-7B | 57.20 |
| 8 | LLaMA-Adapter-v2 | LLaMA-7B | 56.00 |
| 9 | Qwen-VL-Chat | Qwen-VL-Chat | 54.80 |
| 10 | Otter-Image | Otter-9B-LA-InContext | 52.40 |
| 11 | Cheetah | Vicuna-7B | 51.80 |
| 12 | PandaGPT | Vicuna-7B | 51.80 |
| 13 | mPLUG-Owl | LLaMA-7B | 50.60 |
| 14 | InternLM-XComposer | InternLM-XComposer-7B | 50.40 |
| 15 | MiniGPT-4 | Vicuna-7B | 49.00 |
| 16 | VPGTrans | Vicuna-7B | 48.20 |
| 17 | Otter | Otter-9B | 48.00 |
| 18 | LLaVA | Vicuna-7B | 47.40 |
| 19 | BLIP2 | FlanT5xl | 44.00 |
| 20 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 39.20 |
| 21 | VisualGLM-6B | VisualGLM-6B | 37.60 |
| 22 | LaVIN | LLaMA-7B | 35.00 |
| 23 | MIC | FlanT5xl | 24.20 |
| Rank | Model | Version | Score |
|---|---|---|---|
| 🏅️ | Qwen-VL-Chat | Qwen-VL-Chat | 90.00 |
| 🥈 | InternVL | InternVL-Chat | 89.00 |
| 🥉 | LLaVA-1.5 | Vicuna-7B | 88.33 |
| 4 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 87.00 |
| 5 | Lynx | Vicuna-7B | 86.33 |
| 6 | InstructBLIP | Vicuna-7B | 85.00 |
| 7 | BLIP2 | FlanT5xl | 82.67 |
| 8 | Cheetah | Vicuna-7B | 77.00 |
| 9 | BLIVA | Vicuna-7B | 76.67 |
| 10 | Otter-Image | Otter-9B-LA-InContext | 74.00 |
| 11 | Bard | Bard | 70.67 |
| 12 | InternLM-XComposer | InternLM-XComposer-7B | 67.67 |
| 13 | VPGTrans | Vicuna-7B | 62.33 |
| 14 | mPLUG-Owl | LLaMA-7B | 61.00 |
| 15 | LLaMA-Adapter-v2 | LLaMA-7B | 60.67 |
| 16 | LLaVA | Vicuna-7B | 54.67 |
| 17 | VisualGLM-6B | VisualGLM-6B | 54.00 |
| 18 | Otter | Otter-9B | 53.33 |
| 19 | PandaGPT | Vicuna-7B | 53.00 |
| 20 | MiniGPT-4 | Vicuna-7B | 50.67 |
| 21 | MIC | FlanT5xl | 50.33 |
| 22 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 49.00 |
| 23 | LaVIN | LLaMA-7B | 20.00 |