# Tiny LVLM Evaluation

## Environments

```shell
conda create -n lvlm_eval python=3.8 -y
conda activate lvlm_eval
pip install -r requirements.txt
```

## Model checkpoints

Most weights and checkpoint files are downloaded automatically when the corresponding testers are initialized. However, a few files must be downloaded manually and placed in a single directory. Then set the variable DATA_DIR in models/__init__.py to that directory. The downloaded files should be organized as follows:

```
/path/to/DATA_DIR
├── llama_checkpoints
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   └── tokenizer.model
├── MiniGPT-4
│   ├── alignment.txt
│   └── pretrained_minigpt4_7b.pth
├── VPGTrans_Vicuna
├── otter-9b-hf
└── PandaGPT
    ├── imagebind_ckpt
    ├── vicuna_ckpt
    └── pandagpt_ckpt
```

- For LLaMA-Adapter-v2, please obtain the LLaMA backbone weights using this form.
- For MiniGPT-4, please download alignment.txt and pretrained_minigpt4_7b.pth.
- For VPGTrans, please download VPGTrans_Vicuna.
- For Otter, you can download the version we used in our evaluation from this repo. Note, however, that the authors of Otter have since updated their model, which outperforms the version used in our evaluation; please check their GitHub repo for the latest version.
- For PandaGPT, please follow the instructions here to prepare the weights of ImageBind, Vicuna, and PandaGPT.
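If you prefer to lay out the directory before downloading, the expected skeleton can be created up front. This is a minimal sketch of our own (the helper below is not part of the repo), matching the tree shown above:

```python
# Hypothetical helper (not part of this repo): create the empty DATA_DIR
# skeleton matching the tree shown above. Checkpoint files must still be
# downloaded separately and dropped into these folders.
from pathlib import Path

def make_data_dir_skeleton(root: str) -> None:
    subdirs = [
        "llama_checkpoints/7B",
        "MiniGPT-4",
        "VPGTrans_Vicuna",
        "otter-9b-hf",
        "PandaGPT/imagebind_ckpt",
        "PandaGPT/vicuna_ckpt",
        "PandaGPT/pandagpt_ckpt",
    ]
    for sub in subdirs:
        # parents=True creates intermediate dirs; exist_ok makes reruns safe
        Path(root, sub).mkdir(parents=True, exist_ok=True)

make_data_dir_skeleton("/tmp/DATA_DIR_demo")
```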

## Prompt Engineering

The table below lists the prompt used for each dataset; the same prompts are applied to all multimodal models under study.

| Prompt | Dataset |
| :--- | :--- |
| Classify the main object in the image. | ImageNet1K, CIFAR10 |
| What breed is the flower in the image? | Flowers102 |
| What breed is the pet in the image? | OxfordIIITPet |
| What is written in the image? | All 12 OCR datasets |
| Question: {question}\nChoose the best answer from the following choices:\n- option#1\n- option#2\n- option#3\n | IconQA |
| Context:\n{context}\n\nQuestion: {question}\nChoose the best answer from the following choices:\n- option#1\n- option#2\n- option#3 | ScienceQA |
| Question: {question}\n\nChoose the single most likely answer from the following choices <choice>:\n- Yes\n- No\n\nThe output format follows exactly as below:\nAnswer: <choice> | MSCOCO_MCI, VCR_MCI |
| Question: Is the caption "{caption}" correctly describing the image?\n\nChoose the single most likely answer from the following choices <choice>:\n- Yes\n- No\n\nThe output format follows exactly as below:\nAnswer: <choice> | VSR |
| use original questions | other datasets |
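The placeholders in these templates ({question}, {context}, {caption}, option#k) are filled in per sample. As an illustration only (this is not the repo's actual code, and the caption is made up), the VSR template above could be instantiated with Python's str.format:

```python
# Illustrative only: fill the VSR prompt template from the table above.
VSR_TEMPLATE = (
    'Question: Is the caption "{caption}" correctly describing the image?\n\n'
    "Choose the single most likely answer from the following choices <choice>:\n"
    "- Yes\n"
    "- No\n\n"
    "The output format follows exactly as below:\n"
    "Answer: <choice>"
)

# The caption here is a made-up example, not from any dataset.
prompt = VSR_TEMPLATE.format(caption="The cat is under the table")
print(prompt)
```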

## Datasets and Evaluation

```shell
# Valid model and dataset names are listed in models/__init__.py and
# task_datasets/__init__.py.
# Pass --use-sampled when you have already prepared all the datasets from
# LVLM Evaluation and do not want to download the sampled data.
python eval_tiny.py \
    --model_name $MODEL \
    --device $CUDA_DEVICE_INDEX \
    --batch-size $EVAL_BATCH_SIZE \
    --dataset-names $DATASET_NAMES \
    --sampled-root $SAMPLED_DATASET_DIR \
    --answer_path $SAVE_DIR \
    --use-sampled
```

The datasets used in Tiny LVLM Evaluation are subsets of the datasets used in LVLM Evaluation. Therefore, you can download the sampled subset here and use it directly. The script sample_dataset.py samples the subsets used in Tiny LVLM Evaluation and saves them.

In addition, the inference results on all 42 datasets for the 12 multimodal models studied in Tiny LVLM-eHub, including Bard, can be downloaded from Google Drive.

## Ability-level Benchmark

Beyond including part of the data from the Tiny LVLM Evaluation, we present an improved dataset split that categorizes the datasets featured in the Tiny LVLM Evaluation by the specific ability each one targets. For each ability, we curate a subset of datasets that match the evaluation criteria and aggregate these subsets into an ability-level subset, excluding those related to embodied intelligence. The benchmark also covers recently released models for broader comprehensiveness.

You can download the ability-level subset here, and the inference results of all 20 multimodal models included in our benchmark can be found here.

Here is an example command for using this benchmark:

```shell
python updated_eval_tiny.py \
    --model-name $MODEL \
    --device $CUDA_DEVICE_INDEX \
    --batch-size $EVAL_BATCH_SIZE \
    --sampled-root $ROOT_DIR_OF_SAMPLED_SUBSETS \
    --answer_path $SAVE_DIR
```

For detailed performance metrics, please refer to the following tables.

### Overall Score

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | InternVL | InternVL-Chat | 327.61 |
| 🥈 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 322.51 |
| 🥉 | Bard | Bard | 319.59 |
| 4 | Qwen-VL-Chat | Qwen-VL-Chat | 316.81 |
| 5 | LLaVA-1.5 | Vicuna-7B | 307.17 |
| 6 | InstructBLIP | Vicuna-7B | 300.64 |
| 7 | InternLM-XComposer | InternLM-XComposer-7B | 288.89 |
| 8 | BLIP2 | FlanT5xl | 284.72 |
| 9 | BLIVA | Vicuna-7B | 284.17 |
| 10 | Lynx | Vicuna-7B | 279.24 |
| 11 | Cheetah | Vicuna-7B | 258.91 |
| 12 | LLaMA-Adapter-v2 | LLaMA-7B | 229.16 |
| 13 | VPGTrans | Vicuna-7B | 218.91 |
| 14 | Otter-Image | Otter-9B-LA-InContext | 216.43 |
| 15 | VisualGLM-6B | VisualGLM-6B | 211.98 |
| 16 | mPLUG-Owl | LLaMA-7B | 209.40 |
| 17 | LLaVA | Vicuna-7B | 200.93 |
| 18 | MiniGPT-4 | Vicuna-7B | 192.62 |
| 19 | Otter | Otter-9B | 180.87 |
| 20 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 176.37 |
| 21 | PandaGPT | Vicuna-7B | 174.25 |
| 22 | LaVIN | LLaMA-7B | 97.51 |
| 23 | MIC | FlanT5xl | 94.09 |

### Visual Reasoning

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Bard | Bard | 64.18 |
| 🥈 | Qwen-VL-Chat | Qwen-VL-Chat | 62.36 |
| 🥉 | InternVL | InternVL-Chat | 56.36 |
| 4 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 55.82 |
| 5 | LLaVA-1.5 | Vicuna-7B | 55.64 |
| 6 | Lynx | Vicuna-7B | 52.18 |
| 7 | InternLM-XComposer | InternLM-XComposer-7B | 48.00 |
| 8 | InstructBLIP | Vicuna-7B | 46.73 |
| 9 | BLIP2 | FlanT5xl | 44.91 |
| 10 | LLaVA | Vicuna-7B | 44.36 |
| 11 | LLaMA-Adapter-v2 | LLaMA-7B | 43.45 |
| 12 | Otter-Image | Otter-9B-LA-InContext | 41.64 |
| 13 | mPLUG-Owl | LLaMA-7B | 40.91 |
| 14 | Cheetah | Vicuna-7B | 40.00 |
| 15 | BLIVA | Vicuna-7B | 38.73 |
| 16 | MiniGPT-4 | Vicuna-7B | 37.64 |
| 17 | VisualGLM-6B | VisualGLM-6B | 37.27 |
| 18 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 33.64 |
| 19 | PandaGPT | Vicuna-7B | 33.45 |
| 20 | Otter | Otter-9B | 29.82 |
| 21 | VPGTrans | Vicuna-7B | 27.27 |
| 22 | LaVIN | LLaMA-7B | 20.36 |
| 23 | MIC | FlanT5xl | 11.09 |

### Visual Perception

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Lynx | Vicuna-7B | 65.75 |
| 🥈 | Bard | Bard | 57.00 |
| 🥉 | InternLM-XComposer | InternLM-XComposer-7B | 56.25 |
| 4 | Qwen-VL-Chat | Qwen-VL-Chat | 54.50 |
| 5 | InternVL | InternVL-Chat | 52.25 |
| 6 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 53.75 |
| 7 | BLIP2 | FlanT5xl | 49.00 |
| 8 | LLaVA-1.5 | Vicuna-7B | 49.00 |
| 9 | InstructBLIP | Vicuna-7B | 48.00 |
| 10 | BLIVA | Vicuna-7B | 46.75 |
| 11 | LLaMA-Adapter-v2 | LLaMA-7B | 46.75 |
| 12 | Cheetah | Vicuna-7B | 43.25 |
| 13 | mPLUG-Owl | LLaMA-7B | 40.75 |
| 14 | MiniGPT-4 | Vicuna-7B | 37.75 |
| 15 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 37.25 |
| 16 | Otter | Otter-9B | 37.00 |
| 17 | LLaVA | Vicuna-7B | 36.50 |
| 18 | VisualGLM-6B | VisualGLM-6B | 36.25 |
| 19 | Otter-Image | Otter-9B-LA-InContext | 33.25 |
| 20 | PandaGPT | Vicuna-7B | 33.00 |
| 21 | VPGTrans | Vicuna-7B | 31.25 |
| 22 | LaVIN | LLaMA-7B | 20.00 |
| 23 | MIC | FlanT5xl | 0.75 |

### Visual Knowledge Acquisition

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Bard | Bard | 68.14 |
| 🥈 | InternVL | InternVL-Chat | 68.00 |
| 🥉 | InternLM-XComposer | InternLM-XComposer-7B | 66.57 |
| 4 | BLIP2 | FlanT5xl | 64.14 |
| 5 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 64.14 |
| 6 | BLIVA | Vicuna-7B | 63.43 |
| 7 | InstructBLIP | Vicuna-7B | 61.71 |
| 8 | LLaVA-1.5 | Vicuna-7B | 57.00 |
| 9 | Qwen-VL-Chat | Qwen-VL-Chat | 55.14 |
| 10 | VPGTrans | Vicuna-7B | 49.86 |
| 11 | VisualGLM-6B | VisualGLM-6B | 46.86 |
| 12 | Cheetah | Vicuna-7B | 46.86 |
| 13 | LLaMA-Adapter-v2 | LLaMA-7B | 22.29 |
| 14 | LLaVA | Vicuna-7B | 18.00 |
| 15 | MiniGPT-4 | Vicuna-7B | 17.57 |
| 16 | Lynx | Vicuna-7B | 17.57 |
| 17 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 17.29 |
| 18 | mPLUG-Owl | LLaMA-7B | 16.14 |
| 19 | Otter-Image | Otter-9B-LA-InContext | 15.14 |
| 20 | Otter | Otter-9B | 12.71 |
| 21 | MIC | FlanT5xl | 7.71 |
| 22 | PandaGPT | Vicuna-7B | 3.00 |
| 23 | LaVIN | LLaMA-7B | 2.14 |

### Visual Commonsense

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | InternVL | InternVL-Chat | 62.00 |
| 🥈 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 61.80 |
| 🥉 | Bard | Bard | 59.60 |
| 4 | InstructBLIP | Vicuna-7B | 59.20 |
| 5 | BLIVA | Vicuna-7B | 58.60 |
| 6 | Lynx | Vicuna-7B | 57.40 |
| 7 | LLaVA-1.5 | Vicuna-7B | 57.20 |
| 8 | LLaMA-Adapter-v2 | LLaMA-7B | 56.00 |
| 9 | Qwen-VL-Chat | Qwen-VL-Chat | 54.80 |
| 10 | Otter-Image | Otter-9B-LA-InContext | 52.40 |
| 11 | Cheetah | Vicuna-7B | 51.80 |
| 12 | PandaGPT | Vicuna-7B | 51.80 |
| 13 | mPLUG-Owl | LLaMA-7B | 50.60 |
| 14 | InternLM-XComposer | InternLM-XComposer-7B | 50.40 |
| 15 | MiniGPT-4 | Vicuna-7B | 49.00 |
| 16 | VPGTrans | Vicuna-7B | 48.20 |
| 17 | Otter | Otter-9B | 48.00 |
| 18 | LLaVA | Vicuna-7B | 47.40 |
| 19 | BLIP2 | FlanT5xl | 44.00 |
| 20 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 39.20 |
| 21 | VisualGLM-6B | VisualGLM-6B | 37.60 |
| 22 | LaVIN | LLaMA-7B | 35.00 |
| 23 | MIC | FlanT5xl | 24.20 |

### Object Hallucination

| Rank | Model | Version | Score |
| :---: | :--- | :--- | ---: |
| 🏅️ | Qwen-VL-Chat | Qwen-VL-Chat | 90.00 |
| 🥈 | InternVL | InternVL-Chat | 89.00 |
| 🥉 | LLaVA-1.5 | Vicuna-7B | 88.33 |
| 4 | InternLM-XComposer-VL | InternLM-XComposer-VL-7B | 87.00 |
| 5 | Lynx | Vicuna-7B | 86.33 |
| 6 | InstructBLIP | Vicuna-7B | 85.00 |
| 7 | BLIP2 | FlanT5xl | 82.67 |
| 8 | Cheetah | Vicuna-7B | 77.00 |
| 9 | BLIVA | Vicuna-7B | 76.67 |
| 10 | Otter-Image | Otter-9B-LA-InContext | 74.00 |
| 11 | Bard | Bard | 70.67 |
| 12 | InternLM-XComposer | InternLM-XComposer-7B | 67.67 |
| 13 | VPGTrans | Vicuna-7B | 62.33 |
| 14 | mPLUG-Owl | LLaMA-7B | 61.00 |
| 15 | LLaMA-Adapter-v2 | LLaMA-7B | 60.67 |
| 16 | LLaVA | Vicuna-7B | 54.67 |
| 17 | VisualGLM-6B | VisualGLM-6B | 54.00 |
| 18 | Otter | Otter-9B | 53.33 |
| 19 | PandaGPT | Vicuna-7B | 53.00 |
| 20 | MiniGPT-4 | Vicuna-7B | 50.67 |
| 21 | MIC | FlanT5xl | 50.33 |
| 22 | OFv2_4BI | RedPajama-INCITE-Instruct-3B-v1 | 49.00 |
| 23 | LaVIN | LLaMA-7B | 20.00 |
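As a sanity check on the numbers above: a model's overall score appears to equal the sum of its five ability-level scores (this is our observation from the tables, not an officially stated formula). For example:

```python
# Observation from the tables above (not an official formula): the overall
# score equals the sum of the five ability-level scores.
# Tuple order: (reasoning, perception, knowledge acquisition,
#               commonsense, hallucination)
ability_scores = {
    "InternVL": (56.36, 52.25, 68.00, 62.00, 89.00),
    "Bard": (64.18, 57.00, 68.14, 59.60, 70.67),
}
overall = {"InternVL": 327.61, "Bard": 319.59}

for model, scores in ability_scores.items():
    total = round(sum(scores), 2)
    print(f"{model}: {total}")  # matches the Overall Score table
    assert total == overall[model]
```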