VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification
TODO
- Publishing our paper "VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification"
- Publishing the complete code for our VASparse.
- The code is continuously being updated.
Run the following commands to install the required packages:
conda env create -f environment.yml
conda activate vasparse
We employ Grounding DINO as the external detector to ground hallucinatory objects. We have simplified the installation of GroundingDINO with CUDA support; simply run:
export CUDA_HOME=$CONDA_PREFIX
# install GroundingDINO
cd decoder_zoo/GroundingDINO
pip install -e .
To download pre-trained model weights for DINO:
# default directory that contains the weights
mkdir model_checkpoints
cd model_checkpoints
# download weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
The following evaluation requires the MSCOCO 2014 dataset. Please download it here and extract it to your data path.
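If the dataset is not available locally, the following is a minimal sketch for fetching the val2014 images and the 2014 annotations from the official COCO server (the target directory is only an example; adjust it to your data path):
wget http://images.cocodataset.org/zips/val2014.zip
unzip -q val2014.zip                                # creates val2014/ with the images
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip -q annotations_trainval2014.zip -d val2014/   # creates val2014/annotations/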
Specify the paths of the pre-trained model weights in the corresponding configuration files (the bracketed number indicates the line to edit in each file):
- LLaVA-1.5 7B → eval_configs/llava-1.5_eval.yaml [L14]
- LLaMA-2 7B → minigpt4/configs/models/minigpt4_llama2.yaml [L15]
- Vicuna 7B v1.1 → minigpt4/configs/models/blip2_instruct_vicuna7b.yaml [L25]
- Vicuna 7B v0 → minigpt4/configs/models/minigpt4_vicuna0.yaml [L18]
- MiniGPT-4 → eval_configs/minigpt4_eval.yaml [L8]
- MiniGPT-4 LLaMA-2 → eval_configs/minigpt4_llama2_eval.yaml [L8]
- mPLUG-Owl2 → eval_configs/mplug-owl2_eval.yaml [L14]
The main arguments shared by the evaluation scripts are:
- --model: MLLM model type [instructblip, minigpt4, llava-1.5] (default: None)
- --data-path: Dataset path (e.g., COCO_2014/val2014/)
- --pope-type: POPE evaluation type [random, popular, adversarial]
- --beam: Beam size for global search (default: 1)
We present some example configuration parameters for max generated tokens = 64 (a sample command combining them is sketched after the lists below):
- --mask_rate: Visual token masking ratio (default: 0.5)
- --contrastive_rate: Contrastive learning rate (default: 0.1)
- --sparse_kv_cache_rate: Sparsity ratio for KV cache (default: 0.9)
- --max_sentence_lenght: Maximum length of generated sentence (default: 32)
- --k-candidate-num: Number of generative focal fields for local search (default: 4)
- --expand-ratio: Growing factor of focal fields (default: 0.6)
- --detector: Detector type [dino, owlv2] (default: dino)
- --box_threshold: Threshold for bounding box in GroundingDINO (default: 0.4)
- --scale_factor: Scale factor for self-attention weights (default: 50)
- --threshold: Threshold for attending retrospection (default: 15)
- --num_attn_candidates: Number of candidates per beam (default: 5)
- --penalty_weights: Weight of penalty term in decoding (default: 1)
- --cd-alpha: Amplification factor (default: 1)
- --cd-beta: Truncation factor for adaptive plausibility constraint (default: 0.1)
- --noise-step: Number of steps to add diffusion noise (default: 500)
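As a sketch, the parameters above might be combined with the captioning script from the next section as follows; the values are simply the listed defaults, and whether every flag is accepted by a particular run script should be verified against its argument parser:
python3 run_scripts/caption_generation_patch_vasparse.py \
--model llava-1.5 \
--data_path path/to/val2014/ \
-d vasparse_contrastive \
--max_new_tokens 64 \
--beam 1 \
--mask_rate 0.5 \
--contrastive_rate 0.1 \
--sparse_kv_cache_rate 0.9 \
--max_sentence_lenght 32 \
--detector dino \
--box_threshold 0.4 \
--gpu-id 0 \
--output_dir path/to/output/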
Following Evaluating Object Hallucination in Large Vision-Language Models and HALC, we used "Please describe this image in detail." as the prompt to query the LVLM for captions of the 500
images randomly sampled from the COCO 2014 val dataset. Under the root directory, run
python3 run_scripts/caption_generation_patch_vasparse.py \
--model llava-1.5 \
--data_path path/to/val2014/ \
-d vasparse_contrastive \
--max_new_tokens 64 \
--num_samples 500 \
--seed 1 \
--gpu-id 0 \
--output_dir path/to/output/ \
--debugging 1
MiniGPT-4 and mPLUG-Owl2 can be executed using the following commands:
python3 run_scripts/caption_generation_patch_vasparse_minigpt4.py \
--model minigpt4 \
--data_path path/to/val2014/ \
-d vasparse_contrastive \
--max_new_tokens 64 \
--num_samples 500 \
--seed 1 \
--gpu-id 0 \
--output_dir path/to/output/ \
--debugging 1
python3 run_scripts/caption_generation_patch_vasparse_mplugowl2.py \
--model mplug-owl2 \
--data_path path/to/val2014/ \
-d vasparse_contrastive \
--max_new_tokens 64 \
--num_samples 500 \
--seed 1 \
--gpu-id 0 \
--output_dir path/to/output/ \
--debugging 1
Additionally, our codebase supports various other methods, including DoLa, VCD, Opera, HALC, and more.
To collect samples for the conventional POPE evaluation, under the root directory, run
python run_scripts/pope_eval.py \
--model [LVLM Backbone] \
--data_path [COCO_DIR] \
-d [Decoding Strategy] \
--pope_type [random/popular/adversarial] \
--num_images 100 \
--seed [SEED] \
--gpu_id [GPU_IDs] \
--output_dir path/to/output/
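For example, a filled-in version of the template above using the model and decoding options listed earlier (paths and seed are placeholders):
python run_scripts/pope_eval.py \
--model llava-1.5 \
--data_path path/to/val2014/ \
-d vasparse_contrastive \
--pope_type random \
--num_images 100 \
--seed 1 \
--gpu_id 0 \
--output_dir path/to/output/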
MME also follows the same procedure as CHAIR and OPOPE to collect samples. Alternatively, under the root directory, run
python run_scripts/mme_eval.py \
--model [LVLM Backbone] \
--data_path [MME_DIR] \
-d [Decoding Strategy] \
--num_samples 30 \
--seed [SEED] \
--gpu-id [GPU_IDs] \
--output_dir path/to/output/
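For example, with the same placeholder conventions (the MME directory path is an assumption; point it at your local copy of the MME benchmark):
python run_scripts/mme_eval.py \
--model llava-1.5 \
--data_path path/to/MME_Benchmark/ \
-d vasparse_contrastive \
--num_samples 30 \
--seed 1 \
--gpu-id 0 \
--output_dir path/to/output/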
Under the root directory, run
python run_scripts/reviser_eval.py \
-r [woodpecker/lure] \
--data_path [COCO_DIR] \
--c [PATH_TO_CAPTION] \
--seed [SEED] \
--gpu-id [GPU_IDs] \
--output_dir path/to/output/
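For example, to run the woodpecker reviser on a previously generated caption file (the caption path is a placeholder for the file produced by the captioning step):
python run_scripts/reviser_eval.py \
-r woodpecker \
--data_path path/to/val2014/ \
--c path/to/generated_captions.json \
--seed 1 \
--gpu-id 0 \
--output_dir path/to/output/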
After preparing your caption files using the above commands, you can either evaluate the captions in one-shot mode (a single caption file) or in batch mode (all the caption files in a folder). To evaluate a single caption file, run
python eval/eval_hallucination.py --metric chair --chair_input_path [PATH_TO_CAPTION_DIR] -v
To evaluate a batch of caption files, run
python eval/caption_to_chair.py -c [PATH_TO_CAPTION_FOLDER_DIR]
to convert the caption files to the format ready for CHAIR evaluation in the same directory first. Then a _chair.json
file will be produced under this folder. To further evaluate the CHAIR score as well as the generation quality scores, run
python eval/batch_eval.py -c [PATH_TO_CAPTION_FOLDER_DIR] --evaluator chair --coco_path [COCO_DIR]
Note that [COCO_DIR]
is expected to contain both the images and the annotation files within an annotations
subfolder. In other words, [COCO_DIR]
should have the following structure:
COCO_DIR (val2014 for example)
- annotations
- captions_train2014.json
- captions_val2014.json
- instances_train2014.json
- instances_val2014.json
- person_keypoints_train2014.json
- person_keypoints_val2014.json
- COCO_val2014_000000000042.jpg
- COCO_val2014_000000000073.jpg
...
Similarly, you can also evaluate POPE in both modes. To evaluate a single caption file,
python eval_hallucination.py --metric pope --pope_answer_path [PATH_TO_CAPTION_DIR] --pope_question_path [PATH_TO_POPE_QUESTION] -v
To evaluate a batch of caption files, run
python eval/batch_eval.py -c [PATH_TO_CAPTION_FOLDER_DIR] --evaluator pope --pope_type [random/popular/adversarial]
The evaluation results will be saved in the same directory.
To evaluate the MME scores on each chosen subset, modify the subset_dir
variable in eval/MME_eval.py to list your target subset directories, and run
python eval/MME_eval.py
SHR (Sentence-level Hallucination Ratio) is a fine-grained, diverse, and accurate evaluation benchmark of LVLM hallucination on dense image description. Please refer to the SHR evaluation protocol for assessment.
This codebase is built upon the following repositories: HALC, OPERA, VCD, SID, FastV, SparseVLM, HA-DPO, MME, Grounding DINO, LLaVA, etc.
We thank all the authors for their valuable contributions.