Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models [Paper]

Overview


Diagram of Self-Introspective Decoding.

Abstract: Hallucination remains a significant challenge in Large Vision-Language Models (LVLMs). To alleviate this issue, some methods, known as contrastive decoding, induce hallucinations by manually disturbing the raw vision or instruction inputs and then mitigate them by contrasting the outputs of the original and disturbed LVLMs. However, these holistic input disturbances can introduce noise and also double the inference cost. To tackle these issues, we propose a simple yet effective method named Self-Introspective Decoding (SID). Our empirical investigations reveal that pre-trained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. Leveraging this insight, we develop the Context and Text-aware Token Selection (CT2S) strategy, which preserves only the least important vision tokens after the early decoder layers, thereby adaptively amplifying vision-and-text association hallucinations during auto-regressive decoding. This strategy ensures that the multimodal knowledge absorbed in the early decoder layers induces multimodal contextual hallucinations rather than aimless ones, and it significantly reduces the computational burden. Subsequently, the amplified fine-grained hallucinations are subtracted from the original token logits, effectively alleviating hallucinations without compromising the LVLMs' general ability. Extensive experiments show that SID generates higher-quality texts with fewer hallucinations across various metrics, without much additional computation cost.
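The subtraction step described in the abstract can be illustrated with a small, self-contained sketch. It is not the repository's implementation: the function names, the `alpha` contrast weight, and the `(1 + alpha) * original - alpha * amplified` form (a weighting commonly used in contrastive decoding) are assumptions made for illustration, and the logits are toy numbers rather than model outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_logits(original, amplified, alpha=1.0):
    """Contrast two next-token distributions over the same vocabulary.

    original  -- logits from the unmodified forward pass
    amplified -- logits from the hallucination-amplified pass
    alpha     -- contrast strength (hypothetical parameter of this sketch)
    """
    if len(original) != len(amplified):
        raise ValueError("both logit vectors must cover the same vocabulary")
    log_p_orig = log_softmax(original)
    log_p_amp = log_softmax(amplified)
    # Tokens that the amplified pass favors are pushed down.
    return [(1 + alpha) * o - alpha * a for o, a in zip(log_p_orig, log_p_amp)]

# Toy vocabulary of three tokens (made-up numbers).
original = [1.9, 2.0, 0.1]    # token 1 narrowly wins under plain greedy decoding
amplified = [1.0, 3.0, 0.1]   # token 1 is strongly favored by the amplified pass
scores = contrastive_logits(original, amplified, alpha=1.0)

greedy_choice = max(range(len(original)), key=original.__getitem__)
contrastive_choice = max(range(len(scores)), key=scores.__getitem__)
print(greedy_choice, contrastive_choice)
```

In this toy case the token that the amplified pass prefers loses its narrow lead after the contrast, which is the intended effect of subtracting the amplified hallucination signal.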

Self-Introspective Mechanism of pre-trained LVLMs. The retained vision tokens mainly focus on spuriously related regions, as informed by the vision and text (both instruction and generated) tokens.
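As a rough illustration of the token-retention idea, the sketch below keeps only the lowest-scoring vision tokens given a list of per-token importance scores. How the scores are obtained inside the model is outside this sketch; the function name, the `keep` parameter, and the example scores are all hypothetical.

```python
def select_least_important(importance, keep):
    """Return indices of the `keep` lowest-scoring vision tokens, in original order.

    importance -- one score per vision token (higher means more important)
    keep       -- number of tokens to retain (hypothetical parameter)
    """
    if not 0 <= keep <= len(importance):
        raise ValueError("keep must be between 0 and the number of tokens")
    # Rank token indices from least to most important; sorted() is stable,
    # so ties are broken by position.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    # Restore positional order so the retained tokens keep their sequence order.
    return sorted(ranked[:keep])

# Five vision tokens with made-up importance scores.
importance = [0.30, 0.05, 0.20, 0.01, 0.44]
retained = select_least_important(importance, keep=2)
print(retained)  # [1, 3]
```

Returning the indices in positional order keeps the retained tokens aligned with their original sequence positions.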

Setup

Because SID is a decoding strategy for LVLMs, the most convenient way to use it is to install our modified transformers package.

conda env create -f environment.yml
conda activate SID
python -m pip install -e transformers

Implementation

After setting up the environment, you can use our code base directly to apply three decoding-based hallucination alleviation methods for LVLMs, namely Vision Contrastive Decoding (VCD), Instruction Contrastive Decoding (ICD), and OPERA, as well as our SID:

python pope_eval.py --pope-type coco_adversarial --model llava-1.5  --use-cd  --use-fast-v  --sample  --sample-greedy  #SID_greedy

python pope_eval.py --pope-type coco_adversarial --model llava-1.5  --use-vcd  --sample  --sample-greedy  #VCD_greedy

python pope_eval.py --pope-type coco_adversarial --model llava-1.5  --use-icd  --sample  --sample-greedy  #ICD_greedy

python pope_eval.py --pope-type coco_adversarial --model llava-1.5  --beam 5  #Beam Search

python pope_eval.py --pope-type coco_adversarial --model llava-1.5  --beam 5  --opera #OPERA

The CHAIR metric utilizes the same configuration.

Evaluation

We provide scripts for several evaluation metrics: GPT-4V (eval_utils/gpt4v_eval.py), GPT-4 (shr_eval.py), POPE (pope_eval.py), and CHAIR (eval_utils/chair_eval.py).
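POPE poses yes/no questions about object presence, so its scores reduce to standard binary-classification metrics with "yes" as the positive class. The sketch below shows that computation on made-up answers. It is an illustration only and not the logic of pope_eval.py; the function name and the returned keys are assumptions.

```python
def pope_style_metrics(predictions, labels):
    """Score yes/no answers, treating "yes" as the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    total = len(pairs)
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of "yes" answers; a high value hints at a bias toward answering "yes".
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }

# Made-up model answers and ground-truth labels.
predictions = ["yes", "yes", "no", "no", "yes"]
labels = ["yes", "no", "no", "yes", "yes"]
metrics = pope_style_metrics(predictions, labels)
print(metrics)
```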

The evaluations require the MSCOCO 2014, AOKVQA, GQA, and Visual Genome datasets. Please download them with dataset/download_cqa.py, dataset/download_ha_dpo.py, and dataset/download_visual_genome_v1.2.py, and extract them into the data path.

You also need to prepare the following checkpoints of the 7B base models:

Arguments

| Argument | Example | Description |
| --- | --- | --- |
| --model | llava-1.5 | Specify the LVLM model. |
| --data-path | /path/to/dataset | Path to the dataset file or folder. |
| --pope-type | coco_adversarial | Type for POPE evaluation. |
| --sample | store_true | Use the modified decoding strategy. |
| --sample-greedy | store_true | Use CD with sampling and greedy decoding. |
| --beam | 5 | Beam search number. |
| --opera | store_true | Use OPERA. |
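For readers unfamiliar with the store_true entries, the sketch below shows how flags of this shape are typically declared with Python's argparse. It mirrors only the arguments listed above; it is not the parser used in pope_eval.py, and every default value here is an assumption.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented arguments (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Illustrative evaluation options")
    parser.add_argument("--model", type=str, default="llava-1.5",
                        help="Specify the LVLM model.")
    parser.add_argument("--data-path", type=str, default=None,
                        help="Path to the dataset file or folder.")
    parser.add_argument("--pope-type", type=str, default="coco_adversarial",
                        help="Type for POPE evaluation.")
    # store_true flags take no value: they are False unless present on the command line.
    parser.add_argument("--sample", action="store_true",
                        help="Use the modified decoding strategy.")
    parser.add_argument("--sample-greedy", action="store_true",
                        help="Use CD with sampling and greedy decoding.")
    parser.add_argument("--beam", type=int, default=1,
                        help="Beam search number.")
    parser.add_argument("--opera", action="store_true",
                        help="Use OPERA.")
    return parser

# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["--pope-type", "coco_adversarial", "--model", "llava-1.5", "--beam", "5", "--opera"]
)
print(args.model, args.pope_type, args.beam, args.opera, args.sample)
```

Note that argparse exposes dashed option names with underscores, so --pope-type is read as args.pope_type and --sample-greedy as args.sample_greedy.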

Acknowledgement

Parts of the code are based on the LVLM codebases of OPERA, VCD, and HA-DPO. We thank the authors for their excellent work!