This repository contains the code of LVLM-Stethoscope, a tool for diagnosing the predictions of Large Vision-Language Models (LVLMs), either in production or while developing models.
🤔️ Why Build this Project?
- Recent promising Large Vision-Language Models (LVLMs) are notorious for generating outputs that are inconsistent with the visual content, a challenge known as hallucination. However, a powerful analysis tool for investigating what happens when these hallucinated decisions are made is still absent. This motivated us to build an interactive application for understanding the internal mechanisms of LVLMs, and thus LVLM-Stethoscope was born 👶.
🔨 What can I do with LVLM-Stethoscope?
The proposed LVLM-Stethoscope bundles a series of functions that help you understand the underlying decision-making process of recent LVLMs. It can be used as, but is not limited to:
- A powerful visualization tool for model diagnosis
- A useful evaluation tool
- A hallucination warning tool
⚙️ Tested and supported models
We have tested LVLM-Stethoscope on a series of recently released LVLMs. More models are coming soon...
Models | Link | Status |
---|---|---|
LLaVA | https://huggingface.co/docs/transformers/model_doc/llava | ✅ |
BLIP-2 | https://huggingface.co/docs/transformers/model_doc/blip-2 | ✅ |
MiniGPT-4 | https://github.com/Vision-CAIR/MiniGPT-4 | ✅ |
Conv-LLaVA | https://github.com/alibaba/conv-llava | ✅ |
InstructBLIP | https://huggingface.co/docs/transformers/model_doc/instructblip | TODO |
mPLUG-Owl2 | https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2 | TODO |
🧑‍💻 Your own | See how to visualize your own LVLMs | - |
🍇 Updates
2024/07/12: We release the demo code based on Gradio; four LVLMs are supported!
Clone our repository, then create a Python environment and activate it via the following commands.
```bash
git clone https://github.com/Ziwei-Zheng/LVLM-Stethoscope.git
cd LVLM-Stethoscope
conda create --name lvlm-ss python=3.10
conda activate lvlm-ss
pip install -r requirements.txt
```
Specify the model name from `["llava-hf/llava-1.5-7b-hf", "Salesforce/blip2-opt-6.7b", "minigpt4-7b", "ConvLLaVA-sft-1536/1024/768"]` to run the model, then open the local URL to start a conversation (default: http://127.0.0.1:7860).
```bash
CUDA_VISIBLE_DEVICES=[GPUS] python demo_meta.py --model-name [MODEL_NAME]
```
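For example, to launch the LLaVA-1.5 demo on two GPUs (the GPU ids below are only illustrative):

```bash
CUDA_VISIBLE_DEVICES=0,1 python demo_meta.py --model-name llava-hf/llava-1.5-7b-hf
```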
Note that since not all sub-models (e.g. the ViT & Q-former) in the MiniGPT-4 codebase use transformers backends, they cannot be run across multiple GPUs. You can either 1) run the whole model on a single GPU (> 40GB memory), or 2) run the sub-models that are not included in transformers on CPU by passing `--part-on-cpu`, while keeping the other parts (e.g. LLaMA) on one or more GPUs to save memory.

We have tested llava-1.5-7b, blip2-opt-6.7b and ConvLLaVA-sft-1536/1024/768 on 2 RTX 4090 GPUs with 24GB memory, and minigpt4-7b on 1 RTX 4090 GPU with `--part-on-cpu` enabled.
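For the MiniGPT-4 single-GPU setup, an invocation along these lines should work (assuming `--part-on-cpu` is a boolean flag; the GPU id is only illustrative):

```bash
CUDA_VISIBLE_DEVICES=0 python demo_meta.py --model-name minigpt4-7b --part-on-cpu
```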
The main functionalities and the visualization interface have been encapsulated and integrated into `demo_meta.py`; you only need to customize your own model in the following steps (a schematic skeleton is sketched after this list):

- Move the model specification and necessary files into the directory, e.g. `./minigpt4`.
- Create a folder in `./models` that indicates your own model, and create `generate.py`.
- Define [self.model, self.tokenizer, self.image_processor, self.model.num_img_patches, self.model.num_img_tokens, self.model.num_llm_layers, self.model.lm_head] as the necessary attributes in `__init__()`.
- Define `register_hooks()` to create hooks in the self/cross-attention layers for relevancy analysis.
- Define `chat()` with the image and user question as inputs, returning the generated per-token answer ids.
- Define `forward_with_grads()` with the generated answer inserted into the conversation template for a single parallel forward pass to obtain per-token answer logits. Then carefully define the indexes of the `<Img>`, `<Qus>` and `<Ans>` tokens according to your prompt organization. These indexes are used to locate specific positions in the whole output for further analysis.
- Define `compute_relevancy()` to specify how relevancy scores are obtained for your model. We have provided computations for vanilla transformers like `ViT` and `LLaMA`, and for architectures with mixed attentions like the `Q-former`, in `relevancy_utils.py`. You can also add customized functions if needed.
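As a rough reference, below is a minimal structural sketch of such a `generate.py`, assuming a hypothetical wrapper class named `MyLVLM`. Only the attribute and method names follow the steps above; the concrete values, module paths, generation arguments and method bodies are placeholders that must be adapted to your own model.

```python
# Hypothetical ./models/my_lvlm/generate.py -- a structural sketch only.
# Attribute/method names follow the steps above; concrete values, module
# paths and tensor layouts are assumptions that must be adapted.
import torch


class MyLVLM:
    def __init__(self, model, tokenizer, image_processor):
        # Required attributes used by the visualization pipeline.
        self.model = model
        self.tokenizer = tokenizer
        self.image_processor = image_processor
        self.model.num_img_patches = 576      # e.g. a 24 x 24 ViT patch grid (assumption)
        self.model.num_img_tokens = 576       # visual tokens fed into the LLM (assumption)
        self.model.num_llm_layers = 32        # number of LLM decoder layers (assumption)
        self.model.lm_head = model.lm_head    # maps final hidden states to vocabulary logits
        self.attn_maps = []                   # filled by the hooks registered below

    def register_hooks(self):
        # Keep attention maps (and their gradients) from every self-attention
        # layer so relevancy scores can be computed later. Assumes the layers
        # are called with output_attentions=True so output[1] holds the weights.
        def save_attn(module, inputs, output):
            attn = output[1]
            if attn is not None and attn.requires_grad:
                attn.retain_grad()
            self.attn_maps.append(attn)

        for layer in self.model.language_model.model.layers:  # path is model-specific
            layer.self_attn.register_forward_hook(save_attn)

    @torch.no_grad()
    def chat(self, image, question):
        # Run normal generation and return only the per-token answer ids.
        pixel_values = self.image_processor(image, return_tensors="pt").pixel_values
        input_ids = self.tokenizer(question, return_tensors="pt").input_ids
        output_ids = self.model.generate(
            input_ids=input_ids, pixel_values=pixel_values, max_new_tokens=256)
        return output_ids[0, input_ids.shape[1]:]

    def forward_with_grads(self, image, question, answer_ids):
        # Insert the generated answer back into the conversation template and
        # run a single parallel forward pass (with gradients enabled) to obtain
        # per-token answer logits; also record the index ranges of the <Img>,
        # <Qus> and <Ans> tokens for later analysis.
        raise NotImplementedError

    def compute_relevancy(self, token_idx):
        # Back-propagate from the chosen answer token and aggregate the saved
        # attention maps into relevancy scores (see relevancy_utils.py).
        raise NotImplementedError
```

The models already supported in this repository can serve as concrete references for how each method is implemented in practice.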
If you have any questions about building your own customized model, please feel free to open an issue or contact me at ziwei.zheng@stu.xjtu.edu.cn.
- Transformer-MM-Explainability: introduces the relevancy scores used for explainability analysis.
- LVLM-Interpret: another awesome tool for interpreting LVLMs.
If you find this project helpful for your research, please consider citing the following BibTeX entry.