NoteMR

This is the official implementation of the paper "Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering", accepted at CVPR 2025.

Abstract

The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large language models (MLLMs), recent methods have begun leveraging the MLLM as an implicit knowledge base for reasoning. However, directly employing the MLLM with raw external knowledge might result in reasoning errors due to misdirected knowledge information. Additionally, the MLLM may lack fine-grained perception of visual features, which can result in hallucinations during reasoning. To address these challenges, we propose Notes-guided MLLM Reasoning (NoteMR), a novel framework that guides the MLLM toward better reasoning by utilizing knowledge notes and visual notes. Specifically, we initially obtain explicit knowledge from an external knowledge base. This explicit knowledge, combined with the image, is then used to assist the MLLM in generating knowledge notes. These notes are designed to filter the explicit knowledge and identify relevant implicit knowledge within the MLLM. We then identify highly correlated regions between the image and the knowledge notes, retaining them as visual notes to enhance the model's fine-grained perception and thereby mitigate MLLM-induced hallucinations. Finally, both notes are fed into the MLLM, enabling a more comprehensive understanding of the image-question pair and enhancing the model's reasoning capabilities. Our method achieves state-of-the-art performance on the OK-VQA and A-OKVQA datasets, demonstrating its robustness and effectiveness across diverse VQA scenarios.

Model Architecture

The framework of Notes-guided MLLM Reasoning (NoteMR).

Environment Requirements

The experiments were conducted on an NVIDIA RTX A6000 GPU with 48 GB of memory.

  • Python 3.10.14
  • PyTorch 2.0.1
  • CUDA 11.7

To run the MLLM reasoning code, you need to install the requirements:

pip install -r requirements.txt

Data Download

We evaluate our model on two publicly available KB-VQA datasets.

  • OK-VQA (Paper: OKVQA)
  • A-OKVQA (Paper: AOKVQA, GitHub: AOKVQA)

Run Code

Step 1-1: Retrieval (FLMR/PreFLMR)

Paper: FLMR | GitHub: FLMR | Paper: PreFLMR | Hugging Face: PreFLMR

We use the pre-trained PreFLMR as the knowledge retriever to extract the top-k passages relevant to the input image and the question.
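
At a high level, FLMR/PreFLMR score each (image, question) query against candidate passages with late interaction: every query token is matched to its most similar passage token, and the per-token maxima are summed into one relevance score. The snippet below is only a conceptual sketch of that scoring step with random placeholder embeddings; it does not use the official FLMR/PreFLMR classes, whose actual loading and retrieval API is documented in the linked repository.

import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, passage_embs, k=5):
    """Late-interaction (MaxSim) scoring as used by FLMR/PreFLMR-style retrievers:
    each query token is matched to its best passage token, and the per-token
    maxima are summed into a single relevance score per passage."""
    # query_emb:    (num_query_tokens, dim)   joint image+question representation
    # passage_embs: (num_passages, num_passage_tokens, dim)
    sim = torch.einsum("qd,npd->nqp", query_emb, passage_embs)  # (N, Q, P)
    scores = sim.max(dim=-1).values.sum(dim=-1)                 # (N,)
    return torch.topk(scores, k=k).indices

# Stand-in embeddings; in practice these come from the PreFLMR query and
# context encoders.
query = F.normalize(torch.randn(32, 128), dim=-1)
passages = F.normalize(torch.randn(1000, 64, 128), dim=-1)
top_k_ids = retrieve_top_k(query, passages, k=5)
print(top_k_ids)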

Step 1-2: Generate Knowledge Notes

python generate_knowledge_notes.py
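
As a rough illustration of what this step does, a knowledge note can be generated by prompting the MLLM with the image, the question, and the top-k retrieved passages, asking it to keep only the relevant facts and supplement them with its own implicit knowledge. The sketch below assumes a hypothetical mllm_generate(image_path, prompt) helper standing in for whatever MLLM backend generate_knowledge_notes.py wraps; the actual prompts and model calls are defined in that script.

# Hypothetical helper: `mllm_generate(image_path, prompt)` stands in for the
# MLLM backend wrapped by generate_knowledge_notes.py; it is not part of
# this repository's actual API.

def build_knowledge_note_prompt(question, passages):
    """Ask the MLLM to distill the retrieved passages (explicit knowledge)
    plus its own implicit knowledge into a short, question-focused note."""
    knowledge = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "You are given a question about the image and several retrieved passages.\n"
        f"Question: {question}\n"
        f"Retrieved knowledge:\n{knowledge}\n"
        "Write a brief knowledge note: keep only the facts relevant to the "
        "question, discard misleading passages, and add any useful background "
        "knowledge you are confident about."
    )

def generate_knowledge_note(mllm_generate, image_path, question, top_k_passages):
    prompt = build_knowledge_note_prompt(question, top_k_passages)
    return mllm_generate(image_path, prompt)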

Step 2: Generate Visual Notes (Grad-CAM)

Paper: Grad-CAM | GitHub: Grad-CAM
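
Grad-CAM highlights the image regions that most influence a target score; NoteMR uses such relevance maps to keep the regions most correlated with the knowledge note as the visual note. Below is a generic, minimal Grad-CAM sketch in PyTorch rather than the repository's implementation; target_score_fn is assumed to reduce the model output to a scalar, for example an image-text similarity with the knowledge note.

import torch
import torch.nn.functional as F

class GradCAM:
    """Minimal Grad-CAM: weight a conv layer's activations by the gradient of a
    target score, then ReLU and normalize into a spatial relevance map."""

    def __init__(self, model, target_layer):
        self.model = model
        self.activations = None
        self.gradients = None
        target_layer.register_forward_hook(self._save_activation)
        target_layer.register_full_backward_hook(self._save_gradient)

    def _save_activation(self, module, inputs, output):
        self.activations = output.detach()

    def _save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def __call__(self, image, target_score_fn):
        # image: (1, 3, H, W); target_score_fn maps the model output to a scalar.
        self.model.zero_grad()
        output = self.model(image)
        target_score_fn(output).backward()
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)          # (1, C, 1, 1)
        cam = F.relu((weights * self.activations).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam  # (1, 1, H, W) relevance map in [0, 1]

# A visual note can then be obtained by keeping only the highly activated
# regions, e.g. `visual_note = image * (cam > 0.5)`.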

Step 3: Generate Output

python generate_output.py
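
For reference, the final answer is produced by feeding the original image, the visual note, and the knowledge note to the MLLM together with the question. The sketch below again uses the hypothetical mllm_generate stand-in; the actual prompt templates and model invocation live in generate_output.py.

# Hypothetical stand-in again: the real prompt templates and MLLM call live in
# generate_output.py; this only sketches how the two notes come together.

def build_answer_prompt(question, knowledge_note):
    return (
        "Answer the question about the image. The first image is the original; "
        "the second is a visual note highlighting the most relevant regions.\n"
        f"Knowledge note: {knowledge_note}\n"
        f"Question: {question}\n"
        "Answer with a short phrase."
    )

def generate_answer(mllm_generate, image_path, visual_note_path,
                    question, knowledge_note):
    prompt = build_answer_prompt(question, knowledge_note)
    # Both the original image and the visual note are passed to the MLLM
    # together with the knowledge note.
    return mllm_generate([image_path, visual_note_path], prompt)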

Papers for the Project & How to Cite

If you use or extend our work, please cite the paper as follows:

@InProceedings{Fang_2025_CVPR,
    author    = {Fang, Wenlong and Wu, Qiaofeng and Chen, Jing and Xue, Yun},
    title     = {Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {19597-19607}
}