This is the official implementation of the paper "Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering", accepted by CVPR 2025.
The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large language models (MLLMs), recent methods have begun to leverage the MLLM as an implicit knowledge base for reasoning. However, directly supplying the MLLM with raw external knowledge can cause reasoning errors due to misleading knowledge. Additionally, the MLLM may lack fine-grained perception of visual features, which can result in hallucinations during reasoning. To address these challenges, we propose Notes-guided MLLM Reasoning (NoteMR), a novel framework that guides the MLLM toward better reasoning by utilizing knowledge notes and visual notes. Specifically, we first obtain explicit knowledge from an external knowledge base. This explicit knowledge, combined with the image, is then used to help the MLLM generate knowledge notes. These notes are designed to filter the explicit knowledge and surface relevant implicit knowledge within the MLLM. We then identify highly correlated regions between the image and the knowledge notes, retaining them as visual notes to enhance the model's fine-grained perception and thereby mitigate MLLM-induced hallucinations. Finally, both notes are fed into the MLLM, enabling a more comprehensive understanding of the image-question pair and enhancing the model's reasoning capabilities. Our method achieves state-of-the-art performance on the OK-VQA and A-OKVQA datasets, demonstrating its robustness and effectiveness across diverse VQA scenarios.
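For orientation, below is a minimal sketch of the three-step flow described above (knowledge note, then visual note, then note-guided answering). The helpers `mllm_generate` and `select_image_note` are hypothetical placeholders for the MLLM backend and the region-selection step; they are not functions exported by this repository.

```python
# A minimal sketch of the NoteMR inference flow, assuming hypothetical helpers
# `mllm_generate` (image, prompt -> text) and `select_image_note`
# (image, note -> cropped/masked image). Neither is part of this repository's API.

from typing import Callable, List


def notemr_answer(
    image,                                      # query image (format depends on the MLLM backend)
    question: str,                              # the VQA question
    retrieved_passages: List[str],              # explicit knowledge from the external retriever
    mllm_generate: Callable[..., str],          # hypothetical: (image, prompt) -> generated text
    select_image_note: Callable[..., object],   # hypothetical: keeps regions correlated with the note
) -> str:
    # Step 1: condense the retrieved passages and the image into a knowledge note.
    # The note filters noisy explicit knowledge and surfaces the MLLM's implicit knowledge.
    knowledge_prompt = (
        "Question: " + question + "\n"
        "Retrieved knowledge:\n" + "\n".join(retrieved_passages) + "\n"
        "Write a short note containing only the facts needed to answer the question."
    )
    knowledge_note = mllm_generate(image, knowledge_prompt)

    # Step 2: keep the image regions most correlated with the knowledge note
    # as a visual note, to sharpen fine-grained perception.
    visual_note = select_image_note(image, knowledge_note)

    # Step 3: answer the question with both notes as guidance.
    answer_prompt = (
        "Question: " + question + "\n"
        "Knowledge note: " + knowledge_note + "\n"
        "Answer the question concisely."
    )
    return mllm_generate(visual_note, answer_prompt)
```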
The framework of Notes-guided MLLM Reasoning (NoteMR).
The experiments were conducted on an NVIDIA RTX A6000 GPU with 48 GB of memory.
- Python 3.10.14
- PyTorch 2.0.1
- CUDA 11.7
To run the MLLM reasoning code, you need to install the requirements:
pip install -r requirements.txt
We evaluate our model on two publicly available KB-VQA datasets.
- OK-VQA
- A-OKVQA
We use the pre-trained PreFLMR knowledge retriever to extract the top-k passages relevant to the input image and question.
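PreFLMR itself is a late-interaction multi-vector retriever; the sketch below only illustrates the generic top-k selection step, using single-vector embeddings as a simplification. `query_vec` and `passage_vecs` stand in for encoder outputs and are assumptions, not the PreFLMR API.

```python
# Illustrative top-k passage selection. The embeddings here are placeholders for
# the outputs of the PreFLMR query/context encoders, not the actual PreFLMR API.

import numpy as np


def top_k_passages(query_vec: np.ndarray,
                   passage_vecs: np.ndarray,
                   passages: list,
                   k: int = 5) -> list:
    # Score every passage by dot-product similarity with the query embedding,
    # then keep the k highest-scoring passages.
    scores = passage_vecs @ query_vec
    top_idx = np.argsort(-scores)[:k]
    return [passages[i] for i in top_idx]


# Example with random embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
passages = [f"passage {i}" for i in range(100)]
passage_vecs = rng.normal(size=(100, 128))
query_vec = rng.normal(size=128)
print(top_k_passages(query_vec, passage_vecs, passages, k=5))
```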
python .\generate_knowledge_notes.py
python .\generate_output.py
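For intuition, the visual-note idea of keeping the image regions most correlated with the knowledge note can be approximated with an off-the-shelf image-text model. The sketch below scores coarse grid crops with CLIP and keeps the best-matching crop; it is only an illustrative stand-in, not the mechanism implemented in `generate_output.py`.

```python
# Illustrative region selection for a visual note: score grid crops of the image
# against the knowledge note with CLIP and keep the most similar crop.
# This is an approximation for intuition only, not the paper's exact mechanism.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def select_image_note(image: Image.Image, knowledge_note: str, grid: int = 3) -> Image.Image:
    """Return the grid crop whose CLIP embedding best matches the knowledge note."""
    w, h = image.size
    crops = [
        image.crop((i * w // grid, j * h // grid, (i + 1) * w // grid, (j + 1) * h // grid))
        for i in range(grid) for j in range(grid)
    ]
    inputs = processor(text=[knowledge_note], images=crops,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_crops, 1): similarity of each crop to the note.
    best = out.logits_per_image.squeeze(-1).argmax().item()
    return crops[best]
```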
If you use or extend our work, please cite the paper as follows:
@InProceedings{Fang_2025_CVPR,
    author    = {Fang, Wenlong and Wu, Qiaofeng and Chen, Jing and Xue, Yun},
    title     = {Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {19597-19607}
}