This repository contains the code for See What You Are Told: Visual Attention Sink in Large Multimodal Models, an approach designed to mitigate object hallucination in large multimodal models. It builds on LLaVA and introduces custom modifications and enhancements.
Begin by cloning the repository and navigating into the project directory:
$ git clone {{ REPO_URL }}
$ cd VisAttnSink
To reproduce the exact development environment, follow these steps:
- Create the Conda Environment:
$ conda create -n VisAttnSink --file env_conda.txt
- Install the Python Dependencies:
$ pip install -r env_pip.txt
These commands ensure that your environment matches the dependencies required for the project.
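After installation, a quick sanity check can confirm that the core dependencies resolved correctly. The snippet below is a minimal sketch, assuming PyTorch and Transformers are among the pinned requirements, as is typical for LLaVA-based projects; it is not part of the repository itself.

# sanity_check.py -- minimal environment check (assumes torch and transformers
# are among the pinned dependencies of this LLaVA-based project)
import torch
import transformers

print(f"PyTorch version:      {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available:       {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:                  {torch.cuda.get_device_name(0)}")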
The repository expects a specific directory structure for the datasets. Ensure your datasets follow the format below:
./D_datasets
└── {{ DATASET_NAME }}
    ├── Questions
    │   └── {{ CATEGORY_NAME }}-questions.jsonl
    └── Images
        ├── {{ IMAGE_NAME_1 }}.jpg
        ├── {{ IMAGE_NAME_2 }}.jpg
        └── ...
Replace the placeholder values (e.g., {{ DATASET_NAME }}, {{ CATEGORY_NAME }}) with the actual names corresponding to your dataset.
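To catch path problems early, you can verify that a dataset follows this layout before launching an experiment. The following is a minimal sketch under the directory structure above; the script name and command-line usage are illustrative, not part of the repository.

# check_dataset.py -- verify the expected D_datasets layout (illustrative sketch)
import sys
from pathlib import Path

def check_dataset(dataset_name: str, category_name: str) -> bool:
    root = Path("D_datasets") / dataset_name
    questions = root / "Questions" / f"{category_name}-questions.jsonl"
    images = root / "Images"

    ok = True
    if not questions.is_file():
        print(f"Missing question file: {questions}")
        ok = False
    if not images.is_dir() or not any(images.glob("*.jpg")):
        print(f"No .jpg images found in: {images}")
        ok = False
    return ok

if __name__ == "__main__":
    # Usage: python check_dataset.py <DATASET_NAME> <CATEGORY_NAME>
    sys.exit(0 if check_dataset(sys.argv[1], sys.argv[2]) else 1)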
Before running an experiment, configure your experiment settings using a YAML configuration file. Below is an example configuration:
name_exp: {{ EXPERIMENT_NAME }}
name_dataset: {{ DATASET_NAME }}
name_category: {{ CATEGORY_NAME }}
path_image_dir: {{ PATH_IMAGE_DIR }}
path_question_dir: D_datasets/{{ DATASET_NAME }}/Questions
path_model: {{ PATH_MODEL }}
conv_mode: vicuna_v1
logic: 1
tau: 20
rho: 0.5
beta: 0.6
summ: 0.2
max_new_tokens: 128
except_last_layer: 1
Make sure to update the placeholder values with your specific settings:
- name_exp: Name of your experiment.
- name_dataset: The dataset name.
- name_category: The category for the questions.
- path_image_dir: Directory path containing images.
- path_question_dir: Directory path for question files.
- path_model: Path to the model checkpoint.
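If you want to inspect or script over a configuration before running, the file can be read with PyYAML once the placeholders are filled in. This is a minimal sketch, assuming PyYAML is available in the environment; the file name config.yaml is a placeholder.

# load_config.py -- read an experiment configuration (illustrative sketch,
# assumes PyYAML is installed; "config.yaml" is a placeholder file name)
import yaml

with open("config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

# Basic checks on the fields described above
required = ["name_exp", "name_dataset", "name_category",
            "path_image_dir", "path_question_dir", "path_model"]
missing = [key for key in required if key not in cfg]
if missing:
    raise KeyError(f"Missing config fields: {missing}")

print(f"Experiment: {cfg['name_exp']} "
      f"(tau={cfg['tau']}, rho={cfg['rho']}, beta={cfg['beta']})")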
We follow the LLaVA evaluation methodology. For detailed evaluation instructions, please refer to the evaluation documentation in the LLaVA repository.
All code and algorithm logic in this repository are licensed under the Apache 2.0 license.
This repository is built on the foundation provided by LLaVA and integrates several experimental enhancements aimed at optimizing visual attention in large multimodal models.
For any questions or issues, please open an issue on the repository.