Framework for merging a text-based Reward Model (RM) with a Large Vision-Language Model (LVLM). LVLMs excel at visual tasks, while text-based RMs struggle to provide accurate rewards without visual context. Our approach transfers textual preferences to vision-language understanding, resulting in a Vision-Language Reward Model (VLRM). All icons used in this figure are sourced from https://www.flaticon.com/
This repository contains the implementation for our paper Transferring Textual Preferences to Vision-Language Understanding through Model Merging. We present methods to merge text reward models with vision-language models to enhance multimodal preference alignment.
git clone https://github.com/lca0503/MergeToVLRM.git
cd MergeToVLRM
pip install -r requirements.txt
Install the mergekit library to facilitate model merging workflows.
First, extract the language model components from the vision-language model and the text reward model:
- Large Vision-Language Model: `meta-llama/Llama-3.2-11B-Vision-Instruct`
- Text Reward Models: `allenai/llama-3.1-tulu-2-8b-uf-mean-rm` and `allenai/Llama-3.1-Tulu-3-8B-RM` (see Note 2 below)
Run the extraction script:
bash scripts/merging/run_extract_tlm.sh
The extracted models will be saved in:
- LVLM text component: `./models/mllama_t/`
- Text Reward Model: `./models/tulu_t/`
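For intuition, below is a minimal sketch of what this extraction does, assuming the LVLM's text backbone sits under a `language_model.` prefix in the state dict and that the vision-specific cross-attention layers are filtered out. The key names and output filename are illustrative assumptions; `run_extract_tlm.sh` is the authoritative implementation.

```python
# Illustrative sketch only: pull the text-decoder weights out of the LVLM state dict.
# The "language_model." prefix and the "cross_attn" filter are assumptions about the
# Mllama checkpoint layout; the repo's extraction script is the source of truth.
import os
import torch
from transformers import MllamaForConditionalGeneration

lvlm = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct", torch_dtype=torch.bfloat16
)

text_state_dict = {
    key.removeprefix("language_model."): value
    for key, value in lvlm.state_dict().items()
    if key.startswith("language_model.") and "cross_attn" not in key
}
os.makedirs("./models/mllama_t", exist_ok=True)
torch.save(text_state_dict, "./models/mllama_t/text_backbone.pt")  # hypothetical filename
```

The merged VLRMs are then produced with one of the following merging methods: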
Merging Method | Script |
---|---|
Linear | bash scripts/merging/run_linear.sh |
Task Vector | bash scripts/merging/run_task_vector.sh |
TIES | bash scripts/merging/run_ties.sh |
DARE with Task Vector | bash scripts/merging/run_dare_linear.sh |
DARE with TIES | bash scripts/merging/run_dare_ties.sh |
Note 1: Models will be saved in `./models_vlseq/`. Merging configurations are located in the `scripts/merging/config/` directory.
Note 2: By default, this repo uses `allenai/llama-3.1-tulu-2-8b-uf-mean-rm`. To test with `allenai/Llama-3.1-Tulu-3-8B-RM`, update `./scripts/merging/run_extract_tlm.sh` and each `scripts/merging/run_${method}.sh` script accordingly.
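Task Vector, TIES, and DARE all build on the parameter difference between the text RM and the shared base text model, while Linear interpolates the weights directly. As a rough illustration of the core idea, here is a task-vector-style merge over plain state dicts; the file paths and the coefficient `lam` are hypothetical, and the actual merges are produced by mergekit using the configs in `scripts/merging/config/`.

```python
# Sketch of task-vector merging: merged = lvlm_text + lam * (rm - base).
# Paths and the coefficient value are illustrative; mergekit performs the real merges.
import os
import torch

lam = 0.5  # merge coefficient (the search scripts below sweep values like this)

lvlm_text = torch.load("./models/mllama_t/text_backbone.pt")  # LVLM text component
rm = torch.load("./models/tulu_t/reward_model.pt")            # text reward model
base = torch.load("./models/base_llm/base.pt")                # shared base text LLM

merged = {
    name: lvlm_text[name] + lam * (rm[name] - base[name])
    for name in lvlm_text
    if name in rm and name in base
}
os.makedirs("./models_vlseq", exist_ok=True)
torch.save(merged, "./models_vlseq/merged_text_backbone.pt")
```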
To evaluate a text RM on VL-RewardBench, run:
python3 src/vl_rewardbench.py \
--output_path ${path_to_save_output} \
--model_id ${model_name} \
--text_only
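With `--text_only`, the RM never sees the image: it scores the prompt and candidate responses as plain text. Below is a hedged sketch of how such a scalar reward can be obtained, assuming the Tulu RM loads as a sequence-classification model with a single reward head (check the model card for the exact class and chat template).

```python
# Sketch: score a text-only (prompt, response) pair with a sequence-classification RM.
# The model class and chat-template handling are assumptions; see the RM's model card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "allenai/llama-3.1-tulu-2-8b-uf-mean-rm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
rm = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
    {"role": "user", "content": "Describe the image."},         # text only: the image is never passed
    {"role": "assistant", "content": "A dog runs on a beach."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    reward = rm(input_ids=input_ids).logits[0, 0].item()  # scalar reward for this response
print(reward)
```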
To evaluate the Cascade approach on VL-RewardBench, run the following (you can skip the `get_caption` step, as we provide pre-saved captions in `./caption/`):
python3 src/get_caption.py \
--output_path "./caption/vl_rewardbench.json" \
--task "vl_rewardbench" \
--model_id "meta-llama/Llama-3.2-11B-Vision-Instruct"
python3 src/vl_rewardbench.py \
--output_path ${path_to_save_output} \
--model_id ${model_name} \
--caption_path "./caption/vl_rewardbench.json" \
--caption
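In the Cascade setup, the LVLM first captions the image, and the text RM then scores a caption-augmented, text-only conversation. Here is a small sketch of that prompt construction; the wording of the template is an illustrative assumption, and the real format lives in `src/vl_rewardbench.py`.

```python
# Sketch of the Cascade idea: stand in for the image with an LVLM-generated caption,
# then score the resulting text-only conversation with the text RM.
# The prompt wording below is an illustrative assumption, not the repo's exact template.
def cascade_messages(caption: str, question: str, response: str) -> list[dict]:
    user_turn = f"Image description: {caption}\n\nQuestion: {question}"
    return [
        {"role": "user", "content": user_turn},
        {"role": "assistant", "content": response},
    ]

msgs = cascade_messages(
    caption="A red car parked in front of a white house.",
    question="What color is the car?",
    response="The car is red.",
)
# `msgs` can then be scored exactly like the text-only example above.
```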
To evaluate a VLRM on VL-RewardBench, run:
python3 src/vl_rewardbench.py \
--output_path ${path_to_save_output} \
--model_id ${model_name}
To evaluate a VLRM on VL-RewardBench without using image input, run:
python3 src/vl_rewardbench.py \
--output_path ${path_to_save_output} \
--model_id ${model_name} \
--no_image
After merging models, we evaluate the effect of merging parameters on VL-RewardBench with the following scripts:
Merging Method | Script |
---|---|
Linear | bash scripts/vl_rewardbench/search_linear.sh |
Task Vector | bash scripts/vl_rewardbench/search_task_vector.sh |
TIES | bash scripts/vl_rewardbench/search_ties.sh |
DARE with Task Vector | bash scripts/vl_rewardbench/search_dare_linear.sh |
DARE with TIES | bash scripts/vl_rewardbench/search_dare_ties.sh |
Results will be saved in `./results/VL_RewardBench/`. To compute the final metrics from the saved scores, run:
python3 src/vl_rewardbench_results.py --input_path ${path_to_jsonl_with_scores}
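VL-RewardBench is a pairwise preference benchmark, so the reported number is essentially how often the model ranks the chosen response above the rejected one. A hedged sketch of that computation, assuming each scored JSONL line carries `chosen_score` and `rejected_score` fields (hypothetical names; `src/vl_rewardbench_results.py` defines the actual format and metrics):

```python
# Sketch: pairwise preference accuracy from a JSONL of reward scores.
# The field names "chosen_score" / "rejected_score" are illustrative assumptions.
import json

def preference_accuracy(path: str) -> float:
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            correct += record["chosen_score"] > record["rejected_score"]
            total += 1
    return correct / total

print(preference_accuracy("./results/VL_RewardBench/scores.jsonl"))  # hypothetical path
```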
This guide shows how to run Best-of-N evaluation for the following tasks:
- `textvqa_val`
- `mmmu_pro_standard_cot`
- `mmmu_pro_vision_cot`

Results for the first three steps are pre-saved in `./best_of_n/`.
Install the lmms-eval library.
Generate N=8 responses using the `meta-llama/Llama-3.2-11B-Vision-Instruct` model and save the outputs to `./logs`:
bash scripts/generation/gen_${task_name}.sh ./logs
python3 src/aggregate_generation.py \
--input_dir ./logs/${task_name} \
--output_path ./best_of_n/${task_name}.jsonl \
--task ${task_name}
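Scoring and evaluation (below) then reduce to Best-of-N selection: score each of the N=8 candidates with the reward model and keep the highest-scoring one per question. A minimal sketch of that selection step, with hypothetical field names (`src/get_scores.py` and `src/get_results.py` implement the real pipeline):

```python
# Sketch of Best-of-N selection: pick the highest-reward candidate per question.
# The "question"/"responses" field names and score_fn signature are illustrative assumptions.
import json
from typing import Callable, List

def best_of_n(path: str, score_fn: Callable[[str, str], float]) -> List[str]:
    picks = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            question, candidates = record["question"], record["responses"]  # N=8 candidates
            picks.append(max(candidates, key=lambda r: score_fn(question, r)))
    return picks

# `score_fn` would wrap the (merged) reward model, e.g. the text-only scoring sketch above.
```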
To evaluate a text RM using Best-of-N, run:
python3 src/get_scores.py \
--input_path ${path_to_responses} \
--output_path ${path_to_save_output} \
--task ${task_name} \
--model_id ${model_name} \
--text_only
To evaluate the Cascade approach using Best-of-N, run the following (you can skip the `get_caption` step, as we provide pre-saved captions in `./caption/`):
python3 src/get_caption.py \
--output_path "./caption/${task_name}.json" \
--task ${task_name} \
--model_id "meta-llama/Llama-3.2-11B-Vision-Instruct"
python3 src/get_scores.py \
--input_path ${path_to_responses} \
--output_path ${path_to_save_output} \
--task ${task_name} \
--model_id ${model_name} \
--caption_path "./caption/${task_name}.json" \
--caption
To evaluate a VLRM using Best-of-N, run:
python3 src/get_scores.py \
--input_path ${path_to_responses} \
--output_path ${path_to_save_output} \
--task ${task_name} \
--model_id ${model_name}
To evaluate a VLRM using Best-of-N without using image input, run:
python3 src/get_scores.py \
--input_path ${path_to_responses} \
--output_path ${path_to_save_output} \
--task ${task_name} \
--model_id ${model_name} \
--no_image
After merging models, we evaluate the effect of merging parameters using Best-of-N with the following scripts:
Merging Method | Script |
---|---|
Linear | bash scripts/${task_name}/search_linear.sh |
Task Vector | bash scripts/${task_name}/search_task_vector.sh |
TIES | bash scripts/${task_name}/search_ties.sh |
DARE with Task Vector | bash scripts/${task_name}/search_dare_linear.sh |
DARE with TIES | bash scripts/${task_name}/search_dare_ties.sh |
Results will be saved in `./results/${task_name}/`. To compute the final metrics from the saved scores, run:
python3 src/get_results.py --input_path ${path_to_jsonl_with_scores}
We sample 400 instances from the RLAIF-V training set to create our validation set. The dataset is available at `lca0503/rlaif_v_train_400`.
To evaluate a VLRM on the sampled RLAIF-V set, run:
python3 src/rlaif_v.py \
--output_path ${path_to_save_output} \
--model_id ${model_name}
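For a quick look at the validation data itself, the sampled set can presumably be loaded with the `datasets` library; the split and column names below are assumptions, so check the dataset card for the actual schema.

```python
# Sketch: load the 400-instance RLAIF-V validation sample from the Hugging Face Hub.
# The split name ("train") and the printed fields are assumptions for illustration.
from datasets import load_dataset

rlaif_v_val = load_dataset("lca0503/rlaif_v_train_400", split="train")
print(len(rlaif_v_val))       # expected: 400 sampled instances
print(rlaif_v_val[0].keys())  # inspect the available fields (e.g., image, prompt, chosen, rejected)
```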
After merging models, we evaluate the effect of merging parameters on the sampled RLAIF-V set with the following scripts:
Merging Method | Script |
---|---|
Linear | bash scripts/rlaif_v/search_linear.sh |
Task Vector | bash scripts/rlaif_v/search_task_vector.sh |
TIES | bash scripts/rlaif_v/search_ties.sh |
DARE with Task Vector | bash scripts/rlaif_v/search_dare_linear.sh |
DARE with TIES | bash scripts/rlaif_v/search_dare_ties.sh |
Results will be saved in `./results/RLAIF_V/`. To compute the final metrics from the saved scores, run:
python3 src/rlaif_v_results.py --input_path ${path_to_jsonl_with_scores}
- DogeRM: https://github.com/MiuLab/DogeRM
- mergekit: https://github.com/arcee-ai/mergekit
- lmms-eval: https://github.com/EvolvingLMMs-Lab/lmms-eval
- VL_RewardBench: https://github.com/vl-rewardbench/VL_RewardBench
- MMMU: https://github.com/MMMU-Benchmark/MMMU
If you find our code or models helpful, please consider citing our paper using the following BibTeX:
@article{li2025transferring,
title={Transferring Textual Preferences to Vision-Language Understanding through Model Merging},
author={Li, Chen-An and Lin, Tzu-Han and Chen, Yun-Nung and Lee, Hung-yi},
journal={arXiv preprint arXiv:2502.13487},
year={2025}
}