Figure 1. Overview of Region-R1: Given an image-question pair, Region-R1 adaptively decides whether to retain the full image or crop a question-relevant region before scoring retrieved candidates, improving multi-modal re-ranking.
- [2026/04/07] We released Region-R1, a reinforcement learning framework for query-side region cropping in multi-modal re-ranking. Check out our paper for more details.
- [2026/04/06] 🎊 Our paper has been accepted to ACL 2026 (Findings)!!!
- Query-Side Region Cropping: Formulates region selection as a decision-making problem: the model learns whether to retain the full image or focus on a question-relevant region to remove visual distractors before re-ranking.
- r-GRPO: Proposes region-aware group relative policy optimization with decision-balanced group sampling, handling the structured action space where the crop/no-crop decision is discrete while bounding boxes are continuous.
- Candidate-Agnostic RL Framework: Optimized directly for re-ranking objectives (MRR, NDCG, Rank, Margin) without requiring access to candidate images at decision time.
- State-of-the-Art Results: Achieves conditional Recall@1 improvements of up to 20% on E-VQA and 8% on InfoSeek, outperforming EchoSight and OMGM baselines.
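To make the structured action space concrete, here is a minimal sketch of how a crop/no-crop decision plus a bounding box could be turned into pixel coordinates. The `RegionAction` type, the normalized `(x1, y1, x2, y2)` box format, and the `min_side` fallback are illustrative assumptions, not the repository's actual interface.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed format: normalized (x1, y1, x2, y2) with values in [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class RegionAction:
    """Hypothetical container for the policy's output."""
    crop: bool                  # discrete decision: crop or keep the full image
    bbox: Optional[Box] = None  # continuous box, only meaningful when crop=True


def resolve_crop_box(action: RegionAction, width: int, height: int,
                     min_side: int = 16) -> Tuple[int, int, int, int]:
    """Return the pixel box (left, top, right, bottom) to pass to the scorer."""
    full = (0, 0, width, height)
    if not action.crop or action.bbox is None:
        return full
    # Clamp each coordinate to [0, 1], then order corners so the box is well-formed.
    x1, y1, x2, y2 = (min(max(v, 0.0), 1.0) for v in action.bbox)
    left, right = sorted((round(x1 * width), round(x2 * width)))
    top, bottom = sorted((round(y1 * height), round(y2 * height)))
    # Degenerate boxes fall back to the full image (an assumed safeguard).
    if right - left < min_side or bottom - top < min_side:
        return full
    return (left, top, right, bottom)
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it can be passed to it directly.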
pip install torch torchvision
pip install transformers peft trl
pip install qwen-vl-utils
pip install datasets
pip install wandb
pip install Pillow matplotlib pandas numpy
pip install faiss-gpu # or faiss-cpu

- Python 3.10+
- CUDA-capable GPU (tested on A100/H100)
Create a re-ranking dataset from the InfoSeek or E-VQA knowledge base with EVA-CLIP hard negatives:
# Set dataset name: "InfoSeek" (default) or "EVQA"
export DATASET_NAME=InfoSeek
# Set external data path (optional, defaults to ~/Desktop/<DATASET_NAME>-data)
export DATA_ROOT=/path/to/dataset-data
python src/prepare_data.py --num_entries 1000 --visualize_samples 20

| Argument | Default | Description |
|---|---|---|
| `--num_entries` | 1000 | Number of dataset entries to create |
| `--similarity_tolerance` | 0.05 | Hard negative similarity tolerance |
| `--visualize_samples` | 20 | Number of sample visualizations to generate |
| `--device` | `cuda:0` | CUDA device |
Output: Reranker_Dataset/<DATASET_NAME>_top10_parquet/ (train.parquet, val.parquet)
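As a rough illustration of how the environment variables and the output location fit together, the sketch below resolves them in Python. The `resolve_paths` helper is hypothetical, and treating `Reranker_Dataset/` as relative to the working directory is an assumption; the script's actual resolution logic may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_paths(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper mirroring the documented defaults."""
    env = os.environ if env is None else env
    dataset = env.get("DATASET_NAME", "InfoSeek")  # documented default
    # Documented fallback: ~/Desktop/<DATASET_NAME>-data
    default_root = os.path.expanduser(f"~/Desktop/{dataset}-data")
    data_root = Path(env.get("DATA_ROOT") or default_root)
    # Assumed to be relative to the current working directory.
    out_dir = Path("Reranker_Dataset") / f"{dataset}_top10_parquet"
    return {
        "data_root": data_root,
        "train": out_dir / "train.parquet",
        "val": out_dir / "val.parquet",
    }
```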
Region-R1 uses the following models:
| Component | Model | Role |
|---|---|---|
| Policy Network | Qwen2.5-VL-3B-Instruct | Decides crop/no-crop and predicts bounding boxes (fine-tuned with LoRA) |
| Scoring Model | EVA-CLIP-8B | Computes cosine similarity for re-ranking reward (frozen) |
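The scoring step can be pictured as plain cosine-similarity ranking over embeddings. The sketch below uses toy vectors in place of EVA-CLIP-8B features; the function names `cosine`, `rerank`, and `rank_of` are illustrative and not taken from the repository.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query_emb: Sequence[float],
           candidate_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted by descending similarity to the query."""
    sims = [cosine(query_emb, c) for c in candidate_embs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


def rank_of(positive_idx: int, order: Sequence[int]) -> int:
    """1-based position of the correct candidate in the re-ranked order."""
    return list(order).index(positive_idx) + 1
```

In this picture, the query embedding would come from either the full image or the cropped region, and the reward compares the position of the correct candidate under the two orderings.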
python main.py --reward_type mixture

| Argument | Default | Description |
|---|---|---|
| `--reward_type` | `mixture` | Reward function: mixture (recommended) or absolute |
| `--resume_from_checkpoint` | None | Path to checkpoint directory to resume from |
| `--log_level` | `INFO` | Logging level: DEBUG, INFO, WARNING, ERROR |
| `--weight_mrr` | 0.2 | Weight for delta-MRR in mixture reward |
| `--weight_ndcg` | 0.2 | Weight for delta-NDCG |
| `--weight_rank` | 0.2 | Weight for delta-rank |
| `--weight_margin` | 0.4 | Weight for delta-margin |
| `--initial_encouragement` | 0.0 | Initial encouragement bonus for cropping |
| `--encouragement_decay_steps` | 5000 | Steps for encouragement to decay to zero |
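One way to read the two encouragement arguments is as a bonus that shrinks to zero over training. The sketch below assumes a linear schedule; the actual decay shape, and how the bonus is combined with the reward, are not specified here and may differ.

```python
def encouragement_bonus(step: int, initial: float = 0.0,
                        decay_steps: int = 5000) -> float:
    """Bonus for choosing to crop at a given training step (assumed linear decay)."""
    if decay_steps <= 0 or step >= decay_steps:
        return 0.0
    # Linear interpolation from `initial` at step 0 down to 0 at `decay_steps`.
    return initial * (1.0 - max(step, 0) / decay_steps)
```

With the default `initial=0.0` the bonus is zero at every step, so the schedule only matters when `--initial_encouragement` is set to a positive value.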
Training metrics are logged to both TensorBoard and Weights & Biases. Checkpoints are saved to runs/<RUN_NAME>/checkpoints/.
Absolute (--reward_type absolute): Direct MRR/NDCG score of the cropped image against candidates.
Mixture (--reward_type mixture): Weighted combination of four delta metrics (crop vs. baseline):
- Delta-MRR: Change in Mean Reciprocal Rank
- Delta-NDCG: Change in Normalized Discounted Cumulative Gain
- Delta-Rank: Improvement in rank of the correct candidate (log scale)
- Delta-Margin: Change in similarity gap between positive and hardest negative
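The four delta terms can be sketched for a single query with one correct candidate, using the default weights from the argument table. The per-term formulas below (reciprocal rank, `1 / log2(1 + rank)` for NDCG, natural-log rank difference) are assumed standard forms, and any scaling or clipping in the repository's `rewards.py` may differ.

```python
import math
from typing import Dict, Sequence, Tuple

# Default weights as documented for --weight_mrr / ndcg / rank / margin.
DEFAULT_WEIGHTS: Dict[str, float] = {"mrr": 0.2, "ndcg": 0.2, "rank": 0.2, "margin": 0.4}


def rank_and_margin(pos_sim: float, neg_sims: Sequence[float]) -> Tuple[int, float]:
    """Rank of the positive (1-based) and its gap to the hardest negative.

    Requires at least one negative similarity.
    """
    rank = 1 + sum(s > pos_sim for s in neg_sims)
    margin = pos_sim - max(neg_sims)
    return rank, margin


def mixture_reward(base_pos: float, base_negs: Sequence[float],
                   crop_pos: float, crop_negs: Sequence[float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of crop-vs-baseline deltas (assumed formulas)."""
    r0, m0 = rank_and_margin(base_pos, base_negs)
    r1, m1 = rank_and_margin(crop_pos, crop_negs)
    deltas = {
        "mrr": 1.0 / r1 - 1.0 / r0,                               # reciprocal rank
        "ndcg": 1.0 / math.log2(1 + r1) - 1.0 / math.log2(1 + r0),  # single relevant item
        "rank": math.log(r0) - math.log(r1),                      # positive when rank improves
        "margin": m1 - m0,                                        # similarity gap change
    }
    return sum(weights[k] * deltas[k] for k in weights)
```

For example, if cropping moves the correct candidate from rank 2 to rank 1 and widens the margin, all four deltas are positive and so is the reward; if cropping changes nothing, the reward is zero.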
Evaluate a single checkpoint:
python src/evaluate_cropping.py --checkpoint runs/<run_name>/checkpoints/checkpoint-XXX

Or use the convenience wrapper:

python src/run_eval.py --checkpoint runs/<run_name>/checkpoints/checkpoint-XXX --num_samples 100

| Argument | Default | Description |
|---|---|---|
| `--checkpoint` | Final model | Path to checkpoint to evaluate |
| `--num_samples` | All | Number of validation samples |
| `--split` | `val` | Dataset split (val or test) |
| `--output` | Auto | Output CSV path |
| `--viz_dir` | Auto | Directory for improvement visualizations |
| `--wandb` | Off | Log results to W&B |
Example with shell script:
export CUDA_VISIBLE_DEVICES=7
bash src/run_eval.sh

| Method | E-VQA MRR | E-VQA R@1 | InfoSeek MRR | InfoSeek R@1 |
|---|---|---|---|---|
| EVA-CLIP | 0.224 | 14.2% | 0.553 | 46.3% |
| EchoSight | 0.402 | 36.5% | 0.586 | 53.2% |
| OMGM | 0.473 | 42.8% | 0.681 | 64.0% |
| Region-R1 | 0.473 | 44.7% | 0.706 | 66.5% |

| Method | E-VQA CondR@1 | InfoSeek CondR@1 |
|---|---|---|
| EchoSight | 0.75 | 0.68 |
| OMGM | 0.73 | 0.75 |
| Region-R1 | 0.90 | 0.81 |
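For readers who want to reproduce these kinds of numbers from per-query ranks, here is a small sketch of the three metrics. It assumes conditional Recall@1 means Recall@1 restricted to queries whose correct candidate appears in the retrieved list; that reading, and the `summarize` helper itself, are assumptions rather than the paper's stated definition.

```python
from typing import Dict, Optional, Sequence


def summarize(ranks: Sequence[Optional[int]]) -> Dict[str, float]:
    """Compute MRR, R@1, and (assumed) conditional R@1.

    ranks[i] is the 1-based rank of the correct candidate for query i,
    or None when it was not among the retrieved candidates.
    """
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    top1 = sum(r == 1 for r in found)
    return {
        # Queries with a missing positive contribute 0 to MRR and R@1.
        "mrr": sum(1.0 / r for r in found) / n if n else 0.0,
        "recall_at_1": top1 / n if n else 0.0,
        # Only queries where the positive was retrievable are counted.
        "cond_recall_at_1": top1 / len(found) if found else 0.0,
    }
```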
Region-R1/
├── main.py # Training entry point
├── config.py # Configuration (paths, hyperparameters, prompts)
├── utils.py # CLIP scoring, bbox parsing, MRR/NDCG metrics
└── src/
    ├── model.py # Model loading (Qwen2.5-VL-3B + LoRA, EVA-CLIP-8B)
    ├── data.py # Lazy-loading dataset from parquet
    ├── train.py # r-GRPO training loop
    ├── rewards.py # Reward functions (absolute, mixture)
    ├── inference.py # Post-training inference/testing
    ├── logging_config.py # Logging setup
    ├── prepare_data.py # Dataset preparation (InfoSeek or E-VQA)
    ├── evaluate_cropping.py # Checkpoint evaluation (MRR improvement)
    ├── evaluation_callback.py # Training callback for eval during training
    ├── training_logger_callback.py # Training callback for metrics logging
    ├── run_eval.py # Convenience evaluation wrapper
    └── run_eval.sh # Shell script for evaluation
All configuration is centralized in config.py:
| Setting | Description |
|---|---|
| `RUN_NAME` | Experiment name (determines output directory) |
| `MODEL_ID` | Base VLM model (Qwen/Qwen2.5-VL-3B-Instruct) |
| `EVACLIP_MODEL_ID` | Reward model (BAAI/EVA-CLIP-8B) |
| `TRAINING_CONFIG` | Learning rate, epochs, batch size, etc. |
| `LORA_CONFIG` | LoRA rank, alpha, dropout, target modules |
| `WANDB_CONFIG` | W&B project, entity, tags |
| `SYSTEM_PROMPT` | Prompt template for the cropping assistant |
If you find this work helpful, please consider citing our paper:
@article{hu2026region,
title={Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking},
author={Hu, Chan-Wei and Tu, Zhengzhong},
journal={arXiv preprint arXiv:2604.05268},
year={2026}
}