🪐 SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Overview

While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. In particular, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we have publicly released all code, models, and datasets.

Overall performance of SenseNova-MARS-8B compared to other models across six benchmarks.

SenseNova-MARS tackles challenging visual tasks by leveraging an integrated suite of text search, image search, and image crop tools within the reasoning process. The following is a demo example.

Release Information

Datasets

Datasets HuggingFace Google Drive
SenseNova-MARS-Data 🤗 link 📁 link

HR-MMSearch Benchmark

HR-MMSearch is the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions.

Benchmark HuggingFace
HR-MMSearch 🤗 link


Quick Start

Hardware Requirements

Our RL training setup for SenseNova-MARS requires 3 separate nodes with 8x NVIDIA H100 GPUs (80GB) each:

Node Purpose Services
Node 1 Training RL training with veRL framework
Node 2 Infrastructure Web Search Server (port 8000) + Local Wikipedia Database Server (port 8001) + Summarizer LLM (port 8123)
Node 3 LLM Judge Qwen3-VL-32B-Instruct judge server (port 8181)

Evaluation only: If you only want to run evaluation (not training), you need 2 nodes:

  • Node 1 for running the evaluation script
  • Node 2 for infrastructure services (Web Search, Local Database, Summarizer)

Installation

Step 1: Download Data

Download the required data from Google Drive:

Download Link

After downloading, extract and place the directories so your project structure looks like:

SenseNova-MARS/
├── wiki_20250901/              # Local Wikipedia database (download required)
├── Search-R1/                  # Local retrieval server (download required)
├── data/
│   ├── eval/                   # Evaluation datasets (download required)
│   └── train_qwen3_vl_8b/      # Training data (download required)
├── verl/                       # Modified veRL framework
├── web_search_server/          # Web search server
├── config/                     # Tool configurations
├── assets/                     # Images for documentation
├── Dockerfile
├── train_multi_node.sh
├── eval_single_node.sh
├── train_qwen3_vl_8b.json      # Training data manifest
├── test_subset.json            # Validation subset
├── test_all.json               # Complete test set
└── README.md

Step 2: Build Docker Environment

Build the Docker image:

docker build -t verl-mars:latest .

# If in China, use mirrors for faster download:
docker build --build-arg USE_MIRROR=true -t verl-mars:latest .

Step 3: Run Docker Container

docker run -it --gpus all --shm-size=64g \
    -v /path/to/SenseNova-MARS:/workspace/SenseNova-MARS \
    -p 8000:8000 -p 8001:8001 -p 8123:8123 -p 8181:8181 -p 8265:8265 \
    verl-mars:latest

# Inside container, navigate to the project directory
cd /workspace/SenseNova-MARS
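
Before proceeding, it can help to confirm that the container sees the GPUs and the mounted project directory. A minimal check, assuming the mount path used above:

# Verify GPU visibility inside the container
nvidia-smi

# Verify the project mount is in place
ls /workspace/SenseNova-MARS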

Infrastructure Setup

Before training or evaluation, you need to launch the following services:

1. Web Search Server (Text Search)

The web search server provides text search capabilities. It supports two modes:

  • Training: Uses local Wikipedia database (local_retrieval)
  • Validation/Evaluation: Uses Google Serper API (google_serper)

Getting a Serper API Key:

  1. Go to serper.dev and create an account
  2. Navigate to your dashboard to get your API key

Note: A free Serper account includes 2,500 queries. Training requires more than this, but you can test evaluation with the free tier first.

cd web_search_server

# Set environment variables
export WEBSEARCH_GOOGLE_SERPER_KEY="<YOUR_SERPER_API_KEY>"  # Get from https://serper.dev
export AZURE_OPENAI_API_KEY="<YOUR_AZURE_OPENAI_KEY>"
export WEB_SERVER_CONFIG_FILE="./config.json"
export WEB_SERVER_CACHE_DIR="./search_cache"
export SUMMARIZER_BASE_URL="http://localhost:8123/v1"

# Install dependencies (if not using Docker)
pip install -r requirements.txt
playwright install-deps
playwright install

# Start the server (default port 8000)
uvicorn server:app --host 0.0.0.0 --port 8000
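
Since the server is launched as a FastAPI app (server:app) via uvicorn, a quick way to confirm it is up is to hit the OpenAPI endpoints that FastAPI serves by default (assuming they have not been disabled in server.py):

# Should return the OpenAPI schema if the server started correctly
curl -s http://localhost:8000/openapi.json | head -c 200

# Interactive API docs are also available in a browser at http://localhost:8000/docs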

2. Local Wikipedia Database Server

The local database server provides fast local text retrieval using the Wikipedia dump (required for training).

Step 1: Set up Faiss Environment

# Create faiss conda environment
conda create -n faiss python=3.12 -y
conda activate faiss

# Install dependencies
conda install -c pytorch -c nvidia -c rapidsai -c conda-forge \
  libnvjitlink pytorch-cuda=12.4 pytorch transformers datasets \
  faiss-gpu-cuvs=1.12.0 numpy==1.26.4 uvicorn fastapi -y
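
Before building the index server, a quick sanity check that the faiss-gpu install can see your GPUs (a sketch; faiss.get_num_gpus() should report a nonzero count on a GPU node):

conda activate faiss
python -c "import faiss; print('faiss', faiss.__version__, '| GPUs visible:', faiss.get_num_gpus())"
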
Step 2: Launch Local Database Server

To set up the retrieval backend, we use the Search-R1 framework. You will need to download the repository and launch the local retrieval server to handle incoming search queries.

Clone the Repository (into the SenseNova-MARS project root so the paths below resolve; skip this step if Search-R1/ was already downloaded from Google Drive):

git clone https://github.com/PeterGriffinJin/Search-R1

Run the Server: From the SenseNova-MARS project root (which contains both Search-R1/ and wiki_20250901/), execute the following command to start the FAISS-based retrieval service:

conda activate faiss

python -u Search-R1/search_r1/search/retrieval_server.py \
  --index_path wiki_20250901/e5_Flat.index \
  --corpus_path wiki_20250901/formatted_wiki.jsonl \
  --topk 3 \
  --retriever_name e5 \
  --retriever_model intfloat/e5-large-v2 \
  --faiss_gpu \
  --port 8001
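
Once the server reports that it is listening, you can issue a test query against it. The example below assumes the standard Search-R1 /retrieve endpoint, which accepts a JSON body with a list of queries and a topk value; check Search-R1/search_r1/search/retrieval_server.py if your copy differs:

curl -s -X POST http://localhost:8001/retrieve \
  -H "Content-Type: application/json" \
  -d '{"queries": ["Who painted the Mona Lisa?"], "topk": 3, "return_scores": true}'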

3. Summarizer LLM Server

The summarizer LLM processes and summarizes search results. Launch using SGLang on 8x H100 GPUs (can run on the same node as the search servers):

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-32B \
    --served-model-name Qwen/Qwen3-32B \
    --host 0.0.0.0 --port 8123 \
    --dtype bfloat16 \
    --tp-size 4 --dp-size 2 \
    --mem-fraction-static 0.9 \
    --max-total-tokens 262144 \
    --max-prefill-tokens 65536 \
    --chunked-prefill-size 16384 \
    --max-running-requests 1024
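
SGLang exposes an OpenAI-compatible API, which is why SUMMARIZER_BASE_URL above points at http://localhost:8123/v1. A minimal smoke test against the chat completions endpoint (the prompt is only an example):

curl -s http://localhost:8123/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-32B",
        "messages": [{"role": "user", "content": "Summarize in one sentence: FAISS is a library for efficient similarity search over dense vectors."}],
        "max_tokens": 128
      }'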

4. LLM Judge Server (Required for RL Training)

The LLM judge evaluates model outputs during training:

  • Training: Uses Qwen3-VL-32B-Instruct (self-hosted via SGLang)
  • Validation: Uses GPT-4o via Azure OpenAI API

Azure OpenAI Setup: The validation judge uses Azure OpenAI client by default. You need an AZURE_OPENAI_API_KEY for validation scoring. If you prefer to use the standard OpenAI API instead, modify verl/verl/workers/reward_manager/tool.py to use openai.OpenAI client.

Launch the training judge using SGLang on 8x H100 GPUs:

export SGLANG_VLM_CACHE_SIZE_MB=8192

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-VL-32B-Instruct \
    --host 0.0.0.0 --port 8181 \
    --dtype bfloat16 \
    --served-model-name Qwen3-VL-32B-Instruct \
    --tp 4 --dp 2 \
    --mem-fraction-static 0.6 \
    --context-length 40960 \
    --max-running-requests 1024 \
    --chunked-prefill-size 2048 \
    --enable-torch-compile \
    --torch-compile-max-bs 64

Note: This configuration uses tp=4 (tensor parallel) and dp=2 (data parallel), requiring 8 GPUs total.
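
Since the judge is also served through SGLang's OpenAI-compatible API, it can be smoke-tested with a multimodal chat request once it is up. The image URL below is only a placeholder; the actual judging prompts are issued by the reward manager during training:

curl -s http://localhost:8181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-VL-32B-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."}
          ]
        }],
        "max_tokens": 64
      }'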


Training

Prerequisites

Ensure all infrastructure services are running (a quick reachability check is sketched after this list):

  • Web Search Server (port 8000)
  • Local Wikipedia Database Server (port 8001)
  • Summarizer LLM Server (port 8123)
  • LLM Judge Server (port 8181)
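
A minimal reachability check, using the same placeholders as train_multi_node.sh below (any HTTP response, including a 404, counts as reachable):

# Substitute your actual hosts (same placeholders as in train_multi_node.sh)
INFRA_SERVER_IP="<INFRA_SERVER_IP>"
LLM_JUDGE_SERVER_IP="<LLM_JUDGE_SERVER_IP>"

for addr in "$INFRA_SERVER_IP:8000" "$INFRA_SERVER_IP:8001" "$INFRA_SERVER_IP:8123" "$LLM_JUDGE_SERVER_IP:8181"; do
  curl -s -o /dev/null --connect-timeout 3 "http://$addr" \
    && echo "OK   $addr" \
    || echo "DOWN $addr"
done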

Configure Training

Edit train_multi_node.sh and set the following variables:

# ==================== USER CONFIGURATION (Edit these) ====================
export TEXT_SEARCH_ADDRESS="<INFRA_SERVER_IP>:8000"
export LOCAL_DATABASE_ADDRESS="<INFRA_SERVER_IP>:8001"
export AZURE_OPENAI_API_KEY="<YOUR_AZURE_OPENAI_KEY>"
export WANDB_API_KEY="<YOUR_WANDB_API_KEY>"
LLM_JUDGE_URL="<LLM_JUDGE_SERVER_IP>:8181"
# ==========================================================================

Single-Node Training (8 GPUs)

NODE_RANK=0 NNODES=1 bash train_multi_node.sh

Multi-Node Training

For distributed training across multiple nodes (e.g., 4 nodes):

# On head node (Node 0):
NODE_RANK=0 NNODES=4 bash train_multi_node.sh

# On worker nodes:
NODE_RANK=1 MASTER_ADDR=<head_node_ip> NNODES=4 bash train_multi_node.sh
NODE_RANK=2 MASTER_ADDR=<head_node_ip> NNODES=4 bash train_multi_node.sh
NODE_RANK=3 MASTER_ADDR=<head_node_ip> NNODES=4 bash train_multi_node.sh

Training logs are saved to logs/ and rollout data to rollout_data/.


Evaluation

Prerequisites

Ensure the following services are running:

  • Web Search Server (port 8000)
  • Local Wikipedia Database Server (port 8001)
  • Summarizer LLM Server (port 8123)

Configure Evaluation

Edit eval_single_node.sh:

# ==================== USER CONFIGURATION (Edit these) ====================
export MODEL_PATH="<YOUR_TRAINED_MODEL_PATH>"
export TEXT_SEARCH_ADDRESS="<INFRA_SERVER_IP>:8000"
export LOCAL_DATABASE_ADDRESS="<INFRA_SERVER_IP>:8001"
export AZURE_OPENAI_API_KEY="<YOUR_AZURE_OPENAI_KEY>"
# ==========================================================================

Run Evaluation

bash eval_single_node.sh

Results are saved to logs/ and rollout_data/.


Configuration Checklist

Before running training or evaluation, ensure you have:

  • Downloaded data from Google Drive
    • wiki_20250901/ directory
    • data/ directory with eval/ and train_qwen3_vl_8b/ subdirectories
  • Built Docker image (verl-mars:latest)
  • Obtained API keys:
    • WEBSEARCH_GOOGLE_SERPER_KEY - For web search during validation (serper.dev)
    • AZURE_OPENAI_API_KEY - For GPT-4o validation judge (uses Azure OpenAI client; modify code for standard OpenAI)
    • WANDB_API_KEY - For experiment tracking (optional but recommended)
  • Downloaded required models (or use HuggingFace model IDs for auto-download):
    • Qwen/Qwen3-VL-8B-Instruct - Base model for RL training
    • Qwen/Qwen3-32B - Summarizer LLM for search results
    • Qwen/Qwen3-VL-32B-Instruct - LLM judge for training reward
  • Launched infrastructure services:
    • Web Search Server (port 8000)
    • Local Wikipedia Database Server (port 8001)
    • Summarizer LLM Server (port 8123)
    • LLM Judge Server (port 8181) - for RL training only
  • Updated configuration in training/evaluation scripts

Logs

  • Training logs: logs/
  • Rollout data: rollout_data/
  • Web search server logs: web_search_server/logs/

Benchmark Performance

Search-oriented benchmarks

Type Model Average MMSearch HR-MMSearch FVQA-test InfoSeek SimpleVQA LiveVQA MAT-Search
Direct Answer
Open-source Qwen2.5-VL-7B-Instruct 27.70 7.60 0.58 26.28 31.95 47.88 19.63 60.00
Qwen2.5-VL-32B-Instruct 32.01 11.70 3.93 30.50 36.65 48.57 21.40 71.33
Qwen3-VL-8B-Instruct 29.24 11.70 12.13 24.22 23.15 42.94 23.18 67.33
Proprietary GPT-4o-mini 33.08 15.79 1.31 36.83 35.95 44.42 24.63 72.66
Gemini-2.5-Flash 40.87 21.64 7.54 43.78 44.10 55.48 31.57 82.00
GPT-4o 42.38 23.39 13.11 48.00 52.90 51.73 28.18 79.33
GPT-5 50.24 35.09 22.62 54.39 61.70 54.15 44.39 79.33
Gemini-3-Flash 53.68 57.31 21.97 56.50 63.57 54.85 38.90 82.67
Agentic Model (zero-shot)
Open-source Qwen2.5-VL-7B-Instruct 35.50 32.16 19.34 36.00 28.80 42.35 22.52 67.33
Qwen2.5-VL-32B-Instruct 53.45 49.71 33.44 52.22 50.10 65.15 42.17 81.33
Qwen3-VL-8B-Instruct 51.52 47.37 27.87 53.61 46.15 62.29 39.37 84.00
Proprietary GPT-4o-mini 45.65 38.60 26.23 50.00 42.35 50.84 31.54 80.00
Gemini-2.5-Flash 58.05 59.06 40.00 61.72 53.70 68.81 47.75 75.33
GPT-4o 55.09 49.12 30.16 66.34 59.55 63.67 40.09 76.67
GPT-5 60.12 52.63 38.36 62.61 70.58 55.95 56.02 84.67
Gemini-3-Flash 61.26 62.57 41.64 64.89 67.92 61.10 48.06 82.67
Agentic Model
Open-source Visual-ARFT 40.13 34.50 24.92 41.72 37.95 42.45 25.40 74.00
DeepMMSearch-R1 - - - - 47.51 55.87 - -
MMSearch-R1 52.49 53.80 20.33 58.40 55.10 57.40 48.40 74.00
DeepEyesV2 - 63.70 - 60.60 51.10 59.40 - -
SenseNova-MARS-8B 64.20 67.84 41.64 67.11 70.19 61.70 56.22 84.67

High-resolution Benchmarks

Model V* Bench HR-Bench 4K HR-Bench 8K MME-RealWorld Avg.
Direct Answer
GPT-4o 67.5 65.0 59.6 62.8 63.7
LLaVA-OneVision 75.4 63.0 59.8 57.4 63.9
Qwen2.5-VL-7B-Instruct 75.3 65.5 62.1 56.8 64.9
Qwen2.5-VL-32B-Instruct 80.6 69.3 63.6 59.1 68.2
Qwen3-VL-8B-Instruct 86.4 78.9 74.6 61.9 75.5
Agentic Model
SEAL 74.8 - - - -
Monet 83.3 71.0 68.0 - -
Pixel-Reasoner 84.3 72.6 66.1 64.4 71.9
DeepEyes 83.3 73.2 69.5 64.1 72.5
Thyme 82.2 77.0 72.0 64.8 74.0
DeepEyesV2 81.8 77.9 73.8 64.9 74.6
Mini-o3 88.2 77.5 73.3 65.5 76.1
SenseNova-MARS-8B 92.2 83.1 78.4 67.9 80.4

Acknowledgements

We would like to thank the following projects for their contributions to the development of SenseNova-MARS:

Citation

@article{SenseNova-MARS,
  title={SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning},
  author={Yong Xien Chng and Tao Hu and Wenwen Tong and Xueheng Li and Jiandong Chen and Haojia Yu and Jiefan Lu and Hewei Guo and Hanming Deng and Chengjun Xie and Gao Huang and Dahua Lin and Lewei Lu},
  journal={arXiv preprint arXiv:2512.24330},
  year={2025}
}
