
Reasoning in Space via Grounding in the World

We present Grounded-Spatial Reasoner (GS-Reasoner), the first 3D-LLM that bridges 3D visual grounding and spatial reasoning.

Our model achieves state-of-the-art performance on VSI-Bench! 🚀

Yiming Chen, Zekun Qi, Wenyao Zhang, Xin Jin, Li Zhang, Peidong Liu

Project Page Paper PDF Hugging Face Code License Data License

Setup

conda create -n gs-reasoner python=3.11 -y
conda activate gs-reasoner

git clone git@github.com:WU-CVGL/GS-Reasoner.git
cd GS-Reasoner

# install package for GS-Reasoner
pip install -e .

# (optional) system libraries required by opencv-python
sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libsm6 libxext6 libxrender1

# (optional) install gcc
conda install -c conda-forge gcc=13.2 gxx=13.2 -y

# (optional) install cuda toolkit 12.4
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4

# install package for Sonata
cd llava/submodules/sonata
pip install -r requirements.txt
cd ../../..

# install package for VSI-Bench Evaluation
cd llava/submodules/lmms_eval
pip install -r requirements.txt
cd ../../..
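
To verify the environment after installation, a quick sanity check like the one below can help. This is a minimal sketch: the llava module name is inferred from the repository layout, and it assumes PyTorch is pulled in by the install.

# sanity check (the `llava` module name is inferred from the repo layout)
python -c "import torch, llava; print(torch.__version__, torch.cuda.is_available())"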

Model Weights

We provide two pretrained model checkpoints:

  • GS-Reasoner – the main model used in our paper, producing more deterministic chain-of-thought reasoning.
  • GS-Reasoner-Diverse – a variant that generates more diverse chain-of-thought outputs at the cost of a minor performance drop (less than 1.0 points on VSI-Bench).

To use them, download the checkpoints and place them under the data/models directory.
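
For example, the checkpoints can be fetched with the Hugging Face CLI. The repository IDs below are placeholders, not the actual paths; substitute the IDs from the Hugging Face links above.

# example download (replace <org> with the actual Hugging Face organization)
huggingface-cli download <org>/GS-Reasoner --local-dir data/models/GS-Reasoner
huggingface-cli download <org>/GS-Reasoner-Diverse --local-dir data/models/GS-Reasoner-Diverse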

Data Preprocessing

For data preprocessing, please refer to this instruction for details.

VSI-Bench Evaluation Data

In addition, we provide the depth maps, camera parameters, axis-alignment matrices, and global scale factors (for metric evaluation) estimated with VGGT-SLAM and MoGe-2. The processed data are available at ymccccc/vggt_slam_processed_data and are used for VSI-Bench evaluation. To use them, download the evaluation data and place it under the data/ directory. Before running the evaluation, create a soft link to the color directory of each scene dataset as follows:

# example: link the color directory for one scene dataset
# (symlink targets are resolved relative to the link's directory, so the
# target below assumes both trees live under data/ as described above)
cd data/vggt_slam_processed_data/arkitscenes
ln -s ../../processed_data/arkitscenes/color ./
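
To link every scene dataset in one pass, a loop along the following lines should work, assuming the dataset directory names match between the two trees.

# run from the repository root; links the color directory for all scene datasets
for ds in data/vggt_slam_processed_data/*/; do
    name=$(basename "$ds")
    ln -sfn "../../processed_data/${name}/color" "${ds}color"
done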

Training Data

The training data, including 3D visual grounding, general 3D tasks, and the Grounded Chain-of-Thought (GCoT) dataset, are available here. To use them, download the data and place them under the data/ directory. Note: we replace the appearance-order data in GCoT with that provided by ViCA, which we find leads to slightly improved performance. We sincerely thank the ViCA authors for their excellent open-source work.
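
As with the model weights, the dataset can be fetched with the Hugging Face CLI. The repository ID below is a placeholder for the link above, not the actual path.

# example (replace <org>/<dataset> with the repository from the link above)
huggingface-cli download <org>/<dataset> --repo-type dataset --local-dir data/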

Training

GS-Reasoner is trained end-to-end for next-token prediction. We first pretrain the model on subsets of 3D visual grounding datasets — including ScanRefer, Multi3DRef, SR3D, and NR3D, among others — to warm up its object grounding capability. Subsequently, we finetune the model on our proposed GCoT dataset, the remaining grounding data, and other 3D reasoning tasks such as ScanQA, SQA3D, and Scan2Cap.

To train GS-Reasoner, first prepare all data following the instructions in Data Preprocessing. Then download the pretrained checkpoints for the following models and place them under data/models: LLaVA-Video-7B-Qwen2, Sonata, and SigLIP.
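
For reference, these backbones are hosted on Hugging Face. The repository IDs below are assumptions about which public releases and variants the repo uses, so verify them against the links above before downloading.

# example (repository IDs are illustrative; verify against the official releases)
huggingface-cli download lmms-lab/LLaVA-Video-7B-Qwen2 --local-dir data/models/LLaVA-Video-7B-Qwen2
huggingface-cli download facebook/sonata --local-dir data/models/sonata
huggingface-cli download google/siglip-so400m-patch14-384 --local-dir data/models/siglip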

Once the setup is complete, run the following scripts to train GS-Reasoner:

# Pretrain
bash train_multi_pretrain.sh

# Finetune
bash train_multi_finetune.sh

Evaluation

Run the following scripts to evaluate GS-Reasoner:

# evaluate on VSI-Bench
bash eval_vsibench.sh

# evaluate on 3D visual grounding
bash eval_3d_vg.sh

# evaluate on general 3D tasks
bash eval_general_3d.sh

TODO

  • Release detailed data preprocessing scripts.
  • Release the Grounded Chain-of-Thought dataset.
  • Release processed data for general 3D tasks (e.g., ScanRefer, ScanQA, etc.).
  • Release the training scripts.

Acknowledgements

Citation

@misc{chen2025reasoningspacegroundingworld,
      title={Reasoning in Space via Grounding in the World}, 
      author={Yiming Chen and Zekun Qi and Wenyao Zhang and Xin Jin and Li Zhang and Peidong Liu},
      year={2025},
      eprint={2510.13800},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13800}, 
}
