Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. It is formulated as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the original visual arrangement by producing the correct permutation in natural language. We provide instantiations of Visual Jigsaw for three visual modalities: images, videos, and 3D data.
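To make the task concrete, here is a minimal sketch of how an image-jigsaw sample could be constructed: an image is cut into a grid of tiles, the tiles are shuffled, and the target answer is the permutation that restores the original layout. The grid size, the `make_image_jigsaw` helper, and the answer format are illustrative assumptions, not the exact recipe used in this repository.

```python
# Illustrative sketch only: grid size, helper name, and answer convention
# are assumptions, not the exact data recipe used by Visual Jigsaw.
import random
from PIL import Image

def make_image_jigsaw(image_path: str, grid: int = 2):
    """Split an image into grid x grid tiles, shuffle them, and return the
    shuffled tiles plus the permutation the model must output."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    tw, th = w // grid, h // grid
    # Tiles in the original raster order (row by row).
    tiles = [
        img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
        for r in range(grid) for c in range(grid)
    ]
    order = list(range(len(tiles)))
    random.shuffle(order)
    shuffled_tiles = [tiles[i] for i in order]
    # One possible answer convention: for each shuffled tile, in presentation
    # order, give its 1-based index in the original raster layout.
    answer = ", ".join(str(i + 1) for i in order)
    return shuffled_tiles, answer
```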
We release the following models, trained with Visual Jigsaw starting from Qwen2.5-VL-7B-Instruct:
Visual Jigsaw Image 7B: Qwen2.5-VL-7B-Instruct trained with image jigsaw
Visual Jigsaw Video 7B: Qwen2.5-VL-7B-Instruct trained with video jigsaw
Visual Jigsaw 3D 7B: Qwen2.5-VL-7B-Instruct trained with 3D jigsaw
Our models are based on Qwen2.5-VL-7B-Instruct, so you can use the same inference code as the base model (a sketch follows below).
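Below is a minimal inference sketch using the standard Hugging Face Transformers interface for Qwen2.5-VL; the repository ID and image path are placeholders, not confirmed paths.

```python
# Minimal inference sketch, assuming a checkpoint released in Qwen2.5-VL format;
# the repo ID and image path below are placeholders.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "path/to/Visual-Jigsaw-Image-7B"  # placeholder repo ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/example.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```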
The training data for Visual Jigsaw can be downloaded at visual_jigsaw_training_data.
For the image jigsaw task, we use the images from COCO 2017 training split.
For the video jigsaw task, we use the videos from LLaVA-Video.
For the 3D jigsaw task, we use the RGB images from ScanNet.
For training, you also need to download the raw data from the source datasets listed above.
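For the video case, the same recipe applies along the temporal axis. The sketch below shows one way such a sample could be assembled from decoded frames; the segment count, helper name, and answer convention are assumptions for illustration only.

```python
# Illustrative sketch of a video-jigsaw sample: frames are grouped into
# temporal segments, the segments are shuffled, and the target answer is the
# original temporal order. Details are assumptions, not the released setup.
import random

def make_video_jigsaw(frames: list, num_segments: int = 4):
    """frames: decoded frames (or frame paths) in temporal order."""
    seg_len = len(frames) // num_segments
    segments = [
        frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)
    ]
    order = list(range(num_segments))
    random.shuffle(order)
    shuffled_segments = [segments[i] for i in order]
    # For each shuffled segment, in presentation order, give its 1-based
    # index in the original temporal order.
    answer = ", ".join(str(i + 1) for i in order)
    return shuffled_segments, answer
```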
The training scripts for Visual Jigsaw are provided in train_scripts/.
For evaluation, please see the guidelines in eval.md.
This project is under the Apache-2.0 license. See LICENSE for details.
Please consider citing our paper if you find this project helpful for your research:
@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}

This work is built upon verl.
