Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. It is formulated as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the original visual arrangement by producing the correct permutation in natural language. We provide instantiations of Visual Jigsaw for three visual modalities: images, videos, and 3D data.
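To make the task concrete, here is a minimal sketch of how an image-jigsaw sample could be constructed: an image is cut into a grid of tiles, the tiles are shuffled, and the target answer is the permutation that restores the original layout. The grid size, the `make_image_jigsaw` helper, and the answer format are illustrative assumptions, not the exact recipe used in this repository.

```python
# Illustrative sketch only: grid size, helper name, and answer convention
# are assumptions, not the exact data recipe used by Visual Jigsaw.
import random
from PIL import Image

def make_image_jigsaw(image_path: str, grid: int = 2):
    """Split an image into grid x grid tiles, shuffle them, and return the
    shuffled tiles plus the permutation the model must output."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    tw, th = w // grid, h // grid
    # Tiles in the original raster order (row by row).
    tiles = [
        img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
        for r in range(grid) for c in range(grid)
    ]
    order = list(range(len(tiles)))
    random.shuffle(order)
    shuffled_tiles = [tiles[i] for i in order]
    # One possible answer convention: for each shuffled tile, in presentation
    # order, give its 1-based index in the original raster layout.
    answer = ", ".join(str(i + 1) for i in order)
    return shuffled_tiles, answer
```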
We release the following models, trained with Visual Jigsaw starting from Qwen2.5-VL-7B-Instruct:
Visual Jigsaw Image 7B: Qwen2.5-VL-7B-Instruct trained with image jigsaw
Visual Jigsaw Video 7B: Qwen2.5-VL-7B-Instruct trained with video jigsaw
Visual Jigsaw 3D 7B: Qwen2.5-VL-7B-Instruct trained with 3D jigsaw
Our models are based on Qwen2.5-VL-7B-Instruct, so you can use the same inference code as the base model (a sketch follows below).
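Below is a minimal inference sketch using the standard Hugging Face Transformers interface for Qwen2.5-VL; the repository ID and image path are placeholders, not confirmed paths.

```python
# Minimal inference sketch, assuming a checkpoint released in Qwen2.5-VL format;
# the repo ID and image path below are placeholders.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "path/to/Visual-Jigsaw-Image-7B"  # placeholder repo ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/example.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```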
The training data for Visual Jigsaw can be downloaded at visual_jigsaw_training_data.
For the image jigsaw task, we use the images from COCO 2017 training split.
For the video jigsaw task, we use the videos from LLaVA-Video.
For the 3D jigsaw task, we use the RGB images from ScanNet.
For training, you also need to download the raw data from the source datasets listed above.
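For the video case, the same recipe applies along the temporal axis. The sketch below shows one way such a sample could be assembled from decoded frames; the segment count, helper name, and answer convention are assumptions for illustration only.

```python
# Illustrative sketch of a video-jigsaw sample: frames are grouped into
# temporal segments, the segments are shuffled, and the target answer is the
# original temporal order. Details are assumptions, not the released setup.
import random

def make_video_jigsaw(frames: list, num_segments: int = 4):
    """frames: decoded frames (or frame paths) in temporal order."""
    seg_len = len(frames) // num_segments
    segments = [
        frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)
    ]
    order = list(range(num_segments))
    random.shuffle(order)
    shuffled_segments = [segments[i] for i in order]
    # For each shuffled segment, in presentation order, give its 1-based
    # index in the original temporal order.
    answer = ", ".join(str(i + 1) for i in order)
    return shuffled_segments, answer
```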
The training scripts for Visual Jigsaw are provided in train_scripts/.
For evaluation, please see the guidelines in eval.md.
This project is under the Apache-2.0 license. See LICENSE for details.
Please consider citing our paper if you find this project helpful for your research:
@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}

This work is built upon verl.
