penghao-wu/visual_jigsaw

Visual Jigsaw

Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. It is formulated as a general ordering task: the visual input is partitioned into pieces, the pieces are shuffled, and the model must reconstruct the original arrangement by producing the correct permutation in natural language. We provide instantiations of Visual Jigsaw for three visual modalities: images, videos, and 3D data.
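The partition, shuffle, and reorder steps described above can be illustrated with a small sketch. This is not the code from this repository: the helper names (`split_grid`, `make_jigsaw`, `format_answer`), the 2x2 grid, the 1-based comma-separated answer format, and the use of a nested list as a stand-in for an image are all assumptions made for illustration.

```python
import random


def split_grid(image, rows, cols):
    """Partition a 2D list (stand-in for an image) into rows*cols patches,
    returned in row-major order. Hypothetical helper, not from the repo."""
    height, width = len(image), len(image[0])
    ph, pw = height // rows, width // cols
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [row[c * pw:(c + 1) * pw]
                     for row in image[r * ph:(r + 1) * ph]]
            patches.append(patch)
    return patches


def make_jigsaw(pieces, seed=None):
    """Shuffle the pieces and return (shuffled, answer).

    answer[j] is the index in `shuffled` of the piece that originally sat
    at position j, so reading `shuffled` in the order given by `answer`
    restores the original sequence."""
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)
    shuffled = [pieces[i] for i in order]
    answer = sorted(range(len(order)), key=lambda k: order[k])
    return shuffled, answer


def format_answer(answer):
    """Render the permutation as 1-based text (assumed answer format)."""
    return ", ".join(str(i + 1) for i in answer)


if __name__ == "__main__":
    # A 4x4 "image" whose pixel values encode their own position.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    patches = split_grid(image, rows=2, cols=2)
    shuffled, answer = make_jigsaw(patches, seed=0)
    restored = [shuffled[k] for k in answer]
    print("target answer:", format_answer(answer))
    print("restored matches original:", restored == patches)
```

The same `make_jigsaw` helper applies unchanged to a list of video clips or any other ordered pieces, since it only depends on their positions. Because the target permutation is derived from the shuffle itself, no human annotation is needed.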

Figure: Overview of Visual Jigsaw

Model Checkpoints

We release the following models trained with Visual Jigsaw from Qwen2.5-VL-7B-Instruct:

  • Visual Jigsaw Image 7B: Qwen2.5-VL-7B-Instruct trained with image jigsaw
  • Visual Jigsaw Video 7B: Qwen2.5-VL-7B-Instruct trained with video jigsaw
  • Visual Jigsaw 3D 7B: Qwen2.5-VL-7B-Instruct trained with 3D jigsaw

Our models are based on Qwen2.5-VL-7B-Instruct, so inference code written for Qwen2.5-VL-7B-Instruct works with them unchanged.

Visual Jigsaw Training Data

The training data for Visual Jigsaw can be downloaded from visual_jigsaw_training_data.

  • For the image jigsaw task, we use images from the COCO 2017 training split.
  • For the video jigsaw task, we use videos from LLaVA-Video.
  • For the 3D jigsaw task, we use RGB images from ScanNet.

To train, you also need to download the source images and videos from the datasets listed above.

Training

The training scripts for Visual Jigsaw are provided in train_scripts/.

Evaluation

For evaluation, see the guidelines in eval.md.

License

This project is under the Apache-2.0 license. See LICENSE for details.

Citation

Please consider citing our paper if you find this project helpful for your research:

@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}

Acknowledgement

  • This work is built upon verl.
