Weixin Ye1, Wei Wang1†, Yahui Liu2, Yue Song3, Bin Ren4, Wei Bi5, Rita Cucchiara6, Nicu Sebe4
†: Corresponding Author
1Beijing Jiaotong University, Beijing, China
2Kuaishou, Beijing, China
3Caltech, USA
4University of Trento, Italy
5Hong Kong University of Science and Technology, Hong Kong, China
6University of Modena and Reggio Emilia, Modena, Italy
The main idea of the random jigsaw shuffle algorithm and an overview of the proposed MJP for image and text data.
This repository provides the official PyTorch implementation of our paper.
🔍 News (Oct 5, 2025): Our paper has been accepted by IEEE TPAMI 2025!
In federated learning, the Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradients of the Position Embeddings (PEs) in Transformers contain sufficient information to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable unknown (unk) position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information encoded in the PEs is disrupted, and the models are forced to learn feature representations that rely less on local spatial information. Notably, with careful use of MJP, we can not only improve models' robustness against gradient attacks but also boost their performance in both vision and text application scenarios, such as image classification (e.g., ImageNet-1K) and text sentiment analysis (e.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models across vision and language tasks.
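As a minimal sketch of the idea (not the official implementation; the token count, masking ratio, and module layout below are illustrative), MJP can be applied to ViT-style patch tokens roughly as follows:

```python
import torch
import torch.nn as nn

class MJPEmbedding(nn.Module):
    """Illustrative sketch: random jigsaw shuffle + masked position embeddings."""

    def __init__(self, num_tokens=196, dim=768, mask_ratio=0.5):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)  # standard PEs
        self.unk_pos = nn.Parameter(torch.zeros(1, dim))  # shared learnable "unk" PE
        self.mask_ratio = mask_ratio  # fraction of tokens to shuffle (assumed value)

    def forward(self, tokens):
        # tokens: (B, N, D) patch embeddings; [CLS] handling is omitted here
        B, N, _ = tokens.shape
        n_shuf = int(N * self.mask_ratio)
        pos = self.pos_embed.expand(B, -1, -1).clone()
        out = tokens.clone()
        for b in range(B):
            idx = torch.randperm(N, device=tokens.device)[:n_shuf]  # tokens to jigsaw-shuffle
            perm = idx[torch.randperm(n_shuf, device=tokens.device)]
            out[b, idx] = tokens[b, perm]   # break the token order among the selected tokens
            pos[b, idx] = self.unk_pos      # mask out their PEs with the unk embedding
        return out + pos
```

For example, `MJPEmbedding()(torch.randn(2, 196, 768))` shuffles half of the 196 patch tokens per image and replaces their PEs with the shared unk embedding before the sum is fed to the Transformer blocks.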
Our pretrained checkpoints are available for download.
Prepare the Yelp and Amazon datasets (`.csv` files).
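A quick way to sanity-check the prepared CSVs, assuming the common headerless two-column `label,text` layout of the Yelp/Amazon polarity releases (actual paths and column order may differ):

```python
import pandas as pd

# Assumed layout: headerless CSV with a label column followed by the review
# text, as in the yelp_review_polarity / amazon_review_polarity releases.
df = pd.read_csv("data/yelp/train.csv", header=None, names=["label", "text"])
print(df["label"].value_counts())  # class balance
print(df["text"].iloc[0][:200])    # peek at the first review
```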
Train the model:

```sh
sh train_ngram.sh
```

Then run the gradient attack:

```sh
sh attack.sh
```
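Under the hood, gradient attacks of this kind typically follow the DLG recipe: optimize a dummy input so that its gradients match the gradients observed from the victim model. A minimal sketch is below; it is illustrative only (`attack.sh` runs the paper's actual attack pipeline), and the labels `y` are assumed to be known to the attacker:

```python
import torch
import torch.nn.functional as F

def invert(model, true_grads, x_shape, y, steps=300, lr=0.1):
    """DLG-style gradient inversion sketch (hypothetical helper, not from the repo)."""
    dummy_x = torch.randn(x_shape, requires_grad=True)  # attacker's guess at the input
    opt = torch.optim.Adam([dummy_x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(dummy_x), y)
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # Match the observed gradients; the PE gradients are the most
        # informative terms, which is exactly what MJP disrupts.
        g_loss = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
        g_loss.backward()
        opt.step()
    return dummy_x.detach()
```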
```bibtex
@ARTICLE{11202736,
  author={Ye, Weixin and Wang, Wei and Liu, Yahui and Song, Yue and Ren, Bin and Bi, Wei and Cucchiara, Rita and Sebe, Nicu},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models},
  year={2025},
  volume={},
  number={},
  pages={1-17},
  keywords={Transformers;Privacy;Natural language processing;Principal component analysis;Computer vision;Computational modeling;Training;Data privacy;Three-dimensional displays;Image reconstruction;Masked Jigsaw Puzzle;Natural Language Processing;Computer Vision;Gradient Inversion;Position Embedding},
  doi={10.1109/TPAMI.2025.3621246}
}
```