
[EMNLP'24] Autoregressive Pre-Training on Pixels and Texts


Models | Datasets | Paper | EMNLP 2024

The official repository containing the code and model checkpoints for our paper Autoregressive Pre-Training on Pixels and Texts (EMNLP 2024).

🔥 News

  • 21 September 2024: 🎉 Our work has been accepted to EMNLP 2024! 🎉
  • 1 May 2024: 🎉 We released the official codebase and model weights of PixelGPT, MonoGPT, and DualGPT. Stay tuned! 🔥

Harnessing visual texts represents a burgeoning frontier in the evolution of language modeling. In this paper, we introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-training on a corpus of over 400 million documents rendered as RGB images. Our approach is characterized by a dual-modality training regimen, engaging both visual data through next patch prediction with a regression head and textual data via next token prediction with a classification head. This study is particularly focused on investigating the synergistic interplay between visual and textual modalities of language. Our comprehensive evaluation across a diverse array of benchmarks reveals that the confluence of visual and textual data substantially augments the efficacy of pixel-based language models. Notably, our findings show that a unidirectional pixel-based model, devoid of textual data during training, can match the performance levels of advanced bidirectional pixel-based models on various language understanding benchmarks. This work highlights the considerable untapped potential of integrating visual and textual information for language modeling purposes. We will release our code, data, and checkpoints to inspire further research advancement.

📕 Requirements

To set up the environment and install dependencies, run:

bash run_requirements.sh

📚 Fine-tuning Data

We fine-tune PixelGPT on the rendered GLUE and XNLI datasets. These rendered versions are publicly available at baidu/rendered_GLUE and baidu/rendered_xnli. After downloading the datasets from HuggingFace, extract them locally:

# Extract rendered GLUE
tar -xvf rendered_glue.tar

# Extract rendered XNLI
tar -xvf rendered_xnli.tar

For the rendered GLUE dataset, the extracted files contain multiple tasks. Each task has a corresponding training set, validation set, and test set. Note that for the MNLI task, both the validation and test sets contain matched and mismatched versions. You will need to assign the local paths of these task datasets to the --train_file, --validation_file, and --test_file parameters in the fine-tuning script. For the rendered XNLI dataset, assign the local dataset path to the --data_file_dir parameter in the corresponding fine-tuning script.
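As an illustration of how these paths plug in, the sketch below shows the kind of invocation the fine-tuning scripts wrap for MNLI. The entry point name (train_glue.py) and the exact file names inside the extracted archive are hypothetical placeholders; check the released scripts under run/ for the real values.

# Hypothetical sketch only: the real entry point and data file names are defined
# in the released fine-tuning scripts (e.g. run/pixel_gpt/ft_pixel_gpt_mnli.sh).
DATA_DIR=/path/to/rendered_glue/mnli

python train_glue.py \
  --train_file "${DATA_DIR}/train" \
  --validation_file "${DATA_DIR}/validation_matched" \
  --test_file "${DATA_DIR}/test_matched"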

📌 Pre-trained Models

We pre-trained three models: PixelGPT, MonoGPT, and DualGPT. We release the checkpoints used in our experiments, which can be downloaded from baidu/PixelGPT, baidu/MonoGPT, and baidu/DualGPT. Before running the fine-tuning scripts below, download the corresponding pre-trained model from the model repositories above and place its files in the pre-trained model directory, e.g. pretrained_models/PixelGPT, which is the path passed to the scripts.
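If the checkpoints are hosted as standard Hugging Face model repositories, one convenient way to fetch them is with the huggingface-cli tool. This is a sketch under that assumption, not the repository's prescribed download method.

# Assumes standard Hugging Face model repos and `pip install -U huggingface_hub`.
huggingface-cli download baidu/PixelGPT --local-dir pretrained_models/PixelGPT
huggingface-cli download baidu/MonoGPT --local-dir pretrained_models/MonoGPT
huggingface-cli download baidu/DualGPT --local-dir pretrained_models/DualGPT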

🚀 Fine-tuning

Our main fine-tuning experiments were performed on rendered GLUE and XNLI. The scripts to run the experiments are given below.

GLUE

For example, to fine-tune on the MNLI task:

PixelGPT

bash run/pixel_gpt/ft_pixel_gpt_mnli.sh pretrained_models/PixelGPT

MonoGPT

# Text-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pixel.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pair.sh pretrained_models/MonoGPT

DualGPT

# Text-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pixel.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pair.sh pretrained_models/DualGPT

XNLI Training

We evaluated XNLI in two settings: (1) Translate-train-all, where the model is fine-tuned on a combination of English data and machine-translated data in 14 other languages; and (2) Cross-lingual Transfer, where the model is fine-tuned only on English data and evaluated on multiple languages.

1. Translate-train-all

PixelGPT
bash run/cross_lingual/xnli/train_all/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT
MonoGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT
DualGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT

2. Cross-lingual Transfer

PixelGPT
bash run/cross_lingual/xnli/train_en/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT
MonoGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT
DualGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT

Citation

@inproceedings{chai-etal-2024-autoregressive,
    title = "Autoregressive Pre-Training on Pixels and Texts",
    author = "Chai, Yekun  and
      Liu, Qingyi  and
      Xiao, Jingwu  and
      Wang, Shuohuan  and
      Sun, Yu  and
      Wu, Hua",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.182",
    pages = "3106--3125",
    abstract = "The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language{---}both visual and textual{---}within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.",
}
