Language modeling that treats text as images, leveraging visual structure for enhanced understanding.

Image Latent Transformer (ILT)


See Motivational Examples

We present a language model that replaces conventional subword tokenization with a dual representation of text as both bytes and rendered images of "words", allowing the model to directly process the visual and symbolic structure of language. It segments text into whitespace-delimited units, encodes each unit as both an image and a byte sequence, and passes them through an encoder–decoder pipeline: a bytes encoder, an image encoder, a large latent transformer, and a bytes decoder. At inference time, predicted bytes are rendered back into images, closing the loop for the next prediction. This design could make learning and inference cheaper and faster for non-English languages, since the heavy latent transformer only predicts high-level token representations, while the actual byte sequences are generated by a much smaller decoder.
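The input-side preparation can be sketched in a few lines. This is a minimal illustration of the segmentation and byte-encoding steps only; the function names are illustrative, not the repo's API, and the image-rendering step (which in the actual code uses the system's font stack) is omitted.

```python
def segment(text: str) -> list[str]:
    """Split text into whitespace-delimited units ("words")."""
    return text.split()


def encode_bytes(units: list[str]) -> list[list[int]]:
    """Encode each unit as its raw UTF-8 byte values."""
    return [list(unit.encode("utf-8")) for unit in units]


units = segment("Hello world")       # ["Hello", "world"]
byte_seqs = encode_bytes(units)      # byte values per unit
```

Each unit would then also be rendered as an image, giving the model the two parallel views described above.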

Model Architecture

Quick Start

Clone and setup:

git clone https://github.com/sign/image-latent-transformer.git
cd image-latent-transformer

Install dependencies:

conda create -n ilt python=3.12 -y
conda activate ilt
conda install -c conda-forge pycairo pygobject manimpango -y
pip install ".[dev]"

Or using docker:

docker build -t ilt .
docker run -it --rm ilt /bin/bash

Tip

Run tests using pytest to ensure everything is working correctly.

Model Setup

  • Bytes Encoder - You can use any language model as the bytes encoder (causal or masked).
  • Image Encoder - You can use any image encoder.
  • Latent Transformer - You can use any causal LM (recommended: large).
  • Bytes Decoder - You can use any causal LM (recommended: small).

For the language-model components, the effective parameter count is lower than officially reported, because we remove their embedding layers.
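To get a feel for the size of this reduction, consider a GPT-2-small-like model; the figures below are rough, commonly cited numbers for illustration, not exact counts from this repo.

```python
# Removing a token-embedding matrix of shape (vocab_size, hidden_size)
# from a GPT-2-small-like model (~124M parameters, vocab 50257, hidden 768)
# cuts roughly 38M parameters.
vocab_size, hidden_size = 50257, 768
embedding_params = vocab_size * hidden_size   # 38,597,376
total_params = 124_000_000                    # commonly reported size
effective = total_params - embedding_params   # ~85.4M remain
print(f"{effective / 1e6:.1f}M parameters without the embedding layer")
```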

Our implementation allows for any mix-and-match. Some example setups are:

Name    Bytes Encoder           Image Encoder                           Latent Transformer   Bytes Decoder        Total Parameters
tiny    bert-tiny (0.5m)        vit-tiny-patch16-224 (5m)               pythia-70m (19m)     tiny-lm (3m)         28m
small   ModernBERT-base (111m)  swinv2-tiny-patch4-window16-256 (27m)   gemma-3-270m (100m)  SmolLM2-135M (106m)  346m
medium  deberta-v3-large (303m) clip-vit-base-patch32 (87m)             Llama-3.2-1B (973m)  gpt2-medium (304m)   1,674m

To turn off bytes encoding, set bytes_encoder=False; to turn off image encoding, set image_encoder=False. You can also disable a specific encoder after training has completed, for testing purposes.
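A hypothetical configuration object shows how the mix-and-match choices from the table might be expressed; this is an illustrative sketch, not the repo's real interface (see training/README.md for that), and the model names are the short names from the "tiny" row.

```python
from dataclasses import dataclass


@dataclass
class ILTConfig:
    # Defaults mirror the "tiny" row of the table above; any compatible
    # checkpoints could be substituted. Setting an encoder field to False
    # disables that modality, as described in the text.
    bytes_encoder: object = "bert-tiny"             # or False to disable
    image_encoder: object = "vit-tiny-patch16-224"  # or False to disable
    latent_transformer: object = "pythia-70m"
    bytes_decoder: object = "tiny-lm"


tiny = ILTConfig()
bytes_only = ILTConfig(image_encoder=False)  # ablate the image encoder
```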

Warning

In the implementation of the bytes decoder, we concatenate the embeddings of the current token's bytes with the embeddings of all previous tokens (at the word level). We do this because not all causal LMs support cross-attention; we therefore avoid it and rely on the self-attention mechanism instead.
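The shape of this concatenation can be sketched as follows; the dimensions and variable names are illustrative, not taken from the repo's code.

```python
import numpy as np

hidden = 16
prev_word_latents = np.random.randn(5, hidden)  # latents for 5 previous words
cur_byte_embeds = np.random.randn(7, hidden)    # embeddings of the current word's 7 bytes

# Concatenate along the sequence axis, so a plain causal self-attention
# decoder can attend to the word-level context without needing
# cross-attention layers.
decoder_inputs = np.concatenate([prev_word_latents, cur_byte_embeds], axis=0)
assert decoder_inputs.shape == (12, hidden)
```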

Training

Training instructions are available in the training/README.md. There, you can select the model architectures you want to use for each component, and the dataset you want to train on.

Inference

Caution

Our text renderer relies on the computer's font rendering capabilities, so rendering on different systems may yield different results (e.g., for emoji). We call on the community to create a more robust renderer, decoupled from the system's font rendering, for better consistency across platforms and easier reproducibility.

Since we have two decoders, the autoregressive prediction logic is more complex than usual, and supporting decoding algorithms such as beam search is not trivial.

Thus, at the latent-transformer level, we currently support only greedy decoding. At the bytes-decoder level, we support all classical decoding algorithms available in Hugging Face Transformers.
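Greedy decoding at the latent level reduces to an argmax at each step, as in this small sketch (the scores are made-up numbers for illustration):

```python
import numpy as np


def greedy_step(scores: np.ndarray) -> int:
    """Pick the single highest-scoring candidate; no beam is kept."""
    return int(np.argmax(scores))


scores = np.array([0.1, 2.3, 0.7])
next_idx = greedy_step(scores)  # index 1 wins
```

Beam search, by contrast, would need to track multiple hypotheses through both decoders at once, which is why it is not yet supported at this level.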

Contributing

See open issues and TODOs in the codebase.

Warning

Training runs are experimental until core issues are resolved.

Cite

If you use this code in your research, please consider citing the work:

@misc{moryossef2025ilt,
  title={Language Modeling with Text as Images},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/image-latent-transformer}},
  year={2025}
}
