We present a language model that replaces conventional subword tokenization with a dual representation of text as both bytes and rendered images of "words", allowing the model to directly process the visual and symbolic structure of language. It segments text into whitespace-delimited units, encodes each unit as both an image and a byte sequence, and passes them through an encoder–decoder pipeline: a bytes encoder, an image encoder, a large latent transformer, and a bytes decoder. At inference, predicted bytes are rendered back into images, closing the loop for the next prediction. This design could make learning and inference cheaper and faster for non-English languages, since the heavy latent transformer only predicts high-level token representations, while the actual byte sequences are generated by a much smaller decoder.
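For orientation, here is a minimal sketch of one prediction step. The callables and the fusion step are hypothetical stand-ins, not the repository's actual API:

```python
import torch

# Minimal sketch of one prediction step (hypothetical callables, not the repository's API).
def predict_next_word(words, bytes_encoder, image_encoder, latent_transformer,
                      bytes_decoder, render_image):
    # Each whitespace-delimited unit is represented both as raw bytes and as a rendered image.
    byte_features = [bytes_encoder(word.encode("utf-8")) for word in words]   # one vector per word
    image_features = [image_encoder(render_image(word)) for word in words]    # one vector per word

    # Fuse the two views into one latent per word; the heavy latent transformer only
    # predicts the next word's high-level representation.
    word_latents = torch.stack([torch.cat([b, i], dim=-1)
                                for b, i in zip(byte_features, image_features)])
    next_latent = latent_transformer(word_latents)

    # A much smaller bytes decoder turns that latent into the next word's byte sequence.
    next_bytes = bytes_decoder(next_latent)

    # At inference, the predicted word is rendered back into an image for the next step.
    return next_bytes.decode("utf-8", errors="replace")
```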
Clone and set up the repository:

```bash
git clone https://github.com/sign/image-latent-transformer.git
cd image-latent-transformer
```

Install dependencies:

```bash
conda create -n ilt python=3.12 -y
conda activate ilt
conda install -c conda-forge pycairo pygobject manimpango -y
pip install ".[dev]"
```

Or use Docker:

```bash
docker build -t ilt .
docker run -it --rm ilt /bin/bash
```
> [!TIP]
> Run tests using `pytest` to ensure everything is working correctly.
- Bytes Encoder - You can use any language model as the bytes encoder (causal or masked).
- Image Encoder - You can use any image encoder.
- Latent Transformer - You can use any causal LM (recommended: large).
- Bytes Decoder - You can use any causal LM (recommended: small).
For the language model components, the effective parameter count is lower than officially reported, because we remove their embedding layers.
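As a rough illustration, the non-embedding parameter count of a Hugging Face model can be estimated like this (the name filter is a heuristic, not the repository's exact accounting):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

# Count parameters, excluding embedding (and tied output embedding) layers by name.
# The substring filter is a heuristic and may need adjusting per architecture.
non_embedding = sum(
    p.numel()
    for name, p in model.named_parameters()
    if "embed" not in name.lower()
)
print(f"non-embedding parameters: {non_embedding / 1e6:.1f}M")
```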
Our implementation allows any mix-and-match of these components. Some example setups are shown below, with an illustrative loading sketch after the table:
Name | Bytes Encoder | Image Encoder | Latent Transformer | Bytes Decoder | Total Parameters |
---|---|---|---|---|---|
tiny | bert-tiny (0.5m) | vit-tiny-patch16-224 (5m) | pythia-70m (19m) | tiny-lm (3m) | 28m |
small | ModernBERT-base (111m) | swinv2-tiny-patch4-window16-256 (27m) | gemma-3-270m (100m) | SmolLM2-135M (106m) | 346m |
medium | deberta-v3-large (303m) | clip-vit-base-patch32 (87m) | Llama-3.2-1B (973m) | gpt2-medium (304m) | 1,674m |
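For instance, the components of the "small" setup could be fetched directly from the Hugging Face Hub. The hub IDs below are our assumptions for where these checkpoints live (some may be gated and require accepting a license); wiring them into the full model follows the training instructions later in this README:

```python
from transformers import AutoModel, AutoModelForCausalLM

# Hub IDs are assumptions for the checkpoints named in the table above.
bytes_encoder      = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
image_encoder      = AutoModel.from_pretrained("microsoft/swinv2-tiny-patch4-window16-256")
latent_transformer = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
bytes_decoder      = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
```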
To turn off bytes encoding, set `bytes_encoder=False`; similarly, to turn off image encoding, set `image_encoder=False`.
You can also turn off a specific encoder after training has completed, for testing purposes.
> [!WARNING]
> In the implementation of the bytes decoder, we concatenate the byte embeddings of the current token with the (word-level) embeddings of all previous tokens. We do this because not all causal LMs support cross-attention, so we avoid it and rely on the self-attention mechanism instead.
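A minimal sketch of this concatenation, assuming the word-level latents and the current token's byte embeddings are already computed (the tensor shapes are illustrative, not the repository's exact code):

```python
import torch

# Assumed shapes (illustrative):
#   prev_word_latents: (batch, num_prev_words, hidden)  -- one latent per previous word
#   byte_embeddings:   (batch, num_bytes, hidden)       -- embeddings of the current word's bytes
prev_word_latents = torch.randn(1, 7, 512)
byte_embeddings = torch.randn(1, 12, 512)

# Concatenate along the sequence dimension so the bytes decoder can attend to the
# previous words through ordinary causal self-attention (no cross-attention needed).
decoder_inputs = torch.cat([prev_word_latents, byte_embeddings], dim=1)

# With a Hugging Face causal LM, this sequence would be passed via `inputs_embeds`:
# outputs = bytes_decoder(inputs_embeds=decoder_inputs)
```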
Training instructions are available in training/README.md, where you can select the model architecture for each component and the dataset to train on.
> [!CAUTION]
> Our text renderer relies on the operating system's font rendering. Rendering on different systems may therefore yield different results (e.g. for emoji). We call on the community to create a more robust renderer, decoupled from system font rendering, for better consistency across platforms and easier reproducibility.
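For illustration only: the snippet below uses PIL rather than the Cairo/Pango libraries installed above, but it shows why word rendering is platform-dependent (available fonts and their metrics differ per machine).

```python
from PIL import Image, ImageDraw, ImageFont

def render_word(word: str, size=(224, 32)) -> Image.Image:
    """Render a single word onto a fixed-size white canvas.

    Uses PIL's bundled default font; swapping in a system font
    (ImageFont.truetype) makes the output platform-dependent.
    """
    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    draw.text((2, 2), word, fill="black", font=ImageFont.load_default())
    return image

img = render_word("hello")
```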
Since we have two decoders, the autoregressive prediction logic is a bit more complex than usual, and supporting decoding algorithms such as beam search is not trivial.
Thus, at the latent-transformer level we currently support only greedy decoding. At the bytes-decoder level, we support all classical decoding algorithms available in Hugging Face Transformers.
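At the bytes level this maps onto Hugging Face's standard `generate` interface. A self-contained illustration, using a random stand-in for the concatenated latent/byte embeddings rather than the repository's actual call:

```python
import torch
from transformers import AutoModelForCausalLM

bytes_decoder = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Illustrative stand-in for the concatenated word latents + current byte embeddings.
decoder_inputs = torch.randn(1, 19, bytes_decoder.config.hidden_size)

# Any classical decoding strategy supported by `generate` works at this level.
output_ids = bytes_decoder.generate(
    inputs_embeds=decoder_inputs,
    max_new_tokens=16,
    num_beams=4,        # e.g. beam search; sampling or greedy decoding work too
)
```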
See open issues and TODOs in the codebase.
> [!WARNING]
> Training runs are experimental until core issues are resolved.
If you use this code in your research, please consider citing the work:
```bibtex
@misc{moryossef2025ilt,
  title={Language Modeling with Text as Images},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/image-latent-transformer}},
  year={2025}
}
```