Large language models (LLMs) exhibit strong performance on linguistic reasoning tasks but continue to struggle with problems requiring spatial and visual reasoning, often relying on shallow textual heuristics rather than grounded internal representations. In this work, we investigate whether equipping a transformer-based language model with an explicit internal visual latent representation improves performance on visually grounded reasoning tasks.
We propose a unified decoder-only architecture in which discrete image tokens, obtained from a pretrained vector-quantized autoencoder (VQGAN), are incorporated directly into the model’s vocabulary alongside standard text tokens. The model is trained autoregressively to generate an intermediate sequence of image tokens—interpreted as an internal “imagined” visual state—prior to producing a textual answer.
Our results demonstrate that models trained with forced internal visual intermediates outperform text-only baselines on spatial reasoning tasks and exhibit significant performance degradation when the imagined visual tokens are disrupted (Blindfold Accuracy: 57.0% vs. Imagined Accuracy: 90.5%), indicating that the visual representation is causally utilized.
We evaluated the model on a "Forced Choice" spatial reasoning task ("Does the Red Square overlap the Blue Circle?"). The results demonstrate a clear causal dependency on the internal visual state.
| Condition | Description | Accuracy | Interpretation |
|---|---|---|---|
| Teacher Forced | Ground truth visual tokens provided. | 100.0% | Upper Bound: The model can perfectly interpret clear visual data. |
| Imagined (Greedy) | Model generates its own visual latents. | 90.5% | Method: Internal simulation provides a +31.5-point gain over the text-only baseline. |
| Text-Only | Visual tokens omitted entirely. | 59.0% | Baseline: Text priors alone are insufficient for this task. |
| Blindfold | Visual tokens replaced with VQ-noise. | 57.0% | Control: Performance collapses to the majority-class rate (57%) without visual structure. |
Verdict: The degradation from 90.5% (Imagined) to 57.0% (Blindfold) rejects the null hypothesis that the model relies solely on text shortcuts.
This research investigates three core claims regarding multimodal reasoning in transformers:
- H1 (Inductive Bias): Explicit internal visual representations provide a beneficial inductive bias for reasoning over spatial relations and object interactions.
- H2 (Necessity): The generated visual tokens are necessary intermediates for correct reasoning, not merely auxiliary outputs.
- H3 (Generalization): Models with internal visual intermediates exhibit improved robustness to linguistic paraphrasing and prompt perturbations compared to text-only models.
```
.
├── dataset_imagination_balanced/   # Synthetic data generation output
├── taming/                         # VQGAN dependencies (Taming Transformers)
├── train_data_final.pt             # Preprocessed dataset (tokenized text + VQ indices)
├── imaginer_final.pth              # Trained model weights (30k iters)
├── vqgan_imagenet_f16_16384.*      # VQGAN checkpoint & config
│
├── src/
│   ├── train_imaginer.py           # Main autoregressive training loop
│   ├── preprocess.py               # Data pipeline: images -> VQ indices -> .pt
│   └── datafactory.py              # Synthetic dataset generation engine
│
├── evaluation/
│   ├── evaluate_rigorous.py        # Statistical evaluation (Teacher vs Blindfold vs Imagined)
│   ├── verify_logic.py             # Sighted inference check
│   └── verify_blindfold.py         # Causal intervention check (noise injection)
│
├── visualization/
│   ├── visualize.py                # Generate and decode internal "dreams"
│   └── diagnostic_dream.png        # Sample output of internal state
│
└── requirements.txt                # Project dependencies
```
We utilize a standard GPT-2 style decoder-only transformer. The vocabulary is expanded to include discrete codebook indices from a VQGAN trained on ImageNet.
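A minimal sketch of the vocabulary expansion, assuming the standard GPT-2 BPE vocabulary (50,257 tokens) and the 16,384-entry codebook implied by the checkpoint name; `shift_vq_indices` is an illustrative helper, not code from this repo:

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 50257   # assumption: standard GPT-2 BPE tokenizer size
VQ_CODEBOOK = 16384  # f16 VQGAN codebook (matches vqgan_imagenet_f16_16384)
JOINT_VOCAB = TEXT_VOCAB + VQ_CODEBOOK

def shift_vq_indices(vq_indices: torch.Tensor) -> torch.Tensor:
    """Map VQGAN codebook indices [0, 16384) into the joint vocabulary
    by offsetting them past the text-token range."""
    return vq_indices + TEXT_VOCAB

# A single embedding table then covers both modalities.
tok_emb = nn.Embedding(JOINT_VOCAB, 768)

print(shift_vq_indices(torch.tensor([0, 5, 16383])))
# tensor([50257, 50262, 66640])
```

With this offset, text and image tokens share one softmax, so the same autoregressive head can emit either modality at any position.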
The model is trained to minimize the negative log-likelihood over the joint sequence of text prompt $x$, visual tokens $v$, and answer $y_{answer}$:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(s_t \mid s_{<t}), \qquad s = [x;\, v;\, y_{answer}]$$

where $s$ is the concatenated token sequence and the visual tokens $v$ are supervised with ground-truth VQGAN indices during training.
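The joint-sequence objective can be sketched as a masked cross-entropy over next-token predictions. Whether the prompt tokens are excluded from the loss is an assumption here, and `joint_nll` is illustrative rather than the repo's implementation:

```python
import torch
import torch.nn.functional as F

def joint_nll(logits: torch.Tensor, targets: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token NLL over the concatenated sequence [prompt; v; y_answer].

    logits:     (T, V) model outputs at each position
    targets:    (T,)   token ids of the sequence
    prompt_len: loss is masked so only visual and answer tokens are
                supervised (assumption: the prompt is not a target).
    """
    shift_logits = logits[:-1]          # position t predicts token t+1
    shift_targets = targets[1:]
    per_tok = F.cross_entropy(shift_logits, shift_targets, reduction="none")
    mask = torch.zeros_like(per_tok)
    mask[prompt_len - 1:] = 1.0         # first supervised step: first visual token
    return (per_tok * mask).sum() / mask.sum()
```

With uniform (all-zero) logits each supervised position contributes $\log V$, which gives a quick sanity check on the masking.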
To verify H2 (Necessity), we perform an intervention at inference time:

- Let the model generate the visual sequence $v$.
- Replace $v$ with noise $v_{noise}$.
- Force the model to predict $y_{answer}$ conditioned on $v_{noise}$.

A collapse in performance confirms that the answer $y_{answer}$ is downstream of the visual state $v$.
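The blindfold step can be sketched as follows, assuming "VQ-noise" means i.i.d. uniform draws from the codebook; `blindfold` is an illustrative helper, not code from `verify_blindfold.py`:

```python
import torch

def blindfold(visual_tokens: torch.Tensor, codebook_size: int = 16384,
              seed: int = 0) -> torch.Tensor:
    """Replace the imagined visual sequence v with VQ-noise v_noise:
    uniform random codebook indices of the same length as v."""
    g = torch.Generator().manual_seed(seed)
    return torch.randint(0, codebook_size, visual_tokens.shape, generator=g)
```

The answer is then decoded conditioned on `[prompt; v_noise]`; because the noise tokens are valid codebook entries, any accuracy drop is attributable to the loss of visual structure rather than out-of-vocabulary inputs.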
- Python 3.9+
- PyTorch (CUDA or MPS recommended)
```shell
pip install -r requirements.txt
```

Generate the synthetic spatial reasoning dataset (100k samples):

```shell
python src/datafactory.py --size 100000
```

Tokenize text and encode images into discrete VQGAN indices. Note: requires `vqgan_imagenet_f16_16384.ckpt` in the project root.

```shell
python src/preprocess.py
```

Train the autoregressive transformer:

```shell
python src/train_imaginer.py
```

Run the rigorous forced-choice evaluation suite:

```shell
python evaluation/evaluate_rigorous.py
```

To inspect the model's "imagination", run the visualization script. This decodes the generated token sequence back into pixel space using the VQGAN decoder.

```shell
python visualization/visualize.py
```

Sample output:
Prompt: "Imagine a red square at grid (2, 2) and a blue circle at (6, 6)..."
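Before the VQGAN decoder can render a "dream", the generated joint-vocabulary tokens must be mapped back to codebook indices and reshaped into the 2D latent grid. A minimal sketch of that bookkeeping, where the GPT-2 vocabulary size and the 16×16 grid (f16 VQGAN on 256×256 images) are assumptions:

```python
import torch

TEXT_VOCAB = 50257   # assumption: GPT-2 tokenizer size
GRID = 16            # assumption: 16x16 latent grid (f16 on 256px images)

def tokens_to_grid(joint_tokens: torch.Tensor) -> torch.Tensor:
    """Recover VQGAN codebook indices from joint-vocabulary visual tokens
    and reshape them into the (batch, h, w) grid the decoder expects."""
    vq = joint_tokens - TEXT_VOCAB   # undo the vocabulary offset
    return vq.view(1, GRID, GRID)

# The final pixel-space decode requires the checkpoint (shown as comments only):
# quant = vqgan.quantize.get_codebook_entry(grid.flatten(), shape=(1, GRID, GRID, 256))
# img = vqgan.decode(quant)
```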
If you use this code or methodology, please cite:
```bibtex
@misc{metoyer2024visualreasoning,
  title={Visual Internal Reasoning: Causal Dependency on Latent Image Tokens},
  author={Metoyer, Chase},
  year={2024},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/chasemetoyer/visual-internal-reasoning}}
}
```