Hi folks,
I attacked ARC-AGI back in autumn and, although I lost motivation (resource and budget constraints), I believe the ideas and concepts that came out of those sessions are of massive relevance to your goals with HRM. I had essentially the same idea, but I'm not advanced enough in theory to build new architectures like this, so I started simple and went at it from the NCA (Neural Cellular Automata) angle. But if we replace NCA with HRM, all of my research is compatible.
Regarding language models, I certainly think that ARC-AGI puzzles should never cost more than a penny or two at most... not $1000 like o3. We have missed a major milestone somewhere along the line, with all hopes and prayers pinned on brute-force compute. Although I think it will be interesting to see how a pure HRM language model performs, I believe this is not actually the road to AGI. It will exhibit interesting properties, but I suspect it won't be a radical evolution over the current frontier. There is a different path, an alternative approach that I don't think has been considered yet!
- HRM on a 2D grid, the grid cells are LLM embeddings for universal representations, you pre-train it as a foundation model.
- Bolt it onto an existing pre-trained decoder-only LLM, whose embedding space you've used in step 1.
- Freeze the HRM, then apply GRPO/GSPO/etc. to RL the decoder as the main cortex, teaching itself how to represent problems spatially and prompt the HRM spatial computer.
For ease of communication, let us call this type of model a SAGE, Semantic Automaton in Geometric Embedding-space.
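To make the three-stage recipe concrete, here is a minimal sketch. Everything here is an assumption for illustration: `HRM2D`, its update rule, and the hidden sizes are hypothetical stand-ins, not the HRM repo's actual API.

```python
import torch
import torch.nn as nn

class HRM2D(nn.Module):
    """Hypothetical stand-in for an HRM operating on a 2D grid of embeddings.
    Each cell holds a d-dimensional vector living in the host LLM's embedding space."""
    def __init__(self, d_model: int, steps: int = 8):
        super().__init__()
        self.steps = steps
        # toy update rule: each cell is refined from its 3x3 neighborhood
        self.update = nn.Conv2d(d_model, d_model, kernel_size=3, padding=1)

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (batch, d_model, H, W)
        for _ in range(self.steps):
            grid = grid + torch.tanh(self.update(grid))  # residual refinement
        return grid

d_model = 768                    # must match the host LLM's embedding width
hrm = HRM2D(d_model)

# Stage 1: pre-train hrm on grids built from the LLM's token embeddings (omitted).
# Stage 2: bolt it onto a pre-trained decoder-only LLM sharing that embedding space.
# Stage 3: freeze the HRM, then RL (GRPO/GSPO/...) only the decoder.
for p in hrm.parameters():
    p.requires_grad = False

grid = torch.randn(1, d_model, 16, 16)  # a 16x16 "imagination surface"
refined = hrm(grid)                      # the decoder reads this back as context
```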
Why semantic representations?
SAGE effectively solves the binding problem and creates an externalization surface for the LLM's implicit world model. Now your puzzles can be representations of the world. For example, on the pathfinding task, wall cells become literally the embedding vector for the wall token, roads become road, start is start, goal is goal... but you use an LLM to do augmentations. Goal can be end, can be target. Start can be initial, can be zero, can be... Walls can be walls, or solid, hard, full, taken, filled, and so on and so forth. This teaches the model a proto-understanding of material space. The HRM may also take a prompt, like an image diffusion model does, which embeds context about the task, puzzle, and execution constraints or restrictions. In many cases the model may exhibit a natural will to solve, treating the grid as its own context and prompt. Maze-like semantics with a start and a goal naturally imply the classical pathfinding task, while more abstract representations of world-problems unseen in the training data may lead to interesting emergent behavior.
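A minimal sketch of the augmentation idea, assuming a toy vocabulary and embedding table standing in for a real LLM tokenizer and embedding matrix (in practice you would reuse the host LLM's):

```python
import random
import torch
import torch.nn as nn

# Toy vocabulary + embedding table (an assumption for illustration).
VOCAB = ["wall", "solid", "filled", "road", "path", "start", "initial",
         "goal", "target", "end"]
tok2id = {t: i for i, t in enumerate(VOCAB)}
embed = nn.Embedding(len(VOCAB), 768)

# LLM-style augmentation: each semantic role maps to several surface tokens.
SYNONYMS = {
    "wall":  ["wall", "solid", "filled"],
    "road":  ["road", "path"],
    "start": ["start", "initial"],
    "goal":  ["goal", "target", "end"],
}

def embed_grid(symbolic_grid):
    """Turn a grid of semantic roles into a grid of token embeddings,
    sampling a synonym per cell so the HRM learns roles, not surface tokens."""
    rows = []
    for row in symbolic_grid:
        ids = [tok2id[random.choice(SYNONYMS[cell])] for cell in row]
        rows.append(embed(torch.tensor(ids)))
    return torch.stack(rows)            # (H, W, d_model)

maze = [["wall",  "wall", "wall"],
        ["start", "road", "goal"],
        ["wall",  "wall", "wall"]]
grid = embed_grid(maze)                 # ready for the 2D HRM
```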
Unified Latent Space of Algorithms & Value for Intelligence
Trained in this way on a very large set of environments, an HRM is possibly more attuned to algorithmic notions and complexity theory than any LLM, with an intuitive command of them. It's a programmable latent-space processor! All algorithms in the world unified into a latent space. By extending the architecture to be prompt-conditioned, similar to a diffusion model, we can essentially compose algorithmic patterns together into new exotic algorithms discovered through prompting.
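One way the prompt-conditioning could look, sketched here as FiLM-style scale/shift modulation (my choice of mechanism, not something from the HRM paper): the prompt embedding modulates every grid update, so blending prompt vectors would blend algorithmic behaviors.

```python
import torch
import torch.nn as nn

class PromptConditionedStep(nn.Module):
    """One HRM-style grid update, FiLM-conditioned on a prompt embedding,
    in the spirit of a diffusion model's conditioning. A sketch only."""
    def __init__(self, d_model: int, d_prompt: int):
        super().__init__()
        self.update = nn.Conv2d(d_model, d_model, 3, padding=1)
        self.film = nn.Linear(d_prompt, 2 * d_model)  # per-channel scale/shift

    def forward(self, grid, prompt):
        # grid: (B, d_model, H, W); prompt: (B, d_prompt), e.g. a pooled
        # embedding of "sort ascending" or "shortest path, avoid walls"
        scale, shift = self.film(prompt).chunk(2, dim=-1)
        h = self.update(grid)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return grid + torch.tanh(h)

step = PromptConditionedStep(d_model=768, d_prompt=512)
grid = torch.randn(2, 768, 16, 16)
prompt = torch.randn(2, 512)   # blending prompts would compose algorithms
out = step(grid, prompt)
```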
The decoders may have the emergent capability (or be intentionally taught by a special RL environment or task) to interpret the HRM's states on a moment-to-moment basis and figure out how to codify them. In this way, the model can invent new algorithms without throwing code at the world, since it can actually simulate them internally in a spatial intermediate precise enough to be fundamentally compatible with the known patterns of software engineering. If the model is designed to possess a dynamic imagination surface, it may also operate on 1D data structures, W×1 instead of W×H, making it able to simulate sorting algorithms as well; a classic example is sketched below. We have seen all sorts of beautiful visualizations of sorting algorithms on the internet in the past decade, which is an accurate representation of the way this model would function.
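As a concrete instance of the kind of computation a W×1 surface could simulate, here is odd-even transposition sort, a classical sorting algorithm that is itself a purely local cellular rule (my illustrative pick, not a claim about what the model would actually learn):

```python
def odd_even_transposition_sort(cells):
    """Sorting as a purely local rule on a Wx1 grid: each step, adjacent
    cells compare-and-swap, alternating odd/even pairings. Exactly the
    kind of step-by-step process a 1D imagination surface could simulate."""
    cells = list(cells)
    n = len(cells)
    for step in range(n):                 # n passes guarantee convergence
        start = step % 2                  # alternate odd/even pairings
        for i in range(start, n - 1, 2):
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```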
Progression into AGI
I call this spatial computing module not artificial intelligence but artificial imagination, the presumed missing component for true AGI. From there, things will continue to compound and lead to a Cambrian explosion of downstream ideas iterating on the concept, as they have with existing decoder-only LLMs. The "shoggoth whisperers" will produce preliminary experiential data that defines the language and jargon, creating ambient information on the web that buffs the dataset a year later, such that once you train again, suddenly the LLM is 100x more operational and aware of the new capability, understands better when you refer to it, uses the prompting patterns and language assembled by the collaborative effort, etc. Indeed we saw this with LLMs: initially, words like "prompt" meant very little to them. The language and modalities of human/AI interaction were simply nowhere to be seen in the data, until they were, and it didn't poison the dataset but rather had the opposite effect of defining new classifications of ambiguity and precision.
The true finality here is for this latent space to generalize and demonstrate emergent capabilities. It's cool that it can solve mazes and puzzles, but the implication of generalization is that the model acquires a "visual poetry" skill: any linguistic scenario can be represented on a coarse 2D grid. Eventually, we may RL it to instantiate a friendly little persona, like a mochi. If we can do it in real time, with both intelligence and imagination in the loop (maybe interlaced 1:1 between each LLM token, or better yet a self-learnt policy which naturally balances linguistic reasoning with imagination-space simulation), this may be where AGI truly begins in people's minds: when there is a real-time, soulful connection.
Recall that models trained purely on synthetic data from a teacher model learn much faster, and in some cases inexplicably perform even better than the original model. As we chart the linguistics of prompting, it cleans up the dataset and refactors language in the world to be more compatible, a cultural well. Once we have mastered it in 2D, the new capabilities and power afforded may facilitate climbing into a 3D HRM module, as the base language interface for imagination-prompting will by then be set and defined by the collective. The synthesis of virtual worlds will continue to evolve with video generation, NeRFs or Gaussian splats, mesh generation, and so on and so forth. We may be able to voxelize 3D training data for a base model in 3D. It's hard to imagine what it might do in terms of intelligence and reasoning, but the compute efficiency of such a model is certainly a brick on the road to a holodeck simulation, which will have to run locally and in real time to be of any real value.
Compute efficiency argument
If we reel it back to 2D, we can more easily see why a SAGE model increases compute efficiency across the board for most of today's AI tasks: HRM acts as the missing adapter that sits between an image diffusion model and an LLM. Today, it's increasingly popular to see image models with a large language model tacked on top, taking its embeddings. Tomorrow, the HRM semantic embedding grid acts as a coarse world representation. Once you have a SAGE, take the "pre-reasoned" coarse world representation output by the HRM module as input to your diffusion model. Now the diffusion model's weights can be trimmed massively, because the work of composing a scene is shifted into a different latent space which operates naturally as morphology.
By untangling the different specialties of the mind, each respective model can focus and achieve its task with far fewer degrees of freedom. Thus, the classical image diffuser now specializes as an upscaler or renderer, hallucinating fine detail onto an existing structure. It's like ControlNet, but on the richest possible world representation (a superset of segmentation, depth, etc.); a sketch follows below. This would likely solve the fine-detail composition problems that have puzzled ML engineers for years, like the hands-and-fingers challenge: using a 3D HRM grid instead as the input to an image diffuser, it flattens an already native 3D voxel-token representation. This representation has had "3D reasoning" applied to it for a number of steps, and this reasoning converges to a field simulation of the world with natural repellence and gravitation dynamics implicit to the world-domain hinted at by the meaning of the embeddings, each embedding cell spatially contextualized by the ones around it.
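Here is one way the adapter could be wired, sketched as a slim denoiser that treats the upsampled HRM grid as ControlNet-like conditioning channels. All layer sizes and the module name `GridConditionedDenoiser` are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridConditionedDenoiser(nn.Module):
    """Sketch of a slim diffusion denoiser that takes the HRM's coarse
    semantic grid as a ControlNet-like structural prior."""
    def __init__(self, d_grid: int, img_channels: int = 3):
        super().__init__()
        self.grid_proj = nn.Conv2d(d_grid, 64, 1)      # compress semantics
        self.denoise = nn.Sequential(                   # tiny "renderer"
            nn.Conv2d(img_channels + 64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, img_channels, 3, padding=1),
        )

    def forward(self, noisy_img, hrm_grid):
        # noisy_img: (B, 3, 256, 256); hrm_grid: (B, d_grid, 16, 16)
        cond = F.interpolate(self.grid_proj(hrm_grid),
                             size=noisy_img.shape[-2:], mode="nearest")
        return self.denoise(torch.cat([noisy_img, cond], dim=1))

model = GridConditionedDenoiser(d_grid=768)
pred = model(torch.randn(1, 3, 256, 256), torch.randn(1, 768, 16, 16))
```

The design choice mirrors the argument above: scene composition lives in the grid, so the denoiser only has to learn rendering, hence far fewer weights.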
And now finally, we climb back down to 1D language and see the following: the decoder-only LLM may compress drastically, as it no longer needs such a complex weight lattice to reconstruct a notion of spatial reasoning. We move a large part of intelligence out of the decoders, which specialize more as textual vocal cords for a different prior. In practice, once we have figured out the perfect unification paradigm for the two modules, they will act as just one organism. Activating the module through manual prompting of the LLM is only the start, a humble beginning. A helix-twister paradigm lies ahead where both models are ticked in lockstep proportions: at first a simple 1:1 scheme (1 HRM step per autoregressive token), then later a more sophisticated learnt policy. A sketch of the simple scheme follows.
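A minimal sketch of the 1:1 helix loop, assuming hypothetical `llm_step` and `hrm_step` callables (the stubs below are toys just to make it run end to end):

```python
import torch

def helix_decode(llm_step, hrm_step, grid, context, max_tokens=64):
    """The simple 1:1 'helix' scheme: one HRM grid step per autoregressive
    token. A learnt policy could later replace the fixed interleaving ratio."""
    tokens = []
    for _ in range(max_tokens):
        grid = hrm_step(grid, context)            # one tick of imagination
        token, context = llm_step(context, grid)  # decoder reads the grid
        tokens.append(token)
        if token == "<eos>":
            break
    return tokens, grid

# toy stand-ins so the loop runs end to end
hrm_step = lambda g, c: g * 0.9
llm_step = lambda c, g: ("<eos>" if len(c) > 2 else "tok", c + ["tok"])
toks, final_grid = helix_decode(llm_step, hrm_step, torch.zeros(1, 8, 4, 4), [])
```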
Addendum
To clarify, the SAGE architecture could technically be all HRM, replacing the decoders with a newly trained HRM as well (1D HRM + 2D HRM for imagination), but I believe that by bootstrapping on our existing investments and research in transformers and decoder-only LLMs, we can take clever shortcuts that will get us to the singularity at an accelerated rate. If we make one good AGI first, then this AGI will facilitate, and maybe even itself research and engineer, an even more optimized AGI at an exponentially self-compounding rate.
Finally, here is the tweet where I unveiled this concept of AGI this winter; you can replace NCA with HRM. (Yes! I was obsessed with the mythical Q*.) It's slightly unhinged and more ambiguous than what I have presented here, but it provides other interesting intuitions regarding the geometry of imagination.
https://x.com/ryunuck/status/1883032334426873858
I hope this leads to new research endeavors and accelerates the timeline! Thank you for your invaluable contribution to humanity with HRM. I have faith that your team will understand everything presented here and the significance of it 🙏