Studying Reasoning about BlocksWorld #10

kisate · 2025-02-18T17:48:28Z

kisate
Feb 18, 2025

Research Question

How do reasoning models solve BlocksWorld problems?

Can we extract their changing internal world representation from the CoT?

Can we find mechanisms for their action search?

How do they solve the obfuscated "mystery" BlocksWorld?

Owners

Dmitrii Kharlapenko (@kisate)

Project status

Work in progress

Current findings

Semi-successful single token linear probes for the current states & planned actions
Block representations seem to have some structure in their PCA
There is a direction that can ~predict the “wait” token

kisate · 2025-02-18T17:50:49Z

kisate
Feb 18, 2025
Author

Current write up:

https://docs.google.com/document/d/1-paTQJgAWC72uZ3KEAuYWsxbCDa1Fo2L9ooUoEmtixc/edit?tab=t.0#heading=h.6eifp0mqr48s

TLDR

BlocksWorld is a problem where the model needs to create a plan to stack blocks in a particular order
Existing work has shown that reasoning models are much better than regular LLMs on BlocksWorld
Main goal: study how R1 distil solves BlocksWorld
Goal 1: Extract the model’s internal world representation
Goal 2: Mechanisms for action selection/predicate checking including the “wait” mechanism
Goal 3: Compare findings with the obfuscated “mystery” blocksworld, which is related to coded reasoning and steganography
Some initial results are:
- Semi-successful single token linear probes for the current states & planned actions
- Block representations seem to have some structure in their PCA
- There is a direction that can ~predict “wait” tokens
Some next steps:
- Creating labeled datasets with LLMs for internal state probing
- More complex probe architectures (e.g. attention-like ones)
- Steering interventions
- Cross-patching representations between regular and mystery BlocksWorld
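
For readers unfamiliar with the term, a single-token linear probe of the kind mentioned above can be sketched as a logistic regression trained on one activation vector per example. This is only a minimal illustration on synthetic data: the function names, the synthetic "activations", and the binary label are assumptions for the example and are not taken from the write-up.

```python
import math
import random

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for large |z|.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(xs, ys, lr=0.5, epochs=200):
    """Fit a logistic-regression probe (w, b) with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for hidden states: the label shifts x along one direction."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y else -1.0
        xs.append([rng.gauss(0, 1) + sign * 1.5 * d for d in direction])
        ys.append(y)
    return xs, ys

xs, ys = make_synthetic_activations(200, 8)
w, b = train_linear_probe(xs[:150], ys[:150])
held_out = list(zip(xs[150:], ys[150:]))
accuracy = sum(probe_predict(w, b, x) == y for x, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real model activations, `xs` would be hidden states at a chosen token and layer, and `ys` a label such as whether a given predicate holds in the current state; the learned `w` is then the probe direction.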

2 replies

wendlerc Feb 21, 2025
Collaborator

at which layer can you predict the wait token?

kisate Feb 22, 2025
Author

The probe in the write-up was trained on the last layer. It should probably also work on the last 5-10 layers.

wattenberg · 2025-02-20T22:38:55Z

wattenberg
Feb 20, 2025
Collaborator

This looks fascinating! I'm particularly struck by the graph in the write-up labeled "The confidence/self-regulation/“wait” mechanism". It looks like, if you smoothed it a bit, you'd see large-scale structure, almost like a progress measure.

3 replies

kisate Feb 20, 2025
Author

Thanks for the kind words!

This was a pretty simple probe, and I am pleasantly surprised that it was able to pick up a direction that anticipates an upcoming "wait".

However, steering with it did not change much, so it may still be some way from a true "confidence"/"progress" measure.
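
To make "steering with it" concrete: a common way to steer with a probe direction is to add a scaled unit vector to the hidden state. The sketch below assumes that form; the vectors, `alpha`, and function names are illustrative, and the exact intervention used in the write-up is not described in this thread.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(h, direction):
    """Scalar projection of hidden state h onto the (normalised) direction."""
    return sum(a * b for a, b in zip(h, unit(direction)))

def steer(h, direction, alpha):
    """Shift h by alpha along the direction; alpha is a free hyperparameter."""
    return [a + alpha * b for a, b in zip(h, unit(direction))]

hidden = [0.2, -1.0, 0.5, 0.3]          # toy hidden state, not real activations
wait_direction = [1.0, 0.0, -1.0, 0.0]  # hypothetical probe direction
steered = steer(hidden, wait_direction, alpha=4.0)
print(project(hidden, wait_direction), project(steered, wait_direction))
```

By construction the projection onto the direction increases by exactly `alpha`; whether the model's behaviour changes is the empirical question.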

I feel like this direction heavily intersects with some other ARBOR projects, so I am focusing on the internal state probing for now.

wattenberg Feb 22, 2025
Collaborator

Do you have any idea how the system represents relationships between blocks? E.g., if block A is on top of block B, can we extract that relationship from the residual stream somehow?

kisate Feb 22, 2025
Author

I am currently working on exactly this.

Will soon publish an update with results on various current state probes.

One of the preliminary results shows that the ordering of blocks along the first PCA component may be correlated with the order of blocks in the current state. It matched the order of blocks in the goal pretty well in the early and late stages of the CoT.

I will check PCA again when I finish extracting the current state for all parts of the CoT.
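
One way to quantify the PCA observation is to project each block's representation onto the first principal component and compute the rank correlation with the block order. The following is a rough sketch on synthetic vectors; the data, the helper names, and the choice of Spearman correlation are assumptions for illustration, not the analysis from the write-up.

```python
import random

def first_principal_component(vectors, iters=200, seed=0):
    """Top principal direction via power iteration on the centered data."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    rng = random.Random(seed)
    pc = [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(iters):
        # Multiply by X^T X without building the covariance matrix.
        scores = [sum(c * p for c, p in zip(row, pc)) for row in centered]
        new = [sum(s * row[i] for s, row in zip(scores, centered)) for i in range(dim)]
        norm = sum(x * x for x in new) ** 0.5
        pc = [x / norm for x in new]
    return pc, centered

def spearman(a, b):
    """Spearman rank correlation; assumes no ties in either list."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    ra, rb, n = ranks(a), ranks(b), len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

rng = random.Random(0)
stack_order = [3, 0, 5, 1, 4, 2]  # hypothetical position of each block in the stack
axis = [rng.gauss(0, 1) for _ in range(16)]
# Synthetic block vectors: position encoded along one axis plus small noise.
block_vectors = [[pos * a + rng.gauss(0, 0.1) for a in axis] for pos in stack_order]

pc1, centered = first_principal_component(block_vectors)
pc1_scores = [sum(c * p for c, p in zip(row, pc1)) for row in centered]
rho = spearman(pc1_scores, stack_order)
print(f"|Spearman rho| between PC1 order and block order: {abs(rho):.2f}")
```

The sign of a principal component is arbitrary, so only the absolute value of the correlation is meaningful here.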

ARBORproject · 2025-03-18T02:17:34Z

ARBORproject
Mar 18, 2025
Maintainer

Hello - just checking in to see whether this project is still active. To keep our project statuses accurate, we would otherwise like to switch the project status to Inactive until there is activity again!

0 replies

kisate · 2025-05-16T19:29:57Z

kisate
May 16, 2025
Author

Hello everyone!

It has been a long time since the previous update, but I am excited to share some cool recent progress!

TLDR

QWQ-32B achieves 30% accuracy on Mystery Blocksworld - a task where actions and predicates are obfuscated with alternative terms (e.g., pick up → attack, on top of → craves)
Key discovery: Reasoning models spontaneously adapt entity representations during problem-solving WITHOUT any fine-tuning
We observed this adaptation "in the wild" using unmodified QWQ-32B on 13 different obfuscation mappings ("domains")
Representation findings:
- Models naturally form common symbolic representations across domains during inference
- Similarities between action representations increase with CoT length
- Domains with higher accuracy show faster representation adaptation
Steering experiments: Injecting these naturally-formed representations into earlier parts of CoT improved accuracy by 4-5% on average
Conclusion: Reasoning models may inherently refine entity representations during problem-solving, and these refinements causally impact task performance
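
As a rough illustration of the measurements summarised above, the sketch below computes the mean pairwise cosine similarity of one action's representation across domains at several CoT stages, and then blends a late-stage representation into an early one. Everything here is a toy assumption: the synthetic vectors are constructed to drift toward a shared vector, and `inject` is a simple interpolation that stands in for whatever injection procedure the write-up actually uses.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_similarity(vectors):
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def inject(early, late, alpha):
    """Hypothetical injection: interpolate an early representation toward a late one."""
    return [(1 - alpha) * e + alpha * l for e, l in zip(early, late)]

rng = random.Random(0)
dim, n_domains, n_stages = 32, 4, 5
shared = [rng.gauss(0, 1) for _ in range(dim)]
domain_specific = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_domains)]

def representation(domain, stage):
    # Toy model: representations move from domain-specific to shared over the CoT.
    t = stage / (n_stages - 1)
    return [(1 - t) * d + t * s for d, s in zip(domain_specific[domain], shared)]

similarity_by_stage = [
    mean_pairwise_similarity([representation(d, s) for d in range(n_domains)])
    for s in range(n_stages)
]
print("cross-domain similarity per CoT stage:", [round(x, 2) for x in similarity_by_stage])

early = [representation(d, 0) for d in range(n_domains)]
late = [representation(d, n_stages - 1) for d in range(n_domains)]
steered = [inject(e, l, alpha=0.8) for e, l in zip(early, late)]
print("early:", round(mean_pairwise_similarity(early), 2),
      "after injection:", round(mean_pairwise_similarity(steered), 2))
```

In the real setting the vectors would be hidden states of the obfuscated action tokens in each domain, and the interesting quantity is how task accuracy changes after injection, which this toy example cannot show.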

Full write up is here: https://docs.google.com/document/d/1ayrLFQaR58HPBU4YXk219vNTpCCBljdrE8Hq1Sbzncw/edit?tab=t.0

Would be happy to receive any feedback and suggestions!

Unfortunately, I wasn't able to get the main results in time for NeurIPS, so I aim to expand the scope for ICLR.

0 replies
