-
I've thought about this a bit. It could be useful to look at how the probability of the answer token changes as a function of the chain of thought. For example, when asked what 2+2 is, if R1 says "oh I think it's 4, wait, let me check, maybe it's 5, no, I think it's 4," does the probability of the "4" token actually change when R1 does this backtracking/reflection with "wait"s? This is measurable with LogitLens.
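For concreteness, here is a minimal sketch of what that measurement could look like with HuggingFace transformers, assuming a Llama/Qwen-style architecture where the final norm lives at `model.model.norm` and the unembedding at `model.lm_head`; the prompt and the " 4" token choice are just placeholders:

```python
# Sketch: track P(" 4") at each CoT position via the logit lens.
# Assumes a Llama/Qwen-style HF model (final norm = model.model.norm,
# unembedding = model.lm_head); adjust attribute names for other architectures.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

cot = "What is 2+2? <think>Oh I think it's 4, wait let me check, maybe it's 5, no it's 4.</think>"
ids = tok(cot, return_tensors="pt").input_ids
answer_id = tok(" 4", add_special_tokens=False).input_ids[0]

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# Logit lens: take the residual stream at an intermediate layer and push it
# through the final norm + unembedding to read off early next-token predictions.
layer = len(out.hidden_states) - 2                 # e.g. just before the last layer
h = out.hidden_states[layer][0]                    # (seq_len, d_model)
logits = model.lm_head(model.model.norm(h))        # project every position to vocab space
p_answer = logits.softmax(-1)[:, answer_id]        # P(" 4") as the next token, per position

for pos, p in enumerate(p_answer.tolist()):
    print(pos, repr(tok.decode([int(ids[0, pos])])), round(p, 4))
```

Sweeping `layer` over all intermediate layers would show whether the backtracking changes the answer's probability deep in the network or only at the output.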
-
This is a very rough update on this project: the data collection is still ongoing.

Next steps:
-
For the first visualization, what would it look like if sorted by something other than question ID? For example, sorting by the values for the different colors?
-
It might be worth framing this within the chain-of-thought faithfulness literature (https://arxiv.org/abs/2307.13702, https://arxiv.org/pdf/2305.04388), which considers very similar questions: e.g., Lanham et al. interrupt the model's generation, ask it to answer the question directly, and study how this probability changes through the generation. There is also this paper from Prof. Lakkaraju's lab with mixed results on using probing or finetuning to increase faithfulness (https://arxiv.org/abs/2406.10625), though I don't believe they tested reasoning models. I think it might be cool to investigate mechanistic questions related to 'pivotal tokens', steps in the reasoning process where the probability of obtaining the right answer after resampling increases within a few steps. If the model responds to a math question with tokens y_1 through y_T, then a pivotal token is some index t' such that resampling conditioned on y_1 through y_{t'} produces the correct answer with small probability, while resampling conditioned on y_1 through y_{t'+1} produces it with large probability. They were used to improve the model in the Phi-4 technical report (https://arxiv.org/abs/2412.08905v1, https://arxiv.org/abs/2411.19943), and I have some work on these as well (https://arxiv.org/abs/2502.00921). I think it could be interesting to see mechanistically what these pivotal tokens correspond to. Can we see this narrowing down of final answers through probing?
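A rough resampling sketch for locating such pivotal tokens might look like the following; the model choice, the `\boxed{}` answer-extraction convention, the probe stride, and the 0.5 jump threshold are all illustrative assumptions, not anything prescribed by the papers above:

```python
# Sketch: locate "pivotal tokens" by resampling completions from successive
# CoT prefixes and watching where P(correct answer) jumps.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def answer_of(text: str) -> str | None:
    # Illustrative convention: the final answer appears as "\boxed{...}".
    m = re.findall(r"\\boxed\{(.+?)\}", text)
    return m[-1] if m else None

def p_correct(prefix_ids: torch.Tensor, correct: str, n: int = 8) -> float:
    """Fraction of n resampled continuations of this prefix that reach `correct`."""
    outs = model.generate(prefix_ids, do_sample=True, temperature=0.6,
                          max_new_tokens=512, num_return_sequences=n,
                          pad_token_id=tok.eos_token_id)
    texts = tok.batch_decode(outs[:, prefix_ids.shape[1]:], skip_special_tokens=True)
    return sum(answer_of(t) == correct for t in texts) / n

question = "..."            # some math question (placeholder)
correct = "4"               # its known answer (placeholder)
full_ids = tok(question, return_tensors="pt").input_ids

# One greedy rollout whose prefixes we probe (y_1 ... y_T in the notation above).
rollout = model.generate(full_ids, do_sample=False, max_new_tokens=1024,
                         pad_token_id=tok.eos_token_id)

probs = []
for t in range(full_ids.shape[1], rollout.shape[1], 16):   # probe every 16 tokens
    probs.append((t, p_correct(rollout[:, :t], correct)))

# A pivotal index t' is one where P(correct) jumps sharply between consecutive probes.
pivots = [t for (t, p), (_, p_next) in zip(probs, probs[1:]) if p_next - p > 0.5]
print(pivots)
```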
-
Just a quick update on the visualization. I am using the viridis color map: the brighter the color, the higher the probability of the correct answer (if we exit the thinking at that token). In general, there are three interesting patterns:

Productive Reasoning: These are the ideal situations where the reasoning is connected with the model's choice (of the right answer). See HTML visualizations here: https://drive.google.com/drive/folders/1DPCig7Sz3HvQ40Kq6yFrB_X1oOqfjOMX?usp=drive_link

Counterproductive Reasoning: It's certainly unfortunate that the additional reasoning led to the wrong answer, but I do not think this behavior is an anomaly given that the model does not know the correct answer beforehand, and its reasoning did affect its final answer. See this link for the corresponding HTML visualizations: https://drive.google.com/drive/folders/1gTiLaUt0fHKa-6-Te2jHM_0QKaDM5lcg?usp=sharing

Meaningless Effort: See more examples here: https://drive.google.com/drive/folders/1MUMbKV6J4XKOATwiWNB5d-oXDzELr1ru?usp=sharing

Ongoing: I am creating more visualizations like this and will upload them to the folders above. Meanwhile, I am also starting on the probing for this hidden progress.
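For anyone who wants to reproduce this style of thumbnail, here is a minimal matplotlib sketch of the coloring scheme; the per-token probability trace is synthetic, just to show the layout and the viridis mapping, and is not the project's actual rendering code:

```python
# Sketch: render a per-token "probability of correct answer if we exit thinking here"
# trace as a viridis-colored matrix, roughly mirroring the thumbnails described above.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholder trace of P(correct | exit thinking at token t).
p_correct = np.clip(np.cumsum(np.random.randn(600)) * 0.02 + 0.3, 0, 1)

width = 30                                    # tokens per row of the thumbnail
n_rows = int(np.ceil(len(p_correct) / width))
grid = np.full(n_rows * width, np.nan)        # pad the last row with NaN
grid[:len(p_correct)] = p_correct
grid = grid.reshape(n_rows, width)

plt.imshow(grid, cmap="viridis", vmin=0, vmax=1, aspect="auto")
plt.colorbar(label="P(correct answer | exit thinking at this token)")
plt.xlabel("token (within row)")
plt.ylabel("row")
plt.title("Reasoning progress thumbnail (synthetic data)")
plt.show()
```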
-
Research Question
When I asked the DeepSeek models a challenging abstract algebra question, they often generated hundreds of tokens of reasoning before providing the final answer. Yet, on some questions, if I removed the generated reasoning and asked for the answer immediately, the answer was still correct.
On these questions, when the model begins its reasoning with phrases like "Wait, wait, ...", is it actually reasoning through the problem for itself, or is it merely performing for the user? Can we tell from the question's representation whether reasoning is necessary for producing the correct answer? Given the representation of an ongoing CoT, can we tell whether it reflects genuine reasoning or is simply a post-hoc explanation of a pre-planned answer?
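As a concrete illustration of the "answer without reasoning" probe, here is a minimal sketch. It assumes the common trick of pre-filling an empty `<think></think>` block so an R1-distilled model answers immediately; the exact strings may need adjusting to the model's chat template, and the question text is a placeholder:

```python
# Sketch: ask the model the same question with and without its chain of thought,
# by pre-filling an empty <think></think> block to skip the reasoning.
# Depending on the chat-template version, "<think>" may already be appended
# by apply_chat_template; adjust the suffix accordingly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

question = "Which of the following is a group under addition? (A) ... (B) ... (C) ... (D) ..."  # placeholder
prompt = tok.apply_chat_template([{"role": "user", "content": question}],
                                 tokenize=False, add_generation_prompt=True)

with_thinking = prompt                                    # model is free to reason in <think>...</think>
without_thinking = prompt + "<think>\n\n</think>\n\n"     # reasoning block closed immediately

for name, p in [("with thinking", with_thinking), ("no thinking", without_thinking)]:
    ids = tok(p, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=2048, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(name, "->", tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)[-200:])
```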
Owner
Yida Chen (@yc015)
Contributors
Martin Wattenberg (@wattenberg), Aghyad Deeb (@aghyad-deeb)
Project status
Ongoing: I am temporarily pausing taking on new collaborators for this project, but it may reopen in the near future!
Current Findings
On the test set of the MMLU benchmark (with questions from abstract_algebra, anatomy, and astronomy), the number of thinking tokens used by DeepSeek-R1-Distill-Qwen-7B varied a lot across repeated samples of the output. Meanwhile, the number of tokens used in the thinking (CoT) has almost no correlation with the correctness of the output (point-biserial correlation ≈ -0.05).
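For reference, that correlation can be computed directly with scipy; the records below are placeholder data standing in for the collected (thinking-token count, correctness) pairs:

```python
# Sketch: point-biserial correlation between CoT length and answer correctness.
from scipy.stats import pointbiserialr

# Each record: (number of thinking tokens, 1 if final answer was correct else 0).
records = [(412, 1), (1305, 0), (287, 1), (956, 1), (2210, 0)]   # placeholder data

n_tokens = [n for n, _ in records]
correct = [c for _, c in records]

r, p_value = pointbiserialr(correct, n_tokens)   # the binary variable goes first
print(f"point-biserial r = {r:.3f}, p = {p_value:.3g}")
```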
On some questions, replacing the content between the `<think>` (beginning of thinking) and `</think>` (end of thinking) tokens with a blank space or a sequence of random tokens has no influence on the model's choice of answer.
Feb 24th:
Distilled a dataset of 4,688 multiple-choice questions with DeepSeek-R1-Distill-Qwen-7B's answers before/after thinking.
Have code that collects the logit of the correct answer throughout the reasoning progress. Currently working on creating visualizations of the reasoning progress. Once this part is done, I will give probing a try and potentially cluster these time series at the same time.
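A simplified sketch of that logit-collection step is below. It reads off the next-token probability of each answer letter at every CoT position from a single forward pass, which is only an approximation of truncating the thinking and re-prompting for an answer; the actual collection code may differ:

```python
# Sketch (not necessarily the exact method used here): per-position next-token
# probability of each answer letter along a generated chain of thought.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Token id of each choice letter (with a leading space, as it would follow a prompt).
choice_ids = {c: tok(f" {c}", add_special_tokens=False).input_ids[0] for c in "ABCD"}

def choice_probs_over_cot(prompt_plus_cot: str) -> dict[str, torch.Tensor]:
    """Per-position next-token probability of each choice letter along the CoT."""
    ids = tok(prompt_plus_cot, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]        # (seq_len, vocab)
    probs = logits.softmax(-1)
    return {c: probs[:, i] for c, i in choice_ids.items()}

# traces["B"][t] ~ how strongly the model leans toward "B" right after token t.
traces = choice_probs_over_cot("Question ... <think> Let me think ...")   # placeholder prompt
```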
March 6th:
See this website https://yc015.github.io/reasoning-progress-viz/ for a collection of visualizations of the distilled Qwen-7B model's reasoning progress on the MMLU dataset.
March 8th:
Now, the visualizations for most of the questions in the first two rows of this website are ready to be viewed.
If you click on any subject, you will be redirected to a gallery of the Qwen-7B's thinking process on the questions from that subject.
How to read these visualizations: Each colored matrix is a thumbnail of the actual thinking process. The thinking tokens are colored using four colors: green indicates the probability that the model outputs the correct choice if we stop the reasoning right at that token, while the other three colors indicate the probabilities of the three other (wrong) choices.
Question 12 thumbnail:
[thumbnail image]
Detailed reasoning visualization for question 12:
[detailed visualization image]
Does the model's internal reasoning outrun its external output? Or does it speak too fast, leaving behind its "thought"?