Autoparser misclassifies all output as reasoning for templates with /no_think toggling (Nemotron-Nano-9B-v2) #20754

@janbernloehr

Problem

The FORCED_OPEN workaround in chat-diff-analyzer.cpp (lines 28-42) matches templates whose source contains content.split('</think>') but not reasoning_content, and sets the reasoning mode to reasoning_mode::FORCED_OPEN. The workaround was designed for old Qwen/DeepSeek thinking templates, but it also matches NVIDIA-Nemotron-Nano-9B-v2, which supports per-message thinking toggling via /no_think.

For Nemotron-Nano-9B-v2, this causes every streaming SSE chunk to carry reasoning_content instead of content, because:

  1. The FORCED_OPEN PEG parser (optional(literal(start)) + reasoning(until(end)) + end) makes the reasoning block mandatory
  2. In lenient (streaming) mode, until("</think>") returns NEED_MORE_INPUT when </think> hasn't appeared yet
  3. NEED_MORE_INPUT propagates through the AST, tagging all accumulated output as reasoning
  4. When </think> never appears (e.g., thinking exceeds max_tokens), every token is classified as reasoning_content
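
The four steps above can be illustrated with a toy model. This is only a sketch of the described behavior, not llama.cpp's PEG parser: the Split struct and the split_forced_open helper are hypothetical names, and the tag strings are assumed defaults.

```cpp
#include <string>

// Toy stand-in for the FORCED_OPEN rule; NOT llama.cpp's actual PEG parser.
struct Split {
    std::string reasoning;
    std::string content;
};

// Hypothetical helper: classify the text accumulated so far the way a
// mandatory "reasoning until end tag" rule behaves on partial input.
Split split_forced_open(const std::string & text,
                        const std::string & start = "<think>",
                        const std::string & end   = "</think>") {
    size_t pos = 0;
    // optional(literal(start)): skip the start tag only if it is present
    if (text.compare(0, start.size(), start) == 0) {
        pos = start.size();
    }
    const size_t close = text.find(end, pos);
    if (close == std::string::npos) {
        // End tag not seen yet: everything so far counts as reasoning.
        return { text.substr(pos), "" };
    }
    // End tag seen: text before it is reasoning, text after it is content.
    return { text.substr(pos, close - pos), text.substr(close + end.size()) };
}
```

In this model, a response that never emits </think> yields an empty content field no matter how long it grows, which matches the symptom reported above.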

This breaks OpenAI-compatible clients that don't handle reasoning_content in streaming deltas.

Affected models

Only NVIDIA-Nemotron-Nano-9B-v2. Other templates that trigger the workaround (DeepSeek-R1 variants, QwQ, rwkv-world) are unaffected because they don't have /no_think toggling.

Proposed fix

Two changes (tested and verified):

1. common/chat-diff-analyzer.cpp: Exclude templates that mention no_think from the FORCED_OPEN workaround. The autoparser can't reliably handle templates where thinking is toggled per-message via template logic.

if (tmpl.src.find("content.split('</think>')") != std::string::npos &&
    tmpl.src.find("reasoning_content") == std::string::npos &&
    tmpl.src.find("no_think") == std::string::npos &&  // NEW
    analysis.reasoning.mode == reasoning_mode::NONE) {
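
For experimenting outside the codebase, the same condition can be sketched as a standalone predicate. The function name should_force_open and the mode_is_none parameter are hypothetical; the real check reads tmpl.src and analysis.reasoning.mode directly.

```cpp
#include <string>

// Hypothetical standalone version of the condition above. mode_is_none stands
// in for `analysis.reasoning.mode == reasoning_mode::NONE`.
bool should_force_open(const std::string & src, bool mode_is_none) {
    auto has = [&](const char * needle) {
        return src.find(needle) != std::string::npos;
    };
    return has("content.split('</think>')") &&
           !has("reasoning_content") &&
           !has("no_think") &&  // proposed exclusion
           mode_is_none;
}
```

Because this is a plain substring test, any template that mentions no_think anywhere in its source is excluded, including in a comment.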

2. common/chat-auto-parser-generator.cpp: Make the whole FORCED_OPEN reasoning block optional (defensive, matches TAG_BASED behavior). Currently only the start tag is optional; the reasoning body and the end tag are mandatory. This is a no-op in the current always-lenient architecture but makes FORCED_OPEN consistent with TAG_BASED.

// Before:
return p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end;
// After:
return p.optional(p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end);
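
The difference would only show up in a strict (complete-input) parse. The sketch below is a toy model under ordinary PEG semantics, not the llama.cpp parser: Parsed, parse_strict and the wrap_optional flag are hypothetical names used only for illustration.

```cpp
#include <string>

struct Parsed {
    std::string reasoning;
    std::string content;
};

// Toy strict parse. Returns false when the grammar cannot match the input.
// wrap_optional = false models the "Before" rule (mandatory reasoning block),
// wrap_optional = true models the "After" rule (whole block optional).
bool parse_strict(const std::string & text, bool wrap_optional, Parsed & out,
                  const std::string & start = "<think>",
                  const std::string & end   = "</think>") {
    size_t pos = 0;
    if (text.compare(0, start.size(), start) == 0) {
        pos = start.size();
    }
    const size_t close = text.find(end, pos);
    if (close != std::string::npos) {
        // Both variants agree when the end tag is present.
        out = { text.substr(pos, close - pos), text.substr(close + end.size()) };
        return true;
    }
    if (wrap_optional) {
        // The optional block matches nothing and the parser backtracks to the
        // start of the input, so the whole text is left over as content.
        out = { "", text };
        return true;
    }
    return false;  // mandatory block with no end tag: the parse fails
}
```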

Testing

Verified with NVIDIA-Nemotron-Nano-9B-v2 (bf16 and q4_k_m) and DeepSeek-R1-Distill-Llama-8B (q4_k_m) on GB200:

  • Nemotron: previously 100% failure → now passes
  • DeepSeek-R1: no regression
