Problem
Three different prompt formats exist in the codebase, creating a distribution shift between training stages:
- SFT (`next_action.py`): includes action history, "Thought: ..." format, and a step counter
- GRPO (`trainer.py`, `_build_agent_messages`): "Goal: ..." with no history or thought
- CoT warmup (`cot_warmup.py`): "Instruction: ..." prefix, different history format
A model trained on the SFT format then encounters a different prompt format during GRPO rollouts, degrading early GRPO performance.
Proposed Fix
All three stages should use a single shared prompt builder with configurable components (history, thought, step counter), so the distribution seen during SFT matches what the model encounters during GRPO rollouts.
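A minimal sketch of what such a shared builder could look like. The names here (`PromptConfig`, `build_agent_prompt`, the field names) are hypothetical, not taken from the repo; the idea is that each stage passes its own config into one function instead of formatting prompts independently:

```python
from dataclasses import dataclass

@dataclass
class PromptConfig:
    # Toggles for the components that currently differ across the three builders
    include_history: bool = True
    include_thought: bool = True
    include_step_counter: bool = True
    goal_prefix: str = "Goal"  # e.g. "Goal", "Instruction"

def build_agent_prompt(goal: str, history: list[str], step: int, cfg: PromptConfig) -> str:
    """Build the agent prompt from one place so SFT/GRPO/warmup stay in sync."""
    lines = [f"{cfg.goal_prefix}: {goal}"]
    if cfg.include_history and history:
        lines.append("History:")
        lines.extend(f"  {i + 1}. {action}" for i, action in enumerate(history))
    if cfg.include_step_counter:
        lines.append(f"Step: {step}")
    if cfg.include_thought:
        lines.append("Thought:")
    return "\n".join(lines)
```

Each stage would then define its config once, e.g. `PromptConfig()` for SFT and `PromptConfig(include_history=False, include_thought=False, include_step_counter=False)` for the current GRPO format, making any remaining format differences explicit and deliberate.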