Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
Abstract
Post-training techniques like supervised fine-tuning and reinforcement learning lead to the emergence of specialized attention heads that support structured reasoning, with different training regimes affecting their evolution and performance.
The remarkable capabilities of modern large reasoning models are largely unlocked through post-training techniques such as supervised fine-tuning and reinforcement learning. However, the architectural mechanisms behind such improvements remain poorly understood. In this work, we use circuit analysis to demonstrate that post-training for complex reasoning sparks the emergence of novel, functionally specialized attention heads. These heads collectively support structured reasoning and computation. Our comparative analysis across the Qwen model family and a DeepSeek-distilled model reveals that these emergent heads evolve differently under different training regimes. Distillation and SFT foster a cumulative addition of stable reasoning heads. In contrast, group relative policy optimization operates in a dynamic search mode: relatively few attention heads are iteratively activated, evaluated, and pruned, with their survival closely tracking fluctuations in the task reward signal. Furthermore, we find that controllable think-on/off models do not possess dedicated thinking heads. Instead, turning off explicit reasoning triggers a broader, but less efficient, set of compensatory heads. Through ablation and qualitative analyses, we connect these circuit-level dynamics to a crucial performance trade-off: strengthened heads enable sophisticated problem-solving strategies for difficult problems but can also introduce overthinking failure modes, such as calculation errors or logical loops on simpler tasks. These findings connect circuit-level dynamics to macro-level performance, identifying an inherent tension where complex reasoning comes at the cost of elementary computations. More broadly, our work points to future directions for training policy design, emphasizing the need to balance the development of effective reasoning strategies with the assurance of reliable, flawless execution.
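To make the ablation methodology above concrete, here is a minimal sketch (not the authors' code) of zero-ablating a single attention head in a Hugging Face causal LM by intercepting the input to the attention output projection; the checkpoint name and the layer/head indices are placeholders, not heads identified in the paper.

```python
# Minimal sketch of a head-ablation probe (illustrative; not the authors' code).
# Zero out one head's contribution by intercepting the input to the attention
# output projection (o_proj), which is the concatenation of per-head outputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder Qwen-style checkpoint
LAYER, HEAD = 12, 7                   # hypothetical layer/head to ablate

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
head_dim = model.config.hidden_size // model.config.num_attention_heads

def ablate_head(module, args):
    hidden = args[0].clone()  # (batch, seq, hidden): concatenated head outputs
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,) + args[1:]

hook = model.model.layers[LAYER].self_attn.o_proj.register_forward_pre_hook(ablate_head)

prompt = "Compute 17 * 24 step by step."
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

hook.remove()  # restore the unmodified model
```

Comparing ablated and unablated generations on easy versus hard problems is one simple way to surface the complexity/reliability trade-off described above.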
Community
Modern large reasoning models boost performance via post-training methods like supervised fine-tuning and reinforcement learning—but how these gains arise internally has remained a mystery.
In our new work, we look under the hood using circuit analysis to reveal:
(1) Post-training triggers the emergence of specialized attention heads that coordinate to carry out structured reasoning.
(2) Different training regimes steer different dynamics: SFT/distillation yield stable, cumulative reasoning heads, while policy optimization leads to iterative activation and pruning (a simple probe of this is sketched after this list).
(3) Strong reasoning heads boost advanced problem solving—but risk “overthinking” errors on simpler tasks, revealing a tension between complexity and reliability.
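As a rough, hypothetical illustration of point (2), the probe below compares mean per-head output norms between a base and a post-trained checkpoint on the same prompts and flags the heads that strengthen most; the checkpoint names and prompts are placeholders, and this is a simplification of circuit analysis, not the paper's method.

```python
# Hypothetical probe (a simplification, not the paper's method): rank heads by
# how much their mean output norm grows from a base to a post-trained model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_head_norms(model_name, prompts):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    norms = torch.zeros(cfg.num_hidden_layers, cfg.num_attention_heads)
    hooks = []

    def make_hook(layer_idx):
        def hook(module, args):
            x = args[0]  # (batch, seq, hidden): concatenated per-head outputs
            per_head = x.view(x.shape[0], x.shape[1], cfg.num_attention_heads, head_dim)
            norms[layer_idx] += per_head.norm(dim=-1).mean(dim=(0, 1)).detach()
        return hook

    for i, layer in enumerate(model.model.layers):
        hooks.append(layer.self_attn.o_proj.register_forward_pre_hook(make_hook(i)))
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return norms / len(prompts)

prompts = ["What is 12 * 13?", "If x + 3 = 7, what is x?"]      # placeholders
base = mean_head_norms("Qwen/Qwen2.5-1.5B", prompts)            # placeholder base
tuned = mean_head_norms("Qwen/Qwen2.5-1.5B-Instruct", prompts)  # placeholder tuned
growth = tuned - base
vals, idxs = growth.flatten().topk(5)
for v, i in zip(vals.tolist(), idxs.tolist()):
    print(f"layer {i // growth.shape[1]}, head {i % growth.shape[1]}: {v:+.3f}")
```

Running the same statistic over intermediate RL checkpoints would, on the paper's account, show heads appearing and later being pruned as the task reward fluctuates.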
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Parallel-R1: Towards Parallel Thinking via Reinforcement Learning (2025)
- Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs (2025)
- Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners (2025)
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking (2025)
- Train Long, Think Short: Curriculum Learning for Efficient Reasoning (2025)
- Apriel-Nemotron-15B-Thinker (2025)
- Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models (2025)