Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
Abstract
Test-time scaling does not consistently improve accuracy or reduce hallucinations in knowledge-intensive tasks, often leading to overconfident errors.
Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reductions in hallucination often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, enabling thinking remains beneficial compared to the non-thinking setting. Code and data are available at https://github.com/XuZhao0/tts-knowledge
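To make the evaluation described in the abstract concrete, here is a minimal sketch (not the authors' released code; the `Result` record, the `thinking_budget` knob, and the `summarize` helper are hypothetical names) of how per-budget accuracy, hallucination rate, and abstention rate can be decomposed. This is the decomposition the abstract alludes to when it distinguishes abstention-driven hallucination reductions from genuine gains in factual recall.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """One graded model response (hypothetical schema for illustration)."""
    question_id: str
    thinking_budget: int        # assumed knob: max reasoning tokens allowed
    answer: Optional[str]       # None means the model abstained ("I don't know")
    is_correct: bool            # graded against the gold answer


def summarize(results: list) -> dict:
    """Group results by thinking budget and report accuracy, hallucination,
    and abstention rates.

    Hallucination rate = fraction of questions that were answered (not
    abstained) but answered incorrectly. Tracking abstention separately is
    what lets one tell "fewer hallucinations because the model gave up"
    apart from "fewer hallucinations because recall improved".
    """
    by_budget = defaultdict(list)
    for r in results:
        by_budget[r.thinking_budget].append(r)

    summary = {}
    for budget, rs in sorted(by_budget.items()):
        n = len(rs)
        correct = sum(r.is_correct for r in rs)
        abstained = sum(r.answer is None for r in rs)
        hallucinated = n - correct - abstained   # answered but wrong
        summary[budget] = {
            "accuracy": correct / n,
            "hallucination_rate": hallucinated / n,
            "abstention_rate": abstained / n,
        }
    return summary
```

Read row by row across budgets: if accuracy is flat while abstention rises and hallucination falls, the apparent improvement comes from abstaining more, not from recalling more facts.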
Community
Similar papers recommended by the Semantic Scholar API (via Librarian Bot):
- Inverse Scaling in Test-Time Compute (2025)
- Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis (2025)
- Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models (2025)
- OptimalThinkingBench: Evaluating Over and Underthinking in LLMs (2025)
- Deep Think with Confidence (2025)
- Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning (2025)
- Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation (2025)