Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness Paper • 2505.22960 • Published May 29 • 16
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models Paper • 2505.17225 • Published May 22 • 65