MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 174
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge Paper • 2509.07968 • Published Sep 9, 2025 • 14
Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet Paper • 2509.06861 • Published Sep 8, 2025 • 8
Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning Paper • 2502.11962 • Published Feb 17, 2025 • 38
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Paper • 2506.02096 • Published Jun 2, 2025 • 52
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Paper • 2505.10554 • Published May 15, 2025 • 120