ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement Paper • 2604.01591 • Published 7 days ago • 32
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published 3 days ago • 212
StatEval: A Comprehensive Benchmark for Large Language Models in Statistics Paper • 2510.09517 • Published Oct 10, 2025 • 8
QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining Paper • 2602.07085 • Published Feb 6 • 190
Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening Paper • 2602.05386 • Published Feb 5 • 69
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning Paper • 2601.06943 • Published Jan 11 • 215