Abstract
A framework evaluates DeepResearch systems by assessing the quality, redundancy, and factuality of their research reports using an LLM-as-a-Judge methodology.
DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging because research scenarios are open-ended and existing benchmarks focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions, quality, redundancy, and factuality, using an LLM-as-a-Judge methodology that achieves strong concordance with expert judgments. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.
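As a rough sketch of how an LLM-as-a-Judge pass over the three report dimensions could be wired up, the snippet below defines a minimal scoring loop. The rubric wording, the `call_llm` callable, and the JSON reply format are illustrative assumptions, not details taken from the paper or its repository.

```python
import json
from typing import Callable, Dict

# Illustrative rubric only -- the paper's actual evaluation criteria for
# quality, redundancy, and factuality are not reproduced here.
DIMENSIONS = {
    "quality": "Rate the report's depth, organization, and insightfulness on a 1-10 scale.",
    "redundancy": "Rate how free the report is of repeated content on a 1-10 scale (10 = no redundancy).",
    "factuality": "Rate how well the report's claims are supported by its cited sources on a 1-10 scale.",
}

def judge_report(report: str, call_llm: Callable[[str], str]) -> Dict[str, float]:
    """Score a single research report on each dimension with an LLM judge.

    `call_llm` is assumed to take a prompt string and return the judge
    model's raw text reply, expected to be a JSON object like {"score": 7}.
    Any chat-completion client can be wrapped to fit this signature.
    """
    scores: Dict[str, float] = {}
    for name, instruction in DIMENSIONS.items():
        prompt = (
            "You are an expert evaluator of research reports.\n"
            f"{instruction}\n"
            'Reply with JSON only, e.g. {"score": 7}.\n\n'
            f"Report:\n{report}"
        )
        raw = call_llm(prompt)
        scores[name] = float(json.loads(raw)["score"])
    return scores
```

Averaging such per-dimension scores over a query set would yield a system-level comparison; the actual prompts, scales, and aggregation used by DeepResearch-ReportEval are in the linked repository.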
Community
Excited to share our new work: "Understanding DeepResearch via Reports"! 📄
We hope it sparks discussion around how we build and evaluate DeepResearch systems. 🤔
🔍 Motivation: In DeepResearch, the final report—not just search results—is what truly matters to users. So, what else should we evaluate beyond retrieval?
🧪 Experiment: We built 100 real-world research questions across 12 domains and tested 4 leading commercial systems.
💡 Key Insights (see Sec 4!):
– The often-overlooked pre-research phase (e.g., query clarification, LLM follow-up questions) is far more critical than we thought.
– Search in DeepResearch prioritizes breadth over a single “perfect” answer—shifting how we think about retrieval.
There’s so much more to explore in this space—and surprisingly little discussion so far! 🔄
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Personalized Deep Research: Benchmarks and Evaluations (2025)
- DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks (2025)
- WebResearcher: Unleashing Unbounded Reasoning Capability in Long-Horizon Agents (2025)
- DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence (2025)
- DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis (2025)
- SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs? (2025)
- A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports (2025)