TRAIL: Trace Reasoning and Agentic Issue Localization
Abstract
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces, an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present TRAIL, a set of 148 large human-annotated traces constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single- and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long-context LLMs perform poorly at trace debugging, with the best-performing model, Gemini 2.5 Pro, scoring a mere 11% on TRAIL. Our dataset and code are publicly available to support and accelerate future research on scalable evaluation of agentic workflows.
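The abstract notes that the dataset and code are publicly released. Below is a minimal sketch of pulling the traces with the Hugging Face `datasets` library; the repository ID and split handling are assumptions for illustration, not details confirmed on this page.

```python
# Hedged sketch: loading the publicly released TRAIL traces with the Hugging Face
# `datasets` library. The repository ID below is an assumption, not confirmed by
# this page; substitute the ID from the official release if it differs.
from datasets import load_dataset

dataset = load_dataset("PatronusAI/TRAIL")   # assumed repo ID
print(dataset)                               # inspect splits and features

first_split = next(iter(dataset))            # e.g. "train", depending on the release
print(dataset[first_split][0])               # peek at one annotated trace record
```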
Community
TRAIL is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs: the best model, Gemini-2.5-Pro-preview, achieves only 11% accuracy, highlighting the difficulty of trace debugging and the need for better evaluation of complex agent workflows.
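To make the annotation structure above more concrete, here is a minimal, hypothetical sketch of what an annotated trace record could look like. The field names, the fine-grained error label, and all example values are illustrative assumptions; only the high-level reasoning/execution/planning categories come from the summary above, and the actual TRAIL schema may differ.

```python
# Hypothetical sketch of an annotated-trace record: an agentic workflow trace whose
# errors are labeled with a taxonomy of reasoning / execution / planning categories.
# All field names and example values are illustrative assumptions, not the TRAIL schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ErrorAnnotation:
    span_id: str      # identifier of the trace span where the error occurs
    category: str     # high-level category: "reasoning", "execution", or "planning"
    error_type: str   # finer-grained taxonomy label (hypothetical value)
    evidence: str     # annotator's justification for the label
    impact: str       # severity assessment, e.g. "low" / "medium" / "high"

@dataclass
class AnnotatedTrace:
    trace_id: str
    source_benchmark: str   # e.g. a software-engineering or open-world IR benchmark
    agent_system: str       # single- or multi-agent system that produced the trace
    errors: List[ErrorAnnotation] = field(default_factory=list)

# Example usage with made-up values:
trace = AnnotatedTrace(
    trace_id="trace-0001",
    source_benchmark="software-engineering",
    agent_system="multi-agent",
    errors=[
        ErrorAnnotation(
            span_id="span-17",
            category="execution",
            error_type="tool-output-misuse",  # hypothetical taxonomy label
            evidence="Agent ignored a failing test reported by the tool call.",
            impact="high",
        )
    ],
)
print(f"{trace.trace_id}: {len(trace.errors)} annotated error(s)")
```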
The following papers were recommended by the Semantic Scholar API
- Survey on Evaluation of LLM-based Agents (2025)
- From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review (2025)
- Why Do Multi-Agent LLM Systems Fail? (2025)
- SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow (2025)
- Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant (2025)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute (2025)
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems (2025)