BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions
Abstract
BIRD-INTERACT is a benchmark for multi-turn text-to-SQL tasks that simulates realistic database assistant challenges through dynamic interactions, hierarchical knowledge bases, and autonomous decision-making.
Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
Community
47 pages, 26 figures, 11 tables. Submitted to arXiv; based on work from The BIRD Team and Google Cloud. Dataset and code available at https://bird-interact.github.io/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench (2025)
- Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks (2025)
- TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant? (2025)
- KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues (2025)
- SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction (2025)
- VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications (2025)
- AmbiSQL: Interactive Ambiguity Detection and Resolution for Text-to-SQL (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 3
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper