Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
Abstract
Double-Bench is a large-scale, multilingual, and multimodal evaluation system for document Retrieval-Augmented Generation (RAG) systems, addressing limitations in current benchmarks and providing comprehensive assessments of system components.
Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on a specific part of the document RAG pipeline and rely on synthetic data with incomplete ground-truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces fine-grained assessments of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic update support to mitigate potential data contamination. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need for stronger document retrieval models. Our findings also reveal an over-confidence dilemma in current document RAG frameworks, which tend to provide answers even without evidence support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research on advanced document RAG systems. We plan to collect up-to-date corpora and release new benchmarks annually.
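To make the abstract's notion of component-level assessment concrete, the sketch below shows one common way to score the retrieval component against gold evidence pages using page-level recall@k. It is an illustrative example only, not the paper's exact protocol; the function names and data layout are assumptions.

```python
# Illustrative sketch (not the paper's exact metric): score a retriever on
# evidence-grounded queries with page-level recall@k, assuming each query lists
# its gold evidence page IDs and the retriever returns ranked page IDs.
from typing import Dict, List


def recall_at_k(retrieved: List[str], gold: List[str], k: int = 5) -> float:
    """Fraction of gold evidence pages found among the top-k retrieved pages."""
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for page in gold if page in top_k) / len(gold)


def evaluate_retriever(results: Dict[str, List[str]],
                       gold_evidence: Dict[str, List[str]],
                       k: int = 5) -> float:
    """Average recall@k over all queries; `results` maps query ID -> ranked page IDs."""
    scores = [recall_at_k(results.get(qid, []), gold, k)
              for qid, gold in gold_evidence.items()]
    return sum(scores) / len(scores) if scores else 0.0
```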
Community
Excited to share our latest work on revolutionizing RAG evaluation!
We introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces fine-grained assessments of each component within document RAG systems: [2508.03644] Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
With 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, Double-Bench addresses critical gaps in current benchmarks that rely on synthetic data and incomplete ground truth. Our human-verified evaluation framework finally provides the comprehensive, real-world assessment that the RAG community desperately needs!
This work tackles one of the biggest bottlenecks in advancing document understanding with MLLMs.
Project Homepage: https://double-bench.github.io/
Code: https://github.com/Episoode/Double-Bench
Dataset: https://huggingface.co/datasets/Episoode/Double-Bench
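For readers who want to explore the data directly, here is a minimal loading sketch assuming the dataset follows standard Hugging Face `datasets` conventions; the split and column names are not taken from the dataset card and may differ.

```python
# Minimal sketch: load Double-Bench from the Hugging Face Hub.
# The repo ID comes from the dataset link above; splits and columns are
# assumptions and should be checked against the dataset card before use.
from datasets import load_dataset

ds = load_dataset("Episoode/Double-Bench")   # may require a config name
print(ds)                                    # inspect available splits

first_split = next(iter(ds.values()))
print(first_split.column_names)              # e.g. query, language, evidence pages
print(first_split[0])                        # look at one record
```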
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation (2025)
- SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement (2025)
- TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning (2025)
- PDF Retrieval Augmented Question Answering (2025)
- mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering (2025)
- WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts (2025)
- Question Decomposition for Retrieval-Augmented Generation (2025)