PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
Abstract
PRELUDE is a benchmark that evaluates long-context understanding by assessing whether prequel stories are consistent with the original books, revealing a substantial gap between models and humans.
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. The task demands more global comprehension and deeper reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the difficulty of the task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, lag behind humans by more than 15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
Community
Homepage: https://gorov.github.io/prelude
Leaderboard: https://gorov.github.io/prelude/leaderboard.html
Dataset: https://huggingface.co/datasets/ttchungc/PRELUDE
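
For readers who want to try the benchmark, the sketch below shows one way to load the dataset from the Hugging Face Hub and frame a single instance as a binary consistency judgment, mirroring the task described in the abstract. The split name and field names (book, character, prequel, label) are assumptions for illustration only; consult the dataset card for the actual schema.

```python
# Minimal sketch (assumptions labeled): load PRELUDE and format one instance
# as a yes/no consistency question for an LLM. The split and field names
# ("train", "book", "character", "prequel", "label") are illustrative guesses,
# not the confirmed schema -- check the dataset card before running.
from datasets import load_dataset

dataset = load_dataset("ttchungc/PRELUDE", split="train")  # split name is an assumption


def build_prompt(example):
    """Format one instance as a binary consistency judgment."""
    return (
        f"Original book: {example['book']}\n"
        f"Character: {example['character']}\n"
        f"Prequel story: {example['prequel']}\n\n"
        "Question: Is this prequel story consistent with the character's canonical "
        "narrative in the original book? Answer 'consistent' or 'inconsistent'."
    )


example = dataset[0]
print(build_prompt(example))
print("Gold label:", example["label"])  # field name is an assumption
```

In practice, the prompt would be sent to an LLM (with the book text in context or retrieved via RAG, as in the paper's baselines) and the model's answer compared against the gold label.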
Related papers recommended by the Semantic Scholar API:
- SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension (2025)
- Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization (2025)
- SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification (2025)
- LAG: Logic-Augmented Generation from a Cartesian Perspective (2025)
- LastingBench: Defend Benchmarks Against Knowledge Leakage (2025)
- HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (2025)
- Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025)