Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding Paper • 2506.06275 • Published Jun 6, 2025
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks Paper • 2406.18403 • Published Jun 26, 2024
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition Paper • 2407.04559 • Published Jul 5, 2024