Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Paper • 2505.15517 • Published May 21, 2025 • 4
Token-Efficient Long Video Understanding for Multimodal LLMs Paper • 2503.04130 • Published Mar 6, 2025 • 95
Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles Paper • 2309.09369 • Published Sep 17, 2023 • 1
Art or Artifice? Large Language Models and the False Promise of Creativity Paper • 2309.14556 • Published Sep 25, 2023
Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning Paper • 2306.01150 • Published Jun 1, 2023
Next Steps for Human-Centered Generative AI: A Technical Perspective Paper • 2306.15774 • Published Jun 27, 2023
MixQG: Neural Question Generation with Mixed Answer Types Paper • 2110.08175 • Published Oct 15, 2021
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization Paper • 2111.09525 • Published Nov 18, 2021
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors Paper • 2205.12854 • Published May 25, 2022
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents Paper • 2404.10774 • Published Apr 16, 2024 • 4
Prompt Leakage effect and defense strategies for multi-turn LLM interactions Paper • 2404.16251 • Published Apr 24, 2024
CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments Paper • 2411.02305 • Published Nov 4, 2024 • 1
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems Paper • 2407.01370 • Published Jul 1, 2024 • 90
Introducing v0.5 of the AI Safety Benchmark from MLCommons Paper • 2404.12241 • Published Apr 18, 2024 • 12
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference Paper • 2403.04132 • Published Mar 7, 2024 • 41