floom's Collections
Evaluation
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Paper • 2403.04132 • Published • 41
Evaluating Very Long-Term Conversational Memory of LLM Agents
Paper • 2402.17753 • Published • 20
The FinBen: An Holistic Financial Benchmark for Large Language Models
Paper • 2402.12659 • Published • 22
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Paper • 2402.13249 • Published • 13
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Paper • 2405.01535 • Published • 123
To Believe or Not to Believe Your LLM
Paper • 2406.02543 • Published • 35
Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
Paper • 2406.11402 • Published • 6
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Paper • 2406.12624 • Published • 38
Paper • 2408.02666 • Published • 30
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Paper • 2504.21117 • Published • 25
AutoLibra: Agent Metric Induction from Open-Ended Feedback
Paper • 2505.02820 • Published • 3
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Paper • 2505.00212 • Published • 5
Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant
Paper • 2504.18373 • Published • 2
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Paper • 2505.03981 • Published • 14