Evaluation - a aslessor Collection

Evaluation

updated Sep 1

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
Paper • 2408.00765 • Published Aug 1, 2024 • 14

Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
Paper • 2407.21646 • Published Jul 31, 2024 • 18

LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
Paper • 2408.04284 • Published Aug 8, 2024 • 25

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
Paper • 2408.07852 • Published Aug 14, 2024 • 16

GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Paper • 2409.06595 • Published Sep 10, 2024 • 38

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Paper • 2409.07314 • Published Sep 11, 2024 • 56

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
Paper • 2409.11242 • Published Sep 17, 2024 • 7

LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper • 2410.02712 • Published Oct 3, 2024 • 37

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Paper • 2410.05262 • Published Oct 7, 2024 • 11

A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Paper • 2409.15277 • Published Sep 23, 2024 • 38

Fusion-Eval: Integrating Evaluators with LLMs
Paper • 2311.09204 • Published Nov 15, 2023 • 6

HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Paper • 2505.24098 • Published May 30 • 43

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Paper • 2508.18076 • Published Aug 25 • 6