MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models • arXiv:2410.17578 • Published Oct 23, 2024
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation • arXiv:2412.10424 • Published Dec 10, 2024
Evaluating Language Models as Synthetic Data Generators • arXiv:2412.03679 • Published Dec 4, 2024
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling • arXiv:2411.18664 • Published Nov 27, 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models • arXiv:2406.05761 • Published Jun 9, 2024
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models • arXiv:2405.01535 • Published May 2, 2024
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean • arXiv:2403.06412 • Published Mar 11, 2024