ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists Paper • 2506.01241 • Published Jun 2
ThinkPRM Collection Process Reward Models that Think -- https://arxiv.org/abs/2504.16828 • 8 items • Updated 21 days ago
FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation Paper • 2410.22257 • Published Oct 29, 2024