GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning Paper • 2505.22661 • Published May 28, 2025 • 1
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations Paper • 2504.10481 • Published Apr 14, 2025 • 84
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles Paper • 2410.05262 • Published Oct 7, 2024 • 11
Internal Consistency and Self-Feedback in Large Language Models: A Survey Paper • 2407.14507 • Published Jul 19, 2024 • 47
Grimoire is All You Need for Enhancing Large Language Models Paper • 2401.03385 • Published Jan 7, 2024 • 5
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models Paper • 2405.11874 • Published May 20, 2024 • 7