BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks Jun 18, 2024 • 46
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories Paper • 2504.08942 • Published 4 days ago • 16
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning Paper • 2504.07128 • Published 14 days ago • 72
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility Paper • 2504.07086 • Published 6 days ago • 17
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens Paper • 2504.07096 • Published 6 days ago • 66
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published 8 days ago • 158
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap Paper • 2402.19450 • Published Feb 29, 2024 • 3
ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering Paper • 2410.05077 • Published Oct 7, 2024 • 2
Exploring Non-Verbal Predicates in Semantic Role Labeling: Challenges and Opportunities Paper • 2307.01870 • Published Jul 4, 2023
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models Paper • 2409.17892 • Published Sep 26, 2024 • 2
ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding Paper • 2504.00019 • Published 19 days ago
IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators Paper • 2403.03894 • Published Mar 6, 2024