Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? Paper • 2511.13646 • Published Nov 17, 2025 • 8
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions Paper • 2406.15877 • Published Jun 22, 2024 • 48
Evaluating Language Models for Efficient Code Generation Paper • 2408.06450 • Published Aug 12, 2024
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation Paper • 2305.01210 • Published May 2, 2023 • 3
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM Paper • 2403.19114 • Published Mar 28, 2024 • 1
A Unified Debugging Approach via LLM-Based Multi-Agent Synergy Paper • 2404.17153 • Published Apr 26, 2024
Agentless: Demystifying LLM-based Software Engineering Agents Paper • 2407.01489 • Published Jul 1, 2024 • 65
Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair Paper • 2309.00608 • Published Sep 1, 2023 • 2
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts Paper • 2404.15247 • Published Apr 23, 2024 • 3
StarCoder 2 and The Stack v2: The Next Generation Paper • 2402.19173 • Published Feb 29, 2024 • 152
NeuRI: Diversifying DNN Generation via Inductive Rule Inference Paper • 2302.02261 • Published Feb 4, 2023 • 3
NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers Paper • 2207.13066 • Published Jul 26, 2022