Abstract
Issue resolution is the task of modifying a codebase to generate a patch that addresses a given issue. However, existing benchmarks such as SWE-bench focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It comprises 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, ensuring an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute to and expand the dataset. We envision Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
Community
Leaderboard: https://multi-swe-bench.github.io
Code: https://github.com/multi-swe-bench/multi-swe-bench
Multi-SWE-bench Data: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench
Multi-SWE-RL Data: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-RL
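The sketch below shows one way the released data might be loaded from the Hugging Face Hub with the `datasets` library. It is a minimal example under assumptions: the configuration name, split, and field names (e.g., `instance_id`) are placeholders and may not match the actual dataset layout; consult the repository's tutorials for the supported loading path.

```python
# Minimal sketch: loading Multi-SWE-bench instances from the Hugging Face Hub.
# Assumptions (not confirmed by this page): the dataset exposes a default
# configuration with a "train" split and per-instance fields such as
# "instance_id". Adjust names to match the dataset card if they differ.
from datasets import load_dataset

dataset = load_dataset("ByteDance-Seed/Multi-SWE-bench", split="train")

print(len(dataset))        # number of benchmark instances in this split
print(dataset[0].keys())   # inspect the fields available for one instance
```

The same pattern would apply to the Multi-SWE-RL data (`ByteDance-Seed/Multi-SWE-RL`), again subject to the actual configuration and split names published on the dataset card.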
Related papers recommended by the Semantic Scholar API:
- DependEval: Benchmarking LLMs for Repository Dependency Understanding (2025)
- EnvBench: A Benchmark for Automated Environment Setup (2025)
- BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models (2025)
- CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models (2025)
- WritingBench: A Comprehensive Benchmark for Generative Writing (2025)
- Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications (2025)
- ModiGen: A Large Language Model-Based Workflow for Multi-Task Modelica Code Generation (2025)