arxiv:2506.10954

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Published on Jun 12 · Submitted by itaowe on Jun 13
#2 Paper of the day
Abstract

A pipeline named SWE-Factory automates the creation and validation of GitHub issue resolution datasets for training and evaluating Large Language Models, using SWE-Builder for environment setup, exit-code-based grading, and automated fail2pass validation.

AI-generated summary

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. The pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction; it employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
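To make the grading and validation ideas concrete, here is a minimal Python sketch of exit-code-based grading followed by a fail2pass check. It is an illustration under stated assumptions, not the SWE-Factory implementation: the names `RunResult`, `run_tests`, and `is_fail2pass` are hypothetical, and the sketch assumes that a test command signals success with exit code 0 and failure with any non-zero code.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one test run, reduced to its exit code (hypothetical type)."""
    exit_code: int

    @property
    def passed(self) -> bool:
        # Assumption: exit code 0 means the test command succeeded.
        return self.exit_code == 0


def run_tests(command: list[str]) -> RunResult:
    """Run a test command and keep only its exit code; no log parsing."""
    completed = subprocess.run(command, capture_output=True)
    return RunResult(exit_code=completed.returncode)


def is_fail2pass(before_patch: RunResult, after_patch: RunResult) -> bool:
    """True when tests fail before the fix is applied and pass afterwards."""
    return (not before_patch.passed) and after_patch.passed


if __name__ == "__main__":
    # Stand-ins for running the test suite before and after applying a fix.
    before = run_tests([sys.executable, "-c", "raise SystemExit(1)"])
    after = run_tests([sys.executable, "-c", "raise SystemExit(0)"])
    print(is_fail2pass(before, after))  # prints True
```

Because the verdict depends only on the process exit status, the same check applies to any language's test runner without a custom output parser, which is the property the abstract attributes to exit-code-based grading.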

Community

🚀 An automated pipeline for GitHub issue resolution data collection, reducing your manual effort!
😌 Produce reliable and reproducible Docker-based evaluation environments
🤖 Automatic environment construction using the LLM-powered multi-agent system (SWE-Builder)
🙌🏻 Support for multiple programming languages (we have evaluated Python, Java, JS, and TS extensively.)
