SWE-bench (benchmark)

From Systems Analysis Wiki

SWE-bench is a large-scale benchmark (a set of test tasks) for evaluating the capabilities of large language models (LLMs) in the fields of automated software development and debugging[1]. It was developed by a group of researchers from Princeton University and other organizations and presented at the ICLR 2024 conference[2]. SWE-bench differs from traditional code benchmarks by using real-world tasks from development practice: the test set includes 2,294 tasks based on closed issues and their corresponding pull requests from 12 popular open-source Python repositories on GitHub[1][3]. Each task contains a problem description (issue) and provides the model with access to the corresponding project's source code; the model's goal is to generate minimal changes to the codebase (a patch) that fix the specified problem[1][3].

Methodology and Evaluation Features

SWE-bench simulates a real-world software development process. For each task, the model is presented with the text of the original GitHub issue (the problem description) and a snapshot of the repository's code before the fix was applied[4]. The model (or an agent based on the model) is required to analyze the source code, understand the nature of the bug or required change, and apply edits to the relevant code files to resolve the problem[4][5]. Solution validation is automated: each task is associated with the actual unit tests from the pull request that closed the issue. These include both "fail-to-pass" tests (which fail on the original code but should pass after the correct fix is applied) and regression tests (pass-to-pass, which initially pass and must continue to pass after the changes)[3]. The patch proposed by the model is applied to the code, and the corresponding tests are run: if all fail-to-pass tests begin to pass and the pass-to-pass tests are not broken, the task is considered successfully resolved[3]. This evaluation approach allows for checking not only the model's ability to generate syntactically correct code but also its capacity to genuinely solve the given task without breaking existing functionality. This requires the model to operate within a large context (an entire code repository), understand the relationships between components, and coordinate changes across multiple files simultaneously[1]—all of which is significantly more complex than typical tasks like writing a function from a description.
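The pass/fail decision described above can be sketched in a few lines of Python. This is a simplified illustration, not the official SWE-bench harness (which runs each task's tests inside an isolated environment); the function names and test identifiers here are hypothetical.

```python
import subprocess

def run_test(test_id: str) -> bool:
    """Run one test with pytest and report whether it passed.
    Illustrative only: assumes the project uses pytest, as the
    SWE-bench Python repositories do."""
    result = subprocess.run(
        ["python", "-m", "pytest", test_id, "-q"],
        capture_output=True,
    )
    return result.returncode == 0

def is_resolved(fail_to_pass: dict, pass_to_pass: dict) -> bool:
    """A task counts as resolved only if every fail-to-pass test now
    passes AND no pass-to-pass (regression) test has broken."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# Example outcome after applying a candidate patch (invented test names):
f2p = {"tests/test_fix.py::test_issue_1234": True}   # previously failing
p2p = {"tests/test_core.py::test_existing": True}    # must keep passing
print(is_resolved(f2p, p2p))  # → True
```

The key point the sketch captures is the conjunction: a patch that fixes the reported bug but breaks any existing test is scored as a failure.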

SWE-bench evaluations typically involve not just the LLMs themselves, but agentic systems that wrap the model with auxiliary tools (e.g., for file navigation, code execution, using a debugger, etc.)[4][6]. Such a system mimics a real development cycle: the model can sequentially browse files, run tests or scripts, and incrementally improve the solution until it achieves a successful result[4]. It is noteworthy that the effectiveness in solving SWE-bench tasks largely depends on the quality of this "scaffolding" (the agent's infrastructure): the same base models can show different results depending on how the interaction with the repository and tools is organized[4][7]. Thus, SWE-bench serves as a measure of the capabilities of the model and its problem-solving strategy combined, bringing the evaluation closer to the real-world working conditions of an autonomous AI developer[4][7].
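The observe-act cycle of such scaffolding can be illustrated with a toy loop. Everything below is a hypothetical stand-in: `propose_edit` replaces a real LLM call, and `run_tests` replaces a real test harness; no actual agent framework's API is depicted.

```python
def run_tests(code: str) -> bool:
    # Stub oracle: "tests pass" once the off-by-one bug is fixed.
    return "range(len(items))" in code

def propose_edit(code: str, observation: str) -> str:
    # Stand-in for a model call: a real agent would prompt the LLM with
    # the issue text, file contents, and the latest tool output.
    return code.replace("range(len(items) - 1)", "range(len(items))")

def agent_loop(code: str, max_steps: int = 5):
    """Iterate: run tests, feed the outcome back, let the model edit,
    and stop as soon as the tests pass (or the step budget runs out)."""
    for _ in range(max_steps):
        if run_tests(code):
            return code, True
        code = propose_edit(code, "tests failed")
    return code, run_tests(code)

buggy = "for i in range(len(items) - 1): process(items[i])"
fixed, ok = agent_loop(buggy)
print(ok)  # → True
```

The quality-of-scaffolding effect noted above lives in `propose_edit` and the loop policy: with the same base model, better tooling and feedback yield more resolved tasks.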

Task Set Variants

The creators of SWE-bench and the community have subsequently introduced several derivative datasets for various evaluation purposes:

  • SWE-bench Lite — a lightweight version of the benchmark containing ~300 tasks[8], selected to reduce the complexity and computational cost of evaluation. The subset was created for rapid experimentation: it excludes the most resource-intensive cases while remaining representative of the main problem types[7]. In practice, Lite contains simpler, shorter bug-fixing tasks, so model scores on Lite are usually higher than on the full set, since the most difficult cases are excluded[7].
  • SWE-bench Verified — a subset filtered through manual verification, introduced in August 2024 in collaboration with OpenAI[7]. The researchers engaged 93 professional software developers to analyze each task from the original benchmark and excluded cases where the original problem description was too ambiguous or the behavior required by the tests did not clearly follow from the task description[7]. Tasks that were practically unsolvable due to environment issues or incorrect tests were also removed[7]. The result is a set of 500 tasks that are guaranteed to be solvable and correctly formulated[7]. SWE-bench Verified aims to provide a more reliable assessment of model capabilities by eliminating instances where even a correct solution is rejected due to inadequate tests or task specifications[7]. This set has replaced the original SWE-bench test sets (full and Lite) as the primary benchmark for comparing models[7]. Additionally, along with Verified, task difficulty ratings were published (e.g., identifying "easy" tasks solvable by a human in <15 minutes and "hard" tasks requiring >1 hour)[7], and a new Docker-based toolchain was released for more stable and reproducible test runs[7].
  • SWE-bench Multimodal — an extension of the benchmark introduced in January 2025, which includes tasks where the problem description contains not only text but also visual elements (e.g., UI images, error screenshots, etc.)[8]. This dataset (517 tasks[8]) tests the ability of models and agents to understand and use visual information when solving programming tasks. Evaluation on the multimodal set is organized similarly but requires the model to have multi-modal capabilities (e.g., recognizing text in images). The test portion of SWE-bench Multimodal is kept private (hidden) to prevent overfitting solutions to known answers; developers can submit solutions to a remote leaderboard to have their models evaluated on these tasks[2].
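All of these variants share a common per-task record layout. The sketch below shows representative fields of one instance; the field names follow the published dataset schema, but every value is invented for illustration, and the assumption that the test lists are stored as JSON-encoded strings should be checked against the dataset you actually load.

```python
import json

# Simplified SWE-bench instance (all values invented for illustration):
instance = {
    "instance_id": "example__repo-1234",      # hypothetical task ID
    "repo": "example/repo",                   # hypothetical GitHub repo
    "base_commit": "abc123",                  # code snapshot before the fix
    "problem_statement": "Crash when the input list is empty ...",
    # Test lists are commonly serialized as JSON strings:
    "FAIL_TO_PASS": json.dumps(["tests/test_x.py::test_empty_input"]),
    "PASS_TO_PASS": json.dumps(["tests/test_x.py::test_normal_input"]),
}

# Decode the test lists before driving a test runner with them:
fail_to_pass = json.loads(instance["FAIL_TO_PASS"])
print(fail_to_pass[0])  # → tests/test_x.py::test_empty_input
```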

In addition to these main variations, an ecosystem of tools has formed around SWE-bench: SWE-agent, an open-source agent for autonomous software engineering that has demonstrated state-of-the-art results on the benchmark's tasks[2]; SWE-smith, a toolkit for generating software-engineering training data and task instances at scale; SWE-ReX, a runtime interface that lets agents execute commands in sandboxed environments; and others. These projects aim to simplify the reproducibility of results and advance research in the field of autonomous programming systems.

Results and Model Progress

When SWE-bench first appeared, it revealed a significant gap between contemporary LLMs and the skills of experienced programmers. The authors reported that even the most powerful models of 2023 could only handle a few percent of the tasks: for instance, the Claude 2 model from Anthropic successfully solved less than 2% of the tasks in the full set[1]. A model specially trained by the benchmark's authors (a fine-tuned LLaMA variant named SWE-Llama) and proprietary models like GPT-4 could mainly solve only the simplest bugs[1]. These low initial scores underscored the difficulty of SWE-bench and spurred the development of new approaches.

During 2024, as more advanced models and agentic frameworks emerged, results improved significantly. Researchers from Princeton introduced the SWE-agent system, which combines GPT-4 with code search, planning, and other tools; it achieved around 12.5% solved tasks on the full set, setting a new benchmark for academic models[5]. By mid-2024, on the official SWE-bench leaderboard, the top solutions (including proprietary ones) reached about 20% successful resolutions on the full benchmark and up to 43% on the simplified Lite subset[7]. This growth is attributed to improvements in models (e.g., the advent of GPT-4, Claude 2, and 3) and especially to the development of "scaffolding"—external strategies that allow the model to effectively break down a task into steps, read documentation, run debugging sessions, etc.[7].

After the introduction of the Verified set (cleaned of incorrect tasks) in August 2024, measured performance increased further. The GPT-4o model immediately showed about 33% successful resolutions on Verified, compared to ~16% previously on the original set[7]. The best open-source agent frameworks (e.g., Agentless) doubled their results from ~16% to 32% on Verified[7]. This confirmed the hypothesis that the original benchmark somewhat understated performance due to the presence of unsolvable cases[7]. At the same time, the improvement on Verified relative to Lite was not as dramatic (top models had already reached ~43% on Lite), which is logical: Lite initially selected easier examples, while Verified removed impossible tasks but retained difficult ones[7]. It is important to note that the performance increase when moving to Verified occurred across all task difficulty categories, not just through the elimination of the hardest ones; the filtering also removed cases that were unsolvable for non-obvious reasons even among relatively simple tasks[7].

As of early 2025, leading AI systems demonstrate performance approaching human effectiveness on the verified task set, although the 100% ceiling is still distant. In January 2025, Anthropic reported that its new Claude 3.5 Sonnet model, paired with an improved agent, solved 49% of the SWE-bench Verified tasks[4], temporarily taking the top spot. Major tech companies and independent teams are also actively participating in unofficial competitions on this benchmark. For example, the CodeStory team developed a multi-model trial-and-error approach ("Midwit Agent"), which achieved a record 62.2% of tasks solved on Verified (data as of early 2025)[5][9]. It was noted that this required a significant increase in computational resource expenditure during the model's inference phase (so-called inference-time scaling), by running multiple solution attempts and selecting the best result[5]. In turn, OpenAI materials mentioned the experimental o3 model, which, with sufficient computational scaling, reportedly surpassed the 70% threshold on Verified (unofficial data)[5]. However, independent verification of these results is lacking, and such a high figure remains more of a target for future research than an achieved milestone.

According to a 2025 Microsoft Research study, even the latest models equipped with debugging tools still do not surpass the 50% mark for successful bug fixes from SWE-bench Lite[6]. In this test, Claude 3.7 Sonnet performed best with ~48.4% of tasks solved, while a system based on OpenAI's o1 model solved about 30%, and the lighter-weight o3-mini only 22%[6]. These results highlight that, despite rapid progress, current AI systems still lag behind experienced programmers: for a human who understands the code, solving such tasks is not difficult, whereas a model often fails to use debugging tools effectively or suffers from a lack of training data reflecting the multi-step process of bug fixing[6].

Limitations and Future Prospects

SWE-bench has become a standardized platform for evaluating intelligent code agents, but research has also identified some of its limitations. The main problem is the incompleteness of the testing: the set of verification tests for each task is taken from a specific pull request and usually includes only the unit tests that were modified during the bug fix[3]. As an analysis by a group of scientists from Zhejiang University and the University of Stuttgart showed (Wang et al. 2025), ignoring the project's other tests can hide the incorrectness of some solutions[3]. Re-evaluating solutions against the full repository test suite revealed that, on average, 7.8% of patches marked as successful in SWE-bench actually fail other tests in the project[3]. This leads to an overestimation of the "tasks solved" metric by approximately 4-6 percentage points[3]. An even more subtle case is when a generated patch passes all the original tests but is not equivalent to the developer's solution and changes the program's behavior in an unintended way. By generating additional test cases (the PatchDiff methodology), researchers found that nearly 30% of the AI-proposed fixes behave differently from the reference patches, and about 11% are definitively incorrect, although they are not detected by the existing tests[3]. Thus, the real capabilities of the models may be overestimated if one relies solely on passing a limited set of tests. The creators of SWE-bench acknowledge this vulnerability and emphasize that the benchmark must evolve over time: test coverage should be improved, checks for undesirable side effects added, and the variety of task types expanded[7]. The development of such evaluation tools is a crucial part of preparing for the emergence of increasingly autonomous and powerful AI developers, and the experience with SWE-bench shows the necessity of paying close attention to the quality of benchmarks[7].
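The overestimation effect reported by Wang et al. can be illustrated with simple arithmetic: if some fraction of patches marked "resolved" actually fail hidden tests, the corrected resolution rate shrinks proportionally. In the sketch below, only the 7.8% failure fraction comes from the study; the reported leaderboard score is an invented example.

```python
def corrected_rate(reported_rate: float, hidden_failure_frac: float) -> float:
    """Resolution rate after discounting 'resolved' patches that fail
    other tests elsewhere in the repository."""
    return reported_rate * (1.0 - hidden_failure_frac)

reported = 60.0      # example leaderboard score, in percent (invented)
hidden_fail = 0.078  # ~7.8% of 'resolved' patches fail the full suite
corrected = corrected_rate(reported, hidden_fail)
print(round(reported - corrected, 1))  # → 4.7 percentage-point drop,
                                       #   within the ~4-6 pp estimate
```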

While SWE-bench, as a static set of tasks, does not cover all aspects of programming, it has already become the de-facto standard for the comparative analysis of code models[3]. It is used in academic papers to demonstrate new methods and algorithms, as well as by industrial research groups to assess the potential of systems designed to automate programming[3]. The steady growth in SWE-bench results from 2023-2025 clearly demonstrates the rapid improvement in LLM capabilities for solving practical development tasks. At the same time, it serves as a barometer of complexity: even as they approach 50-60% of tasks solved, models are still far from being a complete replacement for humans, especially under conditions of limited information and the need for a nuanced understanding of requirements[4][7]. Nevertheless, progress is ongoing—thanks to initiatives like SWE-bench, the community has a clear view of its goals and limitations and continues to move toward the creation of a fully-fledged AI developer capable of autonomously understanding and fixing software code at the level of a human expert[4][7].

Notes

  1. Jimenez, Carlos E. et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?". arXiv.
  2. "SWE-bench/SWE-bench". GitHub.
  3. Wang, Shuyang et al. "Are 'Solved Issues' in SWE-bench Really Solved Correctly? An Empirical Study". arXiv.
  4. "Claude SWE-Bench Performance". Anthropic.
  5. Jain, Sulbha. "SWE Benchmark: LLM evaluation in Software Engineering Setting". Medium.
  6. Hatmaker, Taylor. "AI models still struggle to debug software, Microsoft study shows". TechCrunch.
  7. "Introducing SWE-bench Verified". OpenAI.
  8. "SWE-bench Leaderboard".
  9. "SOTA on swebench-verified: relearning the bitter lesson". Hacker News (Y Combinator).