SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

August 26, 2024
  • Overview
    GitHub issue resolving is an important task in software engineering and has drawn growing attention from both industry and academia in recent years. For this task, SWE-bench was released to evaluate the issue-resolving capabilities of large language models (LLMs), but it covers only Python repositories. Supporting more programming languages matters, as there is strong demand for it in industry. As a first step toward multilingual support, we developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, together with the corresponding Docker-based evaluation environment and a leaderboard, which will be continuously maintained and updated over the coming months. To verify the reliability of SWE-bench-java, we implemented the classic method SWE-agent and tested several powerful LLMs on it. As is well known, building a high-quality multilingual benchmark takes substantial time and effort, so we welcome contributions via pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
  • Problem Addressed
    SWE-bench-java: A Java Version of SWE-bench for Evaluating Large Language Models in Issue Resolving
  • Key Idea
    Develop a Java version of SWE-bench to evaluate the issue-resolving capabilities of large language models (LLMs) in software engineering, and publicly release the dataset, evaluation environment, and leaderboard as a first step toward multilingual support.
  • Other Highlights
    SWE-bench-java is a valuable contribution because it extends issue-resolving evaluation beyond Python and provides a continuously maintained dataset and evaluation environment. Its reliability is verified by implementing the classic SWE-agent method and testing several powerful LLMs on the benchmark. The paper also welcomes contributions toward further iteration and refinement of the benchmark.
  • Related Work
    SWE-bench has focused only on Python, so this paper is a new contribution to the evaluation of LLMs on issue resolving for Java. There are, however, related studies in the broader field, such as 'Evaluating the Quality of Bug Report Summaries' by Saha et al. and 'Automated Bug Report Assignment: Ensemble-based Approaches' by Xia et al.
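In a SWE-bench-style evaluation such as the one described above, each benchmark instance pairs a GitHub issue with the tests its fix must make pass; a model's patch is applied inside the Docker environment, the tests are run, and the headline metric is the fraction of instances resolved. A minimal sketch of that scoring step (the data layout and test-name format here are illustrative, not the actual SWE-bench-java schema):

```python
def resolved_rate(outcomes):
    """Fraction of benchmark instances that were resolved.

    `outcomes` has one entry per instance: (required_tests, passed_tests),
    where `required_tests` are the tests the patch must fix and
    `passed_tests` are the tests that actually passed after applying the
    model's patch in the evaluation container. An instance counts as
    resolved only if every required test passed.
    """
    if not outcomes:
        return 0.0
    resolved = sum(1 for required, passed in outcomes
                   if set(required) <= set(passed))
    return resolved / len(outcomes)

# Two hypothetical Java instances: one fully fixed, one not.
outcomes = [
    (["NumberUtilsTest#testCreateNumber"],
     {"NumberUtilsTest#testCreateNumber"}),
    (["DateUtilsTest#testParse"], set()),
]
print(resolved_rate(outcomes))  # 0.5
```

The real harness also checks that previously passing tests still pass, so this sketch captures only the core resolved/unresolved decision.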