PyBench: Evaluating LLM Agent on various real-world coding tasks

Yaolun Zhang ,
Yinxu Pan ,
Yudong Wang ,
Jie Cai ,
Zhi Zheng ,
Guoyang Zeng ,
Zhiyuan Liu
2024年07月23日
  • 简介
    LLM Agent配备了一个代码解释器,能够自动解决现实世界中的编码任务,如数据分析和图像编辑。然而,现有的基准主要集中在简单的任务上,例如完成几行代码,或者在存储库级别上进行极其复杂和具体的任务,这两者都不能代表各种日常编码任务。为了填补这一空白,我们介绍了PyBench,这是一个涵盖五个主要类别的现实世界任务的基准,涵盖了10多种文件类型。给定高级用户查询和相关文件,LLM Agent需要通过代码解释器进行一些转换来推理和执行Python代码,然后再做出正式响应以满足用户的需求。成功解决PyBench中的任务需要对各种Python包的全面理解,优秀的推理能力以及将执行的代码的反馈纳入其中的能力。我们的评估表明,目前的开源LLM在这些任务上面临困难。因此,我们对四种数据集进行了分析和实验,证明了PyBench需要综合能力。我们的Fine-tuned 8B大小的模型:PyLlama3在PyBench上取得了令人兴奋的表现,超过了许多33B和70B大小的模型。我们的基准测试、训练数据集和模型都可以在以下网址找到:\href{https://github.com/Mercury7353/PyBench}{https://github.com/Mercury7353/PyBench}。
  • 图表
  • 解决问题
    PyBench: A Comprehensive Benchmark for Real-World Python Code
  • 关键思路
    The paper introduces PyBench, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files, to evaluate the abilities of language model agents in solving various daily coding tasks. The benchmark requires a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code.
  • 其它亮点
    The paper presents PyBench, a comprehensive benchmark for real-world Python code, which includes five categories of tasks and more than 10 types of files. The authors fine-tuned an 8B size model, PyLlama3, for PyBench and achieved exciting performance surpassing many 33B and 70B size models. The benchmark, training dataset, and model are available on Github. The paper also provides detailed analyses of the datasets and experiments, indicating the need for comprehensive abilities to solve PyBench tasks.
  • 相关研究
    Related studies in this field mainly focus on either simplistic tasks or extremely complex and specific tasks at the repository level. The paper provides a new benchmark for evaluating language model agents' abilities in solving various daily coding tasks, which fills the gap in the current benchmarks.
PDF
原文
点赞 收藏 评论 分享到Link

沙发等你来抢

去评论