CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

November 14, 2023
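  • Abstract
    Large language models (LLMs) perform well on coding-related tasks, particularly in assisting human programmers and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capabilities of LLMs have severe limitations. First, most benchmarks are deficient because they focus on a narrow range of popular programming languages and specific tasks, whereas real-world software development requires multilingual programming environments to satisfy diverse requirements. Practical programming also strongly calls for multitask settings to test the coding capabilities of LLMs comprehensively and robustly. Second, most benchmarks fail to consider whether generated code is actually executable and whether its execution results are consistent. To bridge the gap between existing benchmarks and the expectations of practical applications, we introduce CodeScope, an execution-based, multilingual, multitask, multidimensional evaluation benchmark for comprehensively measuring LLM capabilities on coding tasks. CodeScope covers 43 programming languages and 8 coding tasks. It evaluates the coding performance of LLMs along three dimensions (perspectives): difficulty, efficiency, and length. To facilitate execution-based evaluation of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze 8 mainstream LLMs on CodeScope tasks and demonstrate the greater breadth and difficulty of CodeScope, compared with other benchmarks, for evaluating LLMs on code understanding and generation tasks. The CodeScope benchmark and datasets are publicly available at https://github.com/WeixiangYAN/CodeScope.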
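  • Problem addressed
    Existing benchmarks for LLM code understanding and generation focus on a narrow range of popular programming languages and specific tasks, and most do not check whether generated code actually executes or whether its execution results are consistent. CodeScope targets this gap between current benchmarks and real-world multilingual, multitask software development.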
  • Key idea
    The paper introduces CodeScope, an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively gauging LLM capabilities on coding tasks. CodeScope covers 43 programming languages and 8 coding tasks, evaluating the coding performance of LLMs from three dimensions (perspectives): difficulty, efficiency, and length.
  • Other highlights
    The paper proposes a new evaluation benchmark for LLMs on coding tasks, covering a wide range of programming languages and tasks. The benchmark evaluates LLMs from multiple dimensions, including difficulty, efficiency, and length. The paper also introduces MultiCodeEngine, an automated code execution engine that supports 14 programming languages. The benchmark and datasets are publicly available on GitHub. The paper evaluates and analyzes 8 mainstream LLMs on CodeScope tasks and demonstrates the superior breadth and challenges of CodeScope for evaluating LLMs on code understanding and generation tasks compared to other benchmarks.
  • Related work
    Related work includes existing benchmarks for evaluating LLMs on coding tasks, such as CodeXGLUE and CoDescCo, as well as research on LLMs for programming tasks, such as GPT-Coder and CodeBERT.
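
To make "execution-based" evaluation concrete, here is a minimal sketch of the general idea: run a generated program against test inputs and accept it only if every output matches the expected one. This is not MultiCodeEngine's implementation, which the summary above does not describe. The function names `run_candidate` and `passes_all`, the stdin/stdout test-case format, and the 5-second timeout are all assumptions made for illustration, and the sketch handles only Python programs with no sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def run_candidate(source: str, stdin_text: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run a Python program in a subprocess (hypothetical helper).

    Returns the program's stdout, or None if it times out or exits non-zero.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return None
    if proc.returncode != 0:
        return None
    return proc.stdout


def passes_all(source: str, test_cases: List[Tuple[str, str]], timeout_s: float = 5.0) -> bool:
    """Return True only if the program reproduces every expected output.

    Each test case is an assumed (stdin, expected stdout) pair; outputs are
    compared after stripping surrounding whitespace.
    """
    for stdin_text, expected in test_cases:
        actual = run_candidate(source, stdin_text, timeout_s)
        if actual is None or actual.strip() != expected.strip():
            return False
    return True
```

For example, a candidate that reads two integers and prints their sum would pass the cases `[("1 2\n", "3"), ("10 -4\n", "6")]`, while one that prints their difference would fail on the first case. A real multilingual engine would additionally need per-language compile and run commands plus isolation of untrusted code, which this sketch leaves out.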