LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

March 12, 2024
  • Abstract
    We present LiveCodeBench, a holistic and contamination-free benchmark for evaluating the coding capabilities of LLMs. As new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient to assess their capabilities. Our benchmark focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, rather than code generation alone. Currently, LiveCodeBench hosts 400 high-quality coding problems collected from three competition platforms, LeetCode, AtCoder, and CodeForces, released between May 2023 and May 2024. We evaluate 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks, and individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models.
  • Problem Addressed
    Existing code evaluation benchmarks such as HumanEval and MBPP are no longer sufficient to assess newer LLMs and are susceptible to data contamination; the paper addresses this with a continuously updated, contamination-free benchmark covering a broader set of code-related capabilities.
  • Key Idea
    The paper proposes LiveCodeBench, a new evaluation benchmark for Large Language Models (LLMs) applied to code-related applications. It focuses on a broader range of code-related capabilities beyond just code generation, and it sidesteps contamination by continuously collecting new problems over time so that models can be evaluated on problems released after their training data was gathered (a minimal sketch of this time-window filtering appears after this list).
  • Other Highlights
    The benchmark continuously collects new problems over time from contests across three competition platforms and currently hosts four hundred high-quality coding problems. The paper presents empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks, and individual model comparisons. The prompts and model completions will be released for further community analysis, along with a general toolkit for adding new scenarios and models (a hypothetical scenario-registry sketch follows this list).
  • Related Work
    Related work in this area includes the HumanEval and MBPP evaluation benchmarks.
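
As a rough illustration of the contamination-free evaluation idea referenced in the Key Idea item above, the sketch below filters benchmark problems by release date against a model's training-data cutoff. This is a minimal, hypothetical Python example: the Problem class, its field names, and contamination_free_subset are assumptions for illustration, not the released LiveCodeBench toolkit.

```python
# Minimal sketch (not the official LiveCodeBench API): evaluate a model only on
# problems released after its training-data cutoff to avoid contamination.
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Problem:
    title: str
    platform: str        # e.g. "LeetCode", "AtCoder", "CodeForces"
    release_date: date   # date the problem was published in a contest


def contamination_free_subset(problems: List[Problem], model_cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]


# Example: a model trained on data up to 2023-08-31 is scored only on newer problems.
problems = [
    Problem("two-sum-variant", "LeetCode", date(2023, 6, 10)),
    Problem("abc-340-d", "AtCoder", date(2024, 2, 10)),
]
eval_set = contamination_free_subset(problems, model_cutoff=date(2023, 8, 31))
print([p.title for p in eval_set])  # -> ['abc-340-d']
```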
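
The summary also mentions a general toolkit for adding new scenarios and models. The hypothetical sketch below shows one way such a scenario registry could be organized for the capabilities named in the abstract (code generation, self-repair, code execution, test output prediction); register_scenario, run_scenario, and the prompt/grading callables are assumed names, not the official API.

```python
# Hypothetical scenario registry (assumed design, not the released toolkit):
# each scenario supplies a prompt builder and a grader for model completions.
from typing import Callable, Dict, List

SCENARIOS: Dict[str, Dict[str, Callable]] = {}


def register_scenario(name: str,
                      build_prompt: Callable[[dict], str],
                      grade: Callable[[dict, str], bool]) -> None:
    """Register a new evaluation scenario (e.g., self-repair, test output prediction)."""
    SCENARIOS[name] = {"build_prompt": build_prompt, "grade": grade}


def run_scenario(name: str, problems: List[dict], model: Callable[[str], str]) -> float:
    """Run one scenario over a problem set with a model callable; return accuracy."""
    scenario = SCENARIOS[name]
    correct = 0
    for problem in problems:
        completion = model(scenario["build_prompt"](problem))
        correct += bool(scenario["grade"](problem, completion))
    return correct / max(len(problems), 1)


# Example: a toy "test output prediction" scenario.
register_scenario(
    "test_output_prediction",
    build_prompt=lambda p: f"Given the code:\n{p['code']}\nWhat does it print for input {p['input']}?",
    grade=lambda p, out: p["expected_output"] in out,
)
```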