ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

Introduction

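Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC models remains uncertain. Inspired by the rapid development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by manual evaluation of the interactive dialogues. Specifically, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role, which behaves more like a confused person than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotation of the interactive multi-turn dialogues of the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but still fall short of human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which is trained on the annotated data and achieves a scoring performance surpassing GPT-4 by 35 points. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.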
Problem Addressed

ESC-Eval: An Evaluation Framework for Emotion Support Conversation Models Using Role-Playing Agents
Key Idea

The paper proposes an evaluation framework for Emotion Support Conversation (ESC) models using role-playing agents and manual evaluation of interactive dialogues. The framework aims to address the uncertainty in evaluating Large Language Models (LLMs) used as ESC models.
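The interaction step described above can be sketched as a simple turn-taking loop. This is a minimal illustration only, not the paper's implementation: `RoleCard`, `simulate_dialogue`, and the two stub responders are hypothetical names introduced here, and in practice the responders would wrap calls to a role-playing model (such as ESC-Role) and to the ESC model under evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A dialogue is a list of messages; each message records who spoke and what was said.
Message = Dict[str, str]


@dataclass
class RoleCard:
    """Hypothetical role card: a short description of the help-seeker's situation."""
    description: str


def simulate_dialogue(
    card: RoleCard,
    seeker: Callable[[RoleCard, List[Message]], str],
    supporter: Callable[[List[Message]], str],
    max_turns: int = 3,
) -> List[Message]:
    """Alternate seeker and supporter turns and return the full transcript.

    The seeker (role-playing agent) sees the role card plus the history;
    the supporter (ESC model under evaluation) sees only the history.
    """
    history: List[Message] = []
    for _ in range(max_turns):
        history.append({"speaker": "seeker", "text": seeker(card, history)})
        history.append({"speaker": "supporter", "text": supporter(history)})
    return history


# Stub responders standing in for real model calls (assumptions for illustration).
def stub_seeker(card: RoleCard, history: List[Message]) -> str:
    turn = len(history) // 2 + 1
    return f"[seeker turn {turn}] {card.description}"


def stub_supporter(history: List[Message]) -> str:
    return "That sounds hard. Can you tell me more?"


if __name__ == "__main__":
    transcript = simulate_dialogue(
        RoleCard("I failed an exam"), stub_seeker, stub_supporter, max_turns=2
    )
    for message in transcript:
        print(f'{message["speaker"]}: {message["text"]}')
```

The resulting transcript is what would then be handed to human annotators, or to an automatic scorer, for rating.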
Other Highlights

The paper re-organizes 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent, trains a dedicated role-playing model called ESC-Role, and runs experiments with 14 LLMs as the ESC models. The results show that ESC-oriented LLMs perform better than general AI-assistant LLMs but still lag behind human performance. The paper also introduces ESC-RANK, a scoring model trained on the annotated data to automate the evaluation of future ESC models; its scoring performance surpasses that of GPT-4 by 35 points. The data and code are available on GitHub.
Related Work

Related studies in this field include 'A Survey on Emotion Recognition using Speech Processing Techniques' and 'Emotion Detection in Conversations Using Deep Learning: A Review'.
