- Introduction: With the remarkable progress of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a lack of available benchmarks for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) the inefficiency of UI-only operations limits task evaluation; (2) specific instructions within a single app are insufficient to evaluate the multi-dimensional reasoning and decision-making capabilities of LLM mobile agents; (3) current evaluation metrics are inadequate for accurately assessing the process of sequential actions. Therefore, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by integrating 103 collected APIs to improve the efficiency of task completion. Next, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capability in mobile agents, our data is divided into three distinct groups, SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. In addition, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
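As a rough illustration of the idea of expanding UI operations with API calls, the sketch below shows how an agent step might be represented as either a low-level UI action or a direct API call. All names (`UIAction`, `APICall`, `search_music`) are hypothetical and not the paper's actual interface; the point is only that one API call can replace several raw UI steps, which is why API integration speeds up task completion.

```python
from dataclasses import dataclass, field
from typing import Union

# Hypothetical sketch (not the benchmark's real interface): an agent step is
# either a low-level UI operation or a structured API call.

@dataclass
class UIAction:
    kind: str          # e.g. "tap", "swipe", "type"
    target: str        # UI element identifier
    text: str = ""     # optional text for "type" actions

@dataclass
class APICall:
    name: str                      # illustrative API name only
    args: dict = field(default_factory=dict)

AgentAction = Union[UIAction, APICall]

def describe(action: AgentAction) -> str:
    """Render an action as a log line for inspection."""
    if isinstance(action, UIAction):
        return f"UI  | {action.kind} on {action.target} {action.text}".rstrip()
    return f"API | {action.name}({action.args})"

# A task needing three UI steps might collapse into a single API call:
ui_trace = [
    UIAction("tap", "search_box"),
    UIAction("type", "search_box", "jazz playlist"),
    UIAction("tap", "search_button"),
]
api_trace = [APICall("search_music", {"query": "jazz playlist"})]

for a in ui_trace + api_trace:
    print(describe(a))
```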
- Problem addressed: The paper, "Mobile-Bench: A Benchmark for Evaluating Large Language Model-Based Mobile Agents," addresses the lack of a benchmark for evaluating LLM-based mobile agents.
- Key idea: Mobile-Bench proposes a novel benchmark for evaluating the capabilities of LLM-based mobile agents by expanding conventional UI operations with collected APIs, collecting evaluation data from real user queries augmented by LLMs, categorizing the data into three groups by task complexity, and introducing a more accurate evaluation metric named CheckPoint.
- Other highlights: Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. The evaluation data is collected by combining real user queries with augmentation from LLMs. A more accurate evaluation metric, named CheckPoint, is introduced to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
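To make the CheckPoint idea concrete, here is a minimal sketch of how a checkpoint-style metric could score an agent's action trace by checking whether the required key points are reached in order. The scoring rule, trace format, and example checkpoints below are assumptions for illustration, not the paper's exact definition.

```python
def checkpoint_score(trace, checkpoints):
    """Fraction of required checkpoints hit, in order, by an action trace.

    `trace` is a list of strings describing the agent's executed actions;
    `checkpoints` is the ordered list of key points a correct solution must
    pass through. This is only a rough sketch of a CheckPoint-style metric,
    not the benchmark's actual scoring rule.
    """
    hit = 0
    pos = 0
    for cp in checkpoints:
        # Scan forward from the last match so checkpoints must occur in order.
        for i in range(pos, len(trace)):
            if cp in trace[i]:
                hit += 1
                pos = i + 1
                break
    return hit / len(checkpoints) if checkpoints else 1.0

# Example: a multi-APP task whose key points are opening the calendar,
# creating an event, and then sharing it through a messaging app.
trace = [
    "open app:calendar",
    "api:create_event(title='standup', time='9:00')",
    "open app:messenger",
    "api:share(event_id=1, to='team')",
]
checkpoints = ["open app:calendar", "create_event", "share"]
print(checkpoint_score(trace, checkpoints))  # 1.0 -> all key points reached
```

Compared with a binary success flag, a coverage score like this gives partial credit for sequential actions and makes it visible where a planning trajectory went off track.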
- Related research: Recent related studies in this field include "BERT Can See Clearly Now: Comparing Variants of the Transformer Encoder" and "RoBERTa: A Robustly Optimized BERT Pretraining Approach".