- Introduction: With the remarkable progress of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a lack of available benchmarks for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) the inefficiency of UI-only operations limits task evaluation; (2) specific instructions within a single app are insufficient to evaluate the multi-dimensional reasoning and decision-making capabilities of LLM mobile agents; (3) current evaluation metrics are inadequate for accurately assessing the process of sequential actions. Therefore, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by integrating 103 collected APIs to improve the efficiency of task completion. Next, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capability in mobile agents, our data is divided into three distinct groups, SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. In addition, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
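As a rough illustration of the idea of expanding UI operations with API calls, the sketch below shows how an agent step might be represented as either a low-level UI action or a direct API call. All names (`UIAction`, `APICall`, `search_music`) are hypothetical and not the paper's actual interface; the point is only that one API call can replace several raw UI steps, which is why API integration speeds up task completion.

```python
from dataclasses import dataclass, field
from typing import Union

# Hypothetical sketch (not the benchmark's real interface): an agent step is
# either a low-level UI operation or a structured API call.

@dataclass
class UIAction:
    kind: str          # e.g. "tap", "swipe", "type"
    target: str        # UI element identifier
    text: str = ""     # optional text for "type" actions

@dataclass
class APICall:
    name: str                      # illustrative API name only
    args: dict = field(default_factory=dict)

AgentAction = Union[UIAction, APICall]

def describe(action: AgentAction) -> str:
    """Render an action as a log line for inspection."""
    if isinstance(action, UIAction):
        return f"UI  | {action.kind} on {action.target} {action.text}".rstrip()
    return f"API | {action.name}({action.args})"

# A task needing three UI steps might collapse into a single API call:
ui_trace = [
    UIAction("tap", "search_box"),
    UIAction("type", "search_box", "jazz playlist"),
    UIAction("tap", "search_button"),
]
api_trace = [APICall("search_music", {"query": "jazz playlist"})]

for a in ui_trace + api_trace:
    print(describe(a))
```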
- Problem addressed: The paper, "Mobile-Bench: A Benchmark for Evaluating Large Language Model-Based Mobile Agents," addresses the lack of a benchmark for evaluating LLM-based mobile agents.
- Key idea: Mobile-Bench proposes a novel benchmark for evaluating the capabilities of LLM-based mobile agents by expanding conventional UI operations with collected APIs, collecting evaluation data from real user queries augmented by LLMs, categorizing the data into three groups by task complexity, and introducing a more accurate evaluation metric named CheckPoint.
- Other highlights: Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. The evaluation data is collected by combining real user queries with augmentation from LLMs. A more accurate evaluation metric, named CheckPoint, is introduced to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
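To make the CheckPoint idea concrete, here is a minimal sketch of how a checkpoint-style metric could score an agent's action trace by checking whether the required key points are reached in order. The scoring rule, trace format, and example checkpoints below are assumptions for illustration, not the paper's exact definition.

```python
def checkpoint_score(trace, checkpoints):
    """Fraction of required checkpoints hit, in order, by an action trace.

    `trace` is a list of strings describing the agent's executed actions;
    `checkpoints` is the ordered list of key points a correct solution must
    pass through. This is only a rough sketch of a CheckPoint-style metric,
    not the benchmark's actual scoring rule.
    """
    hit = 0
    pos = 0
    for cp in checkpoints:
        # Scan forward from the last match so checkpoints must occur in order.
        for i in range(pos, len(trace)):
            if cp in trace[i]:
                hit += 1
                pos = i + 1
                break
    return hit / len(checkpoints) if checkpoints else 1.0

# Example: a multi-APP task whose key points are opening the calendar,
# creating an event, and then sharing it through a messaging app.
trace = [
    "open app:calendar",
    "api:create_event(title='standup', time='9:00')",
    "open app:messenger",
    "api:share(event_id=1, to='team')",
]
checkpoints = ["open app:calendar", "create_event", "share"]
print(checkpoint_score(trace, checkpoints))  # 1.0 -> all key points reached
```

Compared with a binary success flag, a coverage score like this gives partial credit for sequential actions and makes it visible where a planning trajectory went off track.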
- Related research: Recent related studies in this field include "BERT Can See Clearly Now: Comparing Variants of the Transformer Encoder" and "RoBERTa: A Robustly Optimized BERT Pretraining Approach".