Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

June 3, 2024
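  • Introduction
    Mobile device operation tasks are becoming an increasingly popular multimodal AI application scenario. Current multimodal large language models (MLLMs), constrained by their training data, lack the ability to act effectively as operation assistants. Instead, MLLM-based agents with enhanced capabilities are gradually being applied to this scenario. However, under the single-agent architecture of existing work, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, become considerably harder, because overly long token sequences and the interleaved text-image data format limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent generates the task progress, making navigation over the operation history more efficient. To retain focus content, we design a memory unit that is updated along with the task progress. In addition, to correct erroneous operations, the reflection agent observes the outcome of each operation and handles any errors accordingly. Experimental results show that Mobile-Agent-v2 improves task completion by more than 30% over the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.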
  • Problem addressed
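    Mobile-Agent-v2 is a multi-agent architecture for mobile device operation assistance. It targets two navigation challenges that single-agent architectures handle poorly: task progress navigation and focus content navigation.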
  • Key idea
    The paper proposes a multi-agent architecture, consisting of a planning agent, a decision agent, and a reflection agent, to address the navigation challenges in mobile device operation tasks. The architecture improves task completion by more than 30% compared with the single-agent architecture of Mobile-Agent.
  • Other highlights
    The multi-agent architecture addresses the two major navigation challenges in mobile device operation tasks: task progress navigation and focus content navigation. A memory unit, updated along with the task progress, retains focus content, and the reflection agent observes the outcome of each operation to correct erroneous ones. Experiments show an improvement of more than 30% in task completion over Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
  • Related work
    Recent related studies in this field include 'Multi-modal Large Language Models for Mobile Device Operation Assistance' and 'Enhancing Mobile Device Operation Assistance through Tool Invocation'.
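
The planning, decision, and reflection loop summarized in the sections above can be illustrated with a small sketch. This is not the code from the X-PLUG/MobileAgent repository: the names `TaskState`, `run_episode`, `ToyPhone`, and the toy agent functions are all hypothetical, and the agents are plain Python stubs rather than MLLM calls. It only shows how a text-only task progress, a memory unit, and a per-operation reflection check could fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskState:
    instruction: str
    progress: str = ""                                 # text summary from the planning agent
    memory: List[str] = field(default_factory=list)    # memory unit: retained focus content
    history: List[str] = field(default_factory=list)   # operations accepted by reflection
    errors: int = 0                                    # operations rejected by reflection


def run_episode(
    instruction: str,
    plan: Callable[[TaskState], str],
    decide: Callable[[TaskState, str], Tuple[str, Optional[str]]],
    reflect: Callable[[str, str, str], bool],
    execute: Callable[[str], None],
    observe: Callable[[], str],
    max_steps: int = 10,
) -> TaskState:
    state = TaskState(instruction)
    for _ in range(max_steps):
        # Planning agent: condense the operation history into a short progress text.
        state.progress = plan(state)
        before = observe()
        # Decision agent: choose an operation and, optionally, content worth remembering.
        action, note = decide(state, before)
        if action == "STOP":
            break
        execute(action)
        after = observe()
        # Reflection agent: compare the screens before and after the operation.
        if reflect(before, after, action):
            state.history.append(action)
            if note is not None:
                state.memory.append(note)
        else:
            state.errors += 1  # rejected operation is left out of the history
    return state


class ToyPhone:
    """Hypothetical stand-in for a device: it advances only on the expected action."""
    GOAL = ["open_app", "type_query", "tap_result"]

    def __init__(self) -> None:
        self.done = 0

    def observe(self) -> str:
        return f"screen-{self.done}"

    def execute(self, action: str) -> None:
        if self.done < len(self.GOAL) and action == self.GOAL[self.done]:
            self.done += 1


def toy_plan(state: TaskState) -> str:
    return "; ".join(state.history) or "nothing done yet"


def make_toy_decide() -> Callable[[TaskState, str], Tuple[str, Optional[str]]]:
    calls = {"n": 0}

    def decide(state: TaskState, screen: str) -> Tuple[str, Optional[str]]:
        i = calls["n"]
        calls["n"] += 1
        step = len(state.history)
        if step >= len(ToyPhone.GOAL):
            return "STOP", None
        if i == 1:  # assumption for the demo: the second decision is a mis-tap
            return "tap_wrong", None
        return ToyPhone.GOAL[step], f"saw {screen}"

    return decide


def toy_reflect(before: str, after: str, action: str) -> bool:
    # Toy rule (an assumption): an operation counts as correct if the screen changed.
    return before != after


phone = ToyPhone()
final = run_episode("find an item", toy_plan, make_toy_decide(), toy_reflect,
                    phone.execute, phone.observe)
print(final.progress, final.memory, final.errors)
```

In this toy run the mis-tap leaves the screen unchanged, so the reflection check rejects it and the decision stub retries the same step. Keeping rejected operations out of `history` is a design choice of the sketch, made so that the planning stub summarizes only operations that actually advanced the task.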