JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

November 10, 2023
  • Abstract
    Achieving human-like planning and control with multimodal observations in an open world is a key milestone toward more functional generalist agents. Existing approaches can handle certain long-horizon tasks, but they still struggle in open worlds where the number of tasks is potentially infinite, and they lack the ability to progressively improve their task completion over time. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging Minecraft universe. Specifically, we develop JARVIS-1 on top of a pre-trained multimodal language model that maps visual observations and textual instructions to plans; the plans are ultimately dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and actual game-survival experience. In our experiments, JARVIS-1 exhibits nearly perfect performance across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 achieves a completion rate of 12.5% on the long-horizon diamond pickaxe task, a significant improvement of more than 5x over the previous record. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to its multimodal memory, sparking more general intelligence and improved autonomy. The project page is available at https://craftjarvis-jarvis1.github.io.
  • Problem Addressed
    Developing an open-world agent that can handle a potentially unbounded set of tasks and progressively improve its task-completion ability in the Minecraft universe.
  • Key Idea
    JARVIS-1 is an open-world agent that perceives multimodal input (visual observations and human instructions), generates plans with a pre-trained multimodal language model, and performs embodied control through goal-conditioned controllers; a minimal sketch of this plan-then-control pipeline is given after this list.
  • Other Highlights
    JARVIS-1 achieves nearly perfect performance across over 200 varying tasks from the Minecraft Universe Benchmark, including a 12.5% completion rate on the long-horizon diamond pickaxe task, roughly a 5x improvement over the previous record. JARVIS-1 can self-improve following a life-long learning paradigm thanks to its multimodal memory; a sketch of such a memory interface follows the pipeline sketch below. The project page is available at https://craftjarvis-jarvis1.github.io.
  • Related Work
    Recent related work includes 'Learning to Navigate the Web' by Misra et al., 'Learning to Learn by Gradient Descent by Gradient Descent' by Andrychowicz et al., and 'Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model' by Silver et al.
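
As context for the Key Idea above, here is a minimal Python sketch of the described plan-then-control pipeline: a multimodal language model proposes a sub-goal sequence from the visual observation and the instruction, and a goal-conditioned controller executes each sub-goal, with re-planning on failure. All names here (Observation, MultimodalLM, GoalConditionedController, run_task) are hypothetical placeholders for illustration, not the JARVIS-1 implementation.

```python
# Sketch of the plan-then-control loop: planner proposes sub-goals,
# controller executes them, and the agent re-plans if execution fails.
from dataclasses import dataclass


@dataclass
class Observation:
    frame: bytes       # raw visual observation from the game client
    instruction: str   # human instruction, e.g. "craft a diamond pickaxe"


class MultimodalLM:
    """Stand-in for the pre-trained multimodal language model (planner)."""

    def plan(self, obs: Observation) -> list[str]:
        # Maps the visual observation plus the textual instruction to an
        # ordered list of sub-goals, e.g. ["log", "planks", "crafting_table"].
        raise NotImplementedError


class GoalConditionedController:
    """Stand-in for the low-level controller that pursues one sub-goal."""

    def execute(self, goal: str) -> bool:
        # Returns True if the sub-goal was achieved within its step budget.
        raise NotImplementedError


def run_task(obs: Observation, planner: MultimodalLM,
             controller: GoalConditionedController,
             max_replans: int = 3) -> bool:
    """Plan, dispatch sub-goals to the controller, and re-plan on failure."""
    for _ in range(max_replans):
        plan = planner.plan(obs)
        # all() short-circuits on the first failed sub-goal, triggering a re-plan.
        if all(controller.execute(goal) for goal in plan):
            return True
    return False   # task not completed within the re-planning budget
```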
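
The self-improvement loop highlighted above relies on the multimodal memory. The sketch below assumes a deliberately simple scheme, in which only successful plans are stored, keyed by task, and the most recent entries are retrieved as in-context examples for the planner; the actual JARVIS-1 memory may be keyed, scored, and summarized differently.

```python
# Sketch of a memory that stores successful experiences and retrieves
# them later as planning hints, enabling gradual self-improvement.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Experience:
    task: str            # e.g. "craft a diamond pickaxe"
    situation: str       # short text summary of the visual observation
    plan: list[str]      # the sub-goal sequence that was executed
    success: bool        # whether the plan completed the task


class MultimodalMemory:
    def __init__(self, top_k: int = 3):
        self.top_k = top_k
        self._store: dict[str, list[Experience]] = defaultdict(list)

    def store(self, exp: Experience) -> None:
        # Only successful trajectories are kept as reusable planning hints.
        if exp.success:
            self._store[exp.task].append(exp)

    def retrieve(self, task: str) -> list[Experience]:
        # Return up to top_k most recent successful experiences for this task;
        # they can be appended to the planner's prompt as in-context examples.
        return self._store[task][-self.top_k:]
```

In this sketch, each completed episode is written back with store(), so later calls to retrieve() give the planner progressively richer examples for the same task, which is one simple way to realize the life-long learning behavior described in the highlights.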