Authors: D. Driess, F. Xia, M. S. M. Sajjadi, et al.
[Robotics at Google & Google Research]

Key points:

  1. PaLM-E is an embodied language model that incorporates real-world sensor modalities into a language model, establishing the link between words and percepts;
  2. PaLM-E achieves state-of-the-art performance on a range of embodied reasoning tasks, including sequential robotic manipulation planning, visual question answering and captioning, as well as general vision-language tasks;
  3. The novel architectural ideas used in PaLM-E, such as neural scene representations and entity-labeled multimodal tokens, are particularly effective for ingesting multimodal information (see the sketch after this list);
  4. Scaling up the language model markedly reduces catastrophic forgetting as the model becomes an embodied agent, and enables PaLM-E to exhibit emergent capabilities such as multimodal chain-of-thought reasoning and reasoning over multiple images.
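The multimodal-token idea in point 3 can be illustrated with a short sketch. The following is a hypothetical, minimal rendering of the concept, not PaLM-E's actual code: names such as `ObservationEncoder`, `IMG_PLACEHOLDER_ID`, and `build_multimodal_sentence` are assumptions, and the dimensions are illustrative. A continuous observation is projected into the language model's word-embedding space and spliced into the token-embedding sequence at a placeholder position:

```python
# Hypothetical sketch (not PaLM-E's actual code) of a "multimodal sentence":
# a continuous observation is encoded into vectors in the LLM's word-embedding
# space and spliced into the token-embedding sequence at a placeholder.
import torch
import torch.nn as nn

EMBED_DIM = 512          # assumed LLM embedding width (illustrative only)
IMG_PLACEHOLDER_ID = -1  # assumed sentinel id marking where the image goes

class ObservationEncoder(nn.Module):
    """Maps one continuous observation (e.g. image features) to k embeddings."""
    def __init__(self, obs_dim: int, k_tokens: int = 4):
        super().__init__()
        self.k = k_tokens
        self.proj = nn.Linear(obs_dim, k_tokens * EMBED_DIM)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.proj(obs).view(self.k, EMBED_DIM)  # (k, EMBED_DIM)

def build_multimodal_sentence(token_ids, obs, word_emb, encoder):
    """Interleave word embeddings with encoder outputs at the placeholder."""
    pieces = []
    for tid in token_ids:
        if tid == IMG_PLACEHOLDER_ID:
            pieces.append(encoder(obs))                   # observation tokens
        else:
            pieces.append(word_emb(torch.tensor([tid])))  # one word token
    return torch.cat(pieces, dim=0)  # (seq_len, EMBED_DIM), fed to the LLM

# Toy usage: "Given <img> what should the robot do?" with one placeholder.
word_emb = nn.Embedding(1000, EMBED_DIM)
encoder = ObservationEncoder(obs_dim=2048)
ids = [5, IMG_PLACEHOLDER_ID, 17, 42, 99]  # toy ids around the placeholder
seq = build_multimodal_sentence(ids, torch.randn(2048), word_emb, encoder)
print(seq.shape)  # torch.Size([8, 512]): 4 word tokens + 4 observation tokens
```

Once interleaved this way, the sequence looks like ordinary token embeddings to the language model, which is what lets a pretrained LLM consume images and state estimates without architectural surgery.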

Summary:
PaLM-E is an embodied language model that incorporates real-world sensor modalities for grounded language reasoning and exhibits positive transfer across a variety of tasks and embodiments.

https://palm-e.github.io/assets/palm-e.pdf

Large language models have been demonstrated to perform complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pretrained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
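The "trained end-to-end" part of the abstract can likewise be sketched: the language model's standard next-token cross-entropy loss is backpropagated through the observation encoder, so the encoder learns representations the (possibly frozen) LLM can consume. This is a hedged sketch reusing the hypothetical `build_multimodal_sentence` helper from the earlier block, plus an assumed `llm` callable returning per-position vocabulary logits; it is not PaLM-E's actual training code:

```python
import torch
import torch.nn.functional as F

def training_step(llm, encoder, word_emb, token_ids, obs, target_ids, opt):
    """One hypothetical end-to-end update: loss on text, gradients to encoder."""
    # Build the interleaved embedding sequence (see the earlier sketch).
    inputs = build_multimodal_sentence(token_ids, obs, word_emb, encoder)
    logits = llm(inputs.unsqueeze(0))  # assumed shape: (1, seq_len, vocab)
    # Score only the textual targets (e.g. the plan steps or the answer);
    # the spliced-in observation tokens themselves are never predicted.
    loss = F.cross_entropy(
        logits[0, -len(target_ids):], torch.tensor(target_ids)
    )
    loss.backward()  # gradients flow into the encoder (and LLM, unless frozen)
    opt.step()
    opt.zero_grad()
    return loss.item()
```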
