From today's 爱可可 AI frontier picks
[LG] PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav
R Ramrakhya, D Batra, E Wijmans, A Das
[Georgia Institute of Technology & Meta AI]
PIRLNav: imitation learning pretraining plus reinforcement learning finetuning for ObjectNav
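Key points:
- Proposes a two-stage learning scheme for ObjectGoal navigation that combines imitation learning (IL) pretraining on human demonstrations with reinforcement learning (RL) finetuning;
- PIRLNav achieves state-of-the-art results on ObjectNav with a 65% success rate, 5 percentage points (absolute) above the previous best;
- Presents a thorough empirical study of how the choice of demonstration dataset for IL pretraining affects downstream RL finetuning, showing that IL and IL→RL on human demonstrations outperform IL and IL→RL on shortest-path and frontier-exploration trajectories.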
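The headline gain is an absolute one, which is easy to misread as a relative figure. A minimal arithmetic sketch, assuming the 60.0% baseline quoted in the abstract (the helper name `improvement` is purely illustrative):

```python
def improvement(previous, current):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = current - previous
    relative = absolute / previous * 100.0
    return absolute, relative


# Success rates from the abstract: 60.0% (previous best) and 65.0% (PIRLNav).
absolute, relative = improvement(60.0, 65.0)
print(f"absolute: +{absolute:.1f} points, relative: +{relative:.1f}%")
```

So a +5.0 point absolute gain corresponds to roughly an 8.3% relative improvement over the previous best.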
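One-sentence summary:
Proposes PIRLNav, which combines imitation learning and reinforcement learning for ObjectNav to reach a 65% success rate, 5 percentage points above the previous best; empirically studies how different demonstration datasets for IL pretraining affect downstream RL finetuning; and analyzes the failure modes of the best ObjectNav agent.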
Abstract:
We study ObjectGoal Navigation - where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) on a dataset of human demonstrations achieves promising results. However, this has limitations − 1) IL policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present a two-stage learning scheme for IL pretraining on human demonstrations followed by RL-finetuning. This leads to a PIRLNav policy that advances the state-of-the-art on ObjectNav from 60.0% success rate to 65.0% (+5.0% absolute). Using this IL→RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with 'free' (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that IL→RL on human demonstrations outperforms IL→RL on SP and FE trajectories, even when controlled for the same IL-pretraining success on TRAIN, and even on a subset of VAL episodes where IL-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the IL pretraining dataset. We find that as we increase the size of the IL-pretraining dataset and get to high IL accuracies, the improvements from RL-finetuning are smaller, and that 90% of the performance of our best IL→RL policy can be achieved with less than half the number of IL demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them.
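The two-stage IL→RL recipe can be illustrated with a deliberately tiny toy. This is not the paper's implementation: it assumes a tabular softmax policy over three hypothetical states with two actions each, one-step rewards, and an exact (non-sampled) policy gradient in place of a real RL algorithm. Demonstrations cover only two of the three states, so imitation alone leaves the third state at a uniform policy, and reward-driven finetuning is what fixes it.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def il_pretrain(logits, demos, lr=0.5, epochs=50):
    """Stage 1, behavior cloning: ascend the log-likelihood of demo actions."""
    for _ in range(epochs):
        for state, action in demos:
            probs = softmax(logits[state])
            for a in range(len(probs)):
                target = 1.0 if a == action else 0.0
                logits[state][a] += lr * (target - probs[a])
    return logits


def rl_finetune(logits, rewards, lr=1.0, steps=200):
    """Stage 2, exact policy gradient on the expected one-step reward."""
    for _ in range(steps):
        for state, r in enumerate(rewards):
            probs = softmax(logits[state])
            expected = sum(p * ri for p, ri in zip(probs, r))
            for a in range(len(probs)):
                # d/d(logit_a) of sum_b p_b * r_b equals p_a * (r_a - expected)
                logits[state][a] += lr * probs[a] * (r[a] - expected)
    return logits


def expected_success(logits, rewards):
    """Mean expected reward over states (a stand-in for success rate)."""
    per_state = [
        sum(p * ri for p, ri in zip(softmax(l), r))
        for l, r in zip(logits, rewards)
    ]
    return sum(per_state) / len(per_state)


def run_demo():
    # Hypothetical toy setup: rewards[state][action]; state 2 has no demos.
    rewards = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    demos = [(0, 1), (1, 1)]
    logits = [[0.0, 0.0] for _ in rewards]

    il_pretrain(logits, demos)
    il_success = expected_success(logits, rewards)

    rl_finetune(logits, rewards)
    rl_success = expected_success(logits, rewards)
    return il_success, rl_success


if __name__ == "__main__":
    il_success, rl_success = run_demo()
    print(f"after IL: {il_success:.3f}, after IL->RL: {rl_success:.3f}")
```

In this toy the IL stage gives the policy a strong start on the demonstrated states, while the RL stage improves the state the demonstrations never visited. That mirrors, in a much simplified form, the motivation the abstract gives for combining the two.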
Paper link: https://arxiv.org/abs/2301.07302


