LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
Reposted from 爱可可爱生活
1、[LG] Offline Reinforcement Learning as One Big Sequence Modeling Problem
M Janner, Q Li, S Levine
[UC Berkeley]
Offline reinforcement learning viewed as one big sequence modeling problem. Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, using the Markov property to factorize the problem in time. However, RL can also be viewed as a generic sequence modeling problem, whose goal is to produce a sequence of actions that leads to a sequence of high rewards. Viewed this way, it is natural to ask whether high-capacity sequence prediction models that work well in other domains, such as natural language processing, can also provide effective solutions for RL. This paper explores how RL problems can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions and dispenses with many of the components common in offline RL algorithms. The paper demonstrates the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL, and shows that it can be combined with existing model-free algorithms to yield a state-of-the-art planner on sparse-reward, long-horizon tasks.
Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
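To make the "beam search as planner" idea concrete, here is a minimal, hypothetical PyTorch sketch. It assumes states, actions, and rewards have been discretized into a shared token vocabulary; `seq_model` stands in for a trained autoregressive Transformer, and all names are illustrative rather than taken from the authors' code:

```python
import torch

def beam_plan(seq_model, prefix_tokens, horizon, beam_width):
    """Repurpose beam search as a planner: keep the `beam_width` candidate
    sequences with the best accumulated score. Here the score is token
    log-probability; for offline RL the paper instead scores candidates
    by their predicted return-to-go."""
    beams = [(prefix_tokens, 0.0)]              # (token sequence, accumulated score)
    for _ in range(horizon):
        candidates = []
        for tokens, score in beams:
            logits = seq_model(tokens.unsqueeze(0))[0, -1]   # next-token logits
            logp = torch.log_softmax(logits, dim=-1)
            top_vals, top_idx = logp.topk(beam_width)
            for v, i in zip(top_vals, top_idx):
                candidates.append((torch.cat([tokens, i.view(1)]), score + v.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]   # best sequence; its action tokens would be decoded and executed

# Toy stand-in for a trained Transformer: random logits over a 32-token vocabulary.
stub = lambda toks: torch.randn(1, toks.shape[-1], 32)
plan = beam_plan(stub, torch.zeros(4, dtype=torch.long), horizon=6, beam_width=3)
print(plan)
```

In the actual method, the best beam would be re-decoded into actions and planning repeated in receding-horizon fashion; the stub here only illustrates the search mechanics.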
2、[LG] Equivariant Subgraph Aggregation Networks
B Bevilacqua, F Frasca, D Lim, B Srinivasan, C Cai, G Balamurugan, M M. Bronstein, H Maron
[Purdue University & Imperial College London & MIT CSAIL & UCSD CSE & University of Tuebingen & NVIDIA Research]
Equivariant Subgraph Aggregation Networks. Message-passing neural networks (MPNNs) are the leading architecture for deep learning on graph-structured data, largely thanks to their simplicity and scalability. Unfortunately, these architectures have been shown to be limited in expressive power. This paper proposes a new framework, Equivariant Subgraph Aggregation Networks (ESAN), to address this issue. The key observation is that while two graphs may not be distinguishable by an MPNN, they often contain distinguishable subgraphs. The paper therefore proposes to represent each graph as a set of subgraphs derived by some predefined policy, and to process it with a suitable equivariant architecture. It develops new variants of the 1-dimensional Weisfeiler-Leman (1-WL) test for graph isomorphism, proves lower bounds on the expressiveness of ESAN in terms of these new WL variants, and shows that the approach increases the expressive power of both MPNNs and more expressive architectures. Theoretical results describe how design choices such as the subgraph selection policy and the equivariant neural architecture affect expressive power. To handle the increased computational cost, a subgraph sampling scheme is proposed, which can be viewed as a stochastic version of the framework. Comprehensive experiments on real and synthetic datasets show that the framework improves the expressive power and overall performance of popular GNN architectures.
Message-passing neural networks (MPNNs) are the leading architecture for deep learning on graph-structured data, in large part due to their simplicity and scalability. Unfortunately, it was shown that these architectures are limited in their expressive power. This paper proposes a novel framework called Equivariant Subgraph Aggregation Networks (ESAN) to address this issue. Our main observation is that while two graphs may not be distinguishable by an MPNN, they often contain distinguishable subgraphs. Thus, we propose to represent each graph as a set of subgraphs derived by some predefined policy, and to process it using a suitable equivariant architecture. We develop novel variants of the 1-dimensional Weisfeiler-Leman (1-WL) test for graph isomorphism, and prove lower bounds on the expressiveness of ESAN in terms of these new WL variants. We further prove that our approach increases the expressive power of both MPNNs and more expressive architectures. Moreover, we provide theoretical results that describe how design choices such as the subgraph selection policy and equivariant neural architecture affect our architecture’s expressive power. To deal with the increased computational cost, we propose a subgraph sampling scheme, which can be viewed as a stochastic version of our framework. A comprehensive set of experiments on real and synthetic datasets demonstrates that our framework improves the expressive power and overall performance of popular GNN architectures.
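The bag-of-subgraphs idea can be sketched in a few lines. The following hypothetical PyTorch snippet uses one of the paper's selection policies (edge deletion), a shared dense message-passing encoder, and mean aggregation over subgraph embeddings. The dense-adjacency representation, layer sizes, and all names are illustrative, not the authors' implementation (which also includes an information-sharing component across subgraphs):

```python
import torch
import torch.nn as nn

class DenseMPNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(2 * dim, dim)

    def forward(self, x, adj):
        msgs = adj @ x                                   # sum of neighbor features
        return torch.relu(self.lin(torch.cat([x, msgs], dim=-1)))

class ESANStyleEncoder(nn.Module):
    def __init__(self, dim, layers=2):
        super().__init__()
        self.layers = nn.ModuleList(DenseMPNNLayer(dim) for _ in range(layers))

    def edge_deletion_policy(self, adj):
        # One subgraph per deleted (undirected) edge -- one of the paper's policies.
        subgraphs = []
        for i, j in zip(*torch.triu(adj).nonzero(as_tuple=True)):
            a = adj.clone()
            a[i, j] = a[j, i] = 0.0
            subgraphs.append(a)
        return subgraphs or [adj]

    def forward(self, x, adj):
        embeddings = []
        for sub_adj in self.edge_deletion_policy(adj):
            h = x
            for layer in self.layers:
                h = layer(h, sub_adj)                    # weights shared across subgraphs
            embeddings.append(h.sum(dim=0))              # per-subgraph readout
        return torch.stack(embeddings).mean(dim=0)       # set aggregation over the bag

# Toy usage: a 4-node cycle with random node features.
adj = torch.tensor([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=torch.float)
enc = ESANStyleEncoder(dim=8)
print(enc(torch.randn(4, 8), adj).shape)  # torch.Size([8])
```

Two graphs that 1-WL cannot tell apart often yield different bags of edge-deleted subgraphs, which is what lets this construction distinguish them.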
3、[RO] OstrichRL: A Musculoskeletal Ostrich Simulation to Study Bio-mechanical Locomotion
V L Barbera, F Pardo, Y Tassa, M Daley, C Richards, P Kormushev, J Hutchinson
[Royal Veterinary College & Imperial College London & DeepMind & University of California, Irvine]
OstrichRL: a musculoskeletal ostrich simulation to study bio-mechanical locomotion. Muscle-actuated control is a research topic spanning several fields, in particular biomechanics, robotics, and graphics. This type of control is especially challenging because the models are often overactuated and the dynamics are delayed and nonlinear. It is, however, a well-tested and well-tuned actuation model that has undergone millions of years of evolution, with interesting properties such as exploiting the passive forces of muscle-tendon units and efficient energy storage and release. To facilitate research on muscle-actuated simulation, this paper releases a 3D musculoskeletal simulation of an ostrich based on the MuJoCo simulator. Ostriches are among the fastest bipeds on earth, making them an excellent model for studying muscle-actuated bipedal locomotion. The model is based on CT scans and dissections used to gather real muscle data such as insertion sites, lengths, and pennation angles. Alongside the model, a set of reinforcement learning tasks is provided, including reference motion tracking and a neck-reaching task. The reference motion data are based on motion capture clips of various behaviors, preprocessed and adapted to the model. The paper describes how the model was built and iteratively improved using these tasks, and evaluates the accuracy of the muscle actuation patterns by comparing them with electromyographic data experimentally collected from locomoting birds. By providing a fast and easy-to-use simulation, this work can serve as a useful bridge between the biomechanics, reinforcement learning, graphics, and robotics communities.
Muscle-actuated control is a research topic of interest spanning different fields, in particular biomechanics, robotics and graphics. This type of control is particularly challenging because models are often overactuated, and dynamics are delayed and non-linear. It is however a very well tested and tuned actuation model that has undergone millions of years of evolution and that involves interesting properties exploiting passive forces of muscle-tendon units and efficient energy storage and release. To facilitate research on muscle-actuated simulation, we release a 3D musculoskeletal simulation of an ostrich based on the MuJoCo simulator. Ostriches are one of the fastest bipeds on earth and are therefore an excellent model for studying muscle-actuated bipedal locomotion. The model is based on CT scans and dissections used to gather actual muscle data such as insertion sites, lengths and pennation angles. Along with this model, we also provide a set of reinforcement learning tasks, including reference motion tracking and a reaching task with the neck. The reference motion data are based on motion capture clips of various behaviors which we pre-processed and adapted to our model. This paper describes how the model was built and iteratively improved using the tasks. We evaluate the accuracy of the muscle actuation patterns by comparing them to experimentally collected electromyographic data from locomoting birds. We believe that this work can be a useful bridge between the biomechanics, reinforcement learning, graphics and robotics communities, by providing a fast and easy to use simulation.
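For a flavor of what muscle actuation looks like in MuJoCo (the simulator OstrichRL builds on), here is a toy, self-contained example with a single hinge joint driven by one muscle actuator. It is not the released ostrich model, and the XML values are illustrative; control inputs are muscle excitations in [0, 1], with MuJoCo applying activation dynamics and a force-length-velocity model internally:

```python
import mujoco   # official MuJoCo Python bindings
import numpy as np

# Minimal MJCF model: one capsule limb on a hinge, actuated by a single muscle.
XML = """
<mujoco>
  <option timestep="0.002"/>
  <worldbody>
    <body name="limb" pos="0 0 1">
      <joint name="hinge" type="hinge" axis="0 1 0" range="-60 60"/>
      <geom type="capsule" fromto="0 0 0  0 0 -0.4" size="0.04"/>
    </body>
  </worldbody>
  <actuator>
    <muscle name="flexor" joint="hinge" ctrlrange="0 1"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)
for step in range(500):
    data.ctrl[:] = np.abs(np.sin(0.01 * step))  # time-varying muscle excitation
    mujoco.mj_step(model, data)
print("final joint angle (rad):", data.qpos[0])
```

The released ostrich model scales this up to a full skeleton with dozens of muscle-tendon units whose parameters come from the CT and dissection data.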
4、[LG] Co-Adaptation of Algorithmic and Implementational Innovations in Inference-based Deep Reinforcement Learning
H Furuta, T Kozuno, T Matsushima, Y Matsuo, S S Gu
[The University of Tokyo & University of Alberta & Google Research]
Co-adaptation of algorithmic and implementational innovations in inference-based deep reinforcement learning. Many algorithms have recently been devised for reinforcement learning (RL) with function approximation. While they have clear algorithmic distinctions, they also have many algorithm-independent implementation differences that are sometimes under-emphasized. This mixing of algorithmic novelty and implementation craftsmanship makes it difficult to rigorously analyze the sources of performance improvements across algorithms. This paper focuses on a series of off-policy inference-based actor-critic algorithms (MPO, AWR, and SAC) to decouple their algorithmic innovations from their implementation decisions. It presents unified derivations through a single control-as-inference objective, categorizing each algorithm as based on either Expectation-Maximization (EM) or direct Kullback-Leibler (KL) divergence minimization, and treating the remaining specifications as implementation details. Extensive ablation studies reveal substantial performance drops whenever implementation details are mismatched with algorithmic choices. The results show which implementation or code details are co-adapted and co-evolved with algorithms, and which transfer across algorithms: for example, the tanh Gaussian policy and network sizes are highly adapted to algorithm type, while layer normalization and ELU are critical for MPO's performance but also transfer to noticeable gains in SAC. The hope is that this work inspires future efforts to further demystify the sources of performance improvements across algorithms and lets researchers build on one another's algorithmic and implementational innovations.
Recently many algorithms were devised for reinforcement learning (RL) with function approximation. While they have clear algorithmic distinctions, they also have many implementation differences that are algorithm-independent and sometimes under-emphasized. Such mixing of algorithmic novelty and implementation craftsmanship makes rigorous analyses of the sources of performance improvements across algorithms difficult. In this work, we focus on a series of off-policy inference-based actor-critic algorithms – MPO, AWR, and SAC – to decouple their algorithmic innovations and implementation decisions. We present unified derivations through a single control-as-inference objective, where we can categorize each algorithm as based on either Expectation-Maximization (EM) or direct Kullback-Leibler (KL) divergence minimization and treat the rest of the specifications as implementation details. We performed extensive ablation studies, and identified substantial performance drops whenever implementation details are mismatched with algorithmic choices. These results show which implementation or code details are co-adapted and co-evolved with algorithms, and which are transferable across algorithms: as examples, we identified that tanh Gaussian policy and network sizes are highly adapted to algorithmic types, while layer normalization and ELU are critical for MPO's performance but also transfer to noticeable gains in SAC. We hope our work can inspire future research to further demystify sources of performance improvements across multiple algorithms and allow researchers to build on one another's algorithmic and implementational innovations.
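The EM-versus-direct-KL split can be illustrated with schematic policy losses. In the hypothetical sketch below, `GaussianPolicy` and the toy critic exist only to make the snippet runnable, and neither is the authors' code: the EM family (MPO/AWR-like) fits the policy to advantage-reweighted dataset actions, while the direct-KL family (SAC-like) differentiates through reparameterized samples against the critic.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Minimal diagonal-Gaussian policy, only to make the sketch runnable.
    (The real SAC policy adds tanh squashing, which the paper identifies as
    one of the co-adapted implementation details.)"""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.mu = nn.Linear(s_dim, a_dim)
        self.log_std = nn.Parameter(torch.zeros(a_dim))

    def dist(self, states):
        return torch.distributions.Normal(self.mu(states), self.log_std.exp())

    def log_prob(self, states, actions):
        return self.dist(states).log_prob(actions).sum(-1)

    def rsample_with_log_prob(self, states):
        d = self.dist(states)
        a = d.rsample()
        return a, d.log_prob(a).sum(-1)

def em_style_loss(policy, states, actions, advantages, temperature=1.0):
    """E-step: reweight dataset actions by exponentiated advantage.
    M-step: fit the policy by weighted maximum likelihood (MPO/AWR-like)."""
    weights = torch.softmax(advantages / temperature, dim=0)
    return -(weights.detach() * policy.log_prob(states, actions)).sum()

def direct_kl_loss(policy, critic, states, alpha=0.2):
    """Direct KL minimization: push reparameterized samples uphill on the
    critic with an entropy bonus (the SAC objective)."""
    actions, logp = policy.rsample_with_log_prob(states)
    return (alpha * logp - critic(states, actions)).mean()

# Toy usage with random data and a critic that prefers actions near 0.5.
pol = GaussianPolicy(3, 2)
s, a, adv = torch.randn(8, 3), torch.randn(8, 2), torch.randn(8)
critic = lambda s, a: -((a - 0.5) ** 2).sum(-1)
print(em_style_loss(pol, s, a, adv).item(), direct_kl_loss(pol, critic, s).item())
```

Everything else (network sizes, normalization, activation functions) sits outside these two losses, which is exactly the "implementation detail" axis the ablations vary.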
5、[LG] On the Expressivity of Markov Reward
D Abel, W Dabney, A Harutyunyan, M K. Ho, M L. Littman, D Precup, S Singh
[DeepMind & Princeton University & Brown University]
On the expressivity of Markov reward. Reward is the driving force of reinforcement learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks we would want an agent to perform. The study is framed around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. The main results prove that while reward can express many such tasks, there exist instances of each task type that no Markov reward function can capture. The paper provides a set of polynomial-time algorithms that construct a Markov reward function allowing an agent to optimize tasks of each of these three types, and that correctly determine when no such reward function exists. An empirical study corroborates and illustrates the theoretical findings.
Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
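Because a policy's value is linear in the reward function, reward design for task type (1) can be posed as a linear program: feasibility yields a Markov reward realizing the task, and infeasibility certifies that none exists. The sketch below is a hypothetical toy version of this idea for small finite MDPs using scipy's `linprog`; the brute-force enumeration of deterministic policies and the start-state value comparison are simplifications, not the paper's exact algorithm:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def start_value_coeffs(P, policy, gamma, start):
    """Coefficients c such that V^pi(start) = c . r, with r indexed by (s, a).
    P has shape (n_states, n_actions, n_states); policy maps state -> action."""
    n_s, n_a = P.shape[0], P.shape[1]
    P_pi = np.array([P[s, policy[s]] for s in range(n_s)])
    occ = np.linalg.solve((np.eye(n_s) - gamma * P_pi).T,
                          np.eye(n_s)[start])          # discounted state visitation
    c = np.zeros((n_s, n_a))
    for s in range(n_s):
        c[s, policy[s]] = occ[s]
    return c.ravel()

def design_reward(P, gamma, start, acceptable, margin=1e-3):
    """Find r making every acceptable policy beat every other policy by
    `margin` at the start state; return None if the LP is infeasible."""
    n_s, n_a = P.shape[0], P.shape[1]
    acc = set(acceptable)
    A_ub, b_ub = [], []
    for good in acc:
        for bad in itertools.product(range(n_a), repeat=n_s):
            if bad in acc:
                continue
            # Constraint: V^bad(start) - V^good(start) <= -margin.
            A_ub.append(start_value_coeffs(P, bad, gamma, start)
                        - start_value_coeffs(P, good, gamma, start))
            b_ub.append(-margin)
    res = linprog(c=np.zeros(n_s * n_a), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(-1, 1)] * (n_s * n_a))      # feasibility LP, bounded rewards
    return res.x.reshape(n_s, n_a) if res.success else None

# Toy 2-state, 2-action MDP: action 0 stays in place, action 1 swaps states.
P = np.zeros((2, 2, 2))
P[:, 0] = np.eye(2)
P[:, 1] = np.eye(2)[::-1]
r = design_reward(P, gamma=0.9, start=0, acceptable=[(1, 1)])
print("found reward:" if r is not None else "no Markov reward exists", r)
```

The paper's impossibility results correspond to task instances where this kind of program has no solution, for example when the acceptable set cannot be separated from the rest by any state-action reward.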
Several other papers worth noting:
[CV] Efficient Training of Visual Transformers with Small Datasets
Y Liu, E Sangineto, W Bi, N Sebe, B Lepri, M D Nadai
[University of Trento & Tencent AI Lab & Fondazione Bruno Kessler]
[LG] Weisfeiler and Lehman Go Topological: Message Passing Simplicial Networks
C Bodnar, F Frasca, Y G Wang, N Otter, G Montúfar, P Liò, M Bronstein
[University of Cambridge & Twitter & Max Planck Institute for Mathematics in the Sciences & University of California, Los Angeles]
[LG] Neural Flows: Efficient Alternative to Neural ODEs
M Biloš, J Sommer, S S Rangapuram, T Januschowski, S Günnemann
[Technical University of Munich & AWS AI Labs]
[LG] The Autodidactic Universe
S Alexander, W J. Cunningham, J Lanier, L Smolin, S Stanojevic, M W. Toomey, D Wecker
[Brown University & Perimeter Institute for Theoretical Physics & Microsoft Research]