爱可可 AI Frontier Picks (10.27)

LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics

Reposted from 爱可可爱生活

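Summary: in-context reinforcement learning with algorithm distillation; high-fidelity neural audio compression; explainable planning Transformers via object-level representations; sim-to-real transfer of agile in-hand manipulation; the versatile uses of partial distance correlation in deep learning; learning general world models in a handful of reward-free deployments; finding dataset shortcuts with grammar induction; subspace-based set operations on a pretrained word embedding space; real-time dense monocular SLAM with neural radiance fields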

1. [LG] In-context Reinforcement Learning with Algorithm Distillation

M Laskin, L Wang, J Oh, E Parisotto, S Spencer, R Steigerwald, D Strouse, S Hansen, A Filos, E Brooks, M Gazeau, H Sahni, S Singh, V Mnih
[DeepMind]
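In-context reinforcement learning with algorithm distillation. This paper proposes Algorithm Distillation (AD), a method that distills a reinforcement learning (RL) algorithm into a neural network by modeling the algorithm's training history with a causal sequence model. AD treats reinforcement learning as a sequential prediction problem across training episodes: a source RL algorithm generates a dataset of learning histories, and a causal transformer is trained to autoregressively predict actions with the preceding learning history as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD can improve its policy entirely in context, without updating its network parameters. The paper shows that AD performs in-context reinforcement learning in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and finds that the RL algorithm AD learns is more data-efficient than the one that generated the source data.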

We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data.

https://arxiv.org/abs/2210.14215
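
The across-episode prediction setup described in the abstract can be illustrated with a minimal sketch of how training examples might be cut from one learning history. This is an assumption-laden illustration, not the paper's implementation: the function name `build_ad_examples`, the `(observation, action, reward)` tuple layout, and the fixed `context_len` window are all hypothetical. The one point it is meant to show is that the context window is taken from a history that concatenates many episodes, so it can span episode boundaries.

```python
from typing import List, Tuple

# Hypothetical transition layout: (observation, action, reward).
Transition = Tuple[int, int, float]


def build_ad_examples(history: List[Transition], context_len: int):
    """Cut one learning history into (context, observation, target_action) triples.

    `history` concatenates transitions from many episodes in the order the
    source RL algorithm produced them, so a context window may cross
    episode boundaries and expose how the source policy improved.
    """
    examples = []
    for t in range(1, len(history)):
        start = max(0, t - context_len)
        context = history[start:t]       # preceding transitions only (causal)
        obs, action, _ = history[t]      # current observation, action to predict
        examples.append((context, obs, action))
    return examples


if __name__ == "__main__":
    # Toy history: early transitions earn no reward, later ones do.
    toy_history = [(0, 1, 0.0), (1, 0, 0.0), (0, 1, 1.0), (1, 1, 1.0)]
    for context, obs, action in build_ad_examples(toy_history, context_len=2):
        print(context, obs, action)
```

A sequence model trained on such triples would predict `action` from `context` and `obs`; at evaluation time its own rollouts would fill the context instead of the source algorithm's.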

2. [AS] High Fidelity Neural Audio Compression

A Défossez, J Copet, G Synnaeve, Y Adi
[Meta AI]
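High-fidelity neural audio compression. This paper introduces a state-of-the-art, real-time, high-fidelity neural audio codec. It consists of a streaming encoder-decoder architecture whose quantized latent space is trained end to end. Training is simplified and sped up by using a single multiscale spectrogram adversary, which efficiently reduces artifacts and yields high-quality samples. A new loss balancer mechanism stabilizes training: the weight of a loss now defines the fraction of the overall gradient it should represent, decoupling the choice of this hyperparameter from the typical scale of the loss. The paper also studies how lightweight Transformer models can further compress the resulting representation by up to 40% while staying faster than real time. It details the key design choices of the model, including the training objective, architectural changes, and a study of various perceptual loss functions, and presents an extensive subjective evaluation (MUSHRA tests) together with an ablation study over a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. The approach outperforms the baseline methods in all evaluated settings, for both 24 kHz monophonic and 48 kHz stereophonic audio.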

We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available online.

https://arxiv.org/abs/2210.134
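
The loss balancer idea in the abstract, where a loss weight sets the fraction of the overall gradient rather than a multiplier on the raw loss, can be sketched as follows. This is a simplified illustration under stated assumptions, not the codec's actual implementation: the function name `balance_gradients` and the `total_norm` parameter are hypothetical, and each gradient is rescaled by its instantaneous norm, whereas a practical balancer would more likely use a smoothed norm estimate.

```python
import numpy as np


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each contributes a fixed share.

    grads:   dict name -> gradient of that loss w.r.t. the model output
    weights: dict name -> non-negative weight; the share of loss i is
             weights[i] / sum(weights), independent of the loss's raw scale
    """
    total_weight = sum(weights.values())
    combined = None
    for name, grad in grads.items():
        grad = np.asarray(grad, dtype=float)
        # Normalize away the loss's natural scale, then apply its share.
        share = weights[name] / total_weight
        term = total_norm * share * grad / (np.linalg.norm(grad) + eps)
        combined = term if combined is None else combined + term
    return combined


if __name__ == "__main__":
    # Two losses whose raw gradient scales differ by six orders of magnitude.
    grads = {"reconstruction": [1000.0, 0.0], "adversarial": [0.0, 0.001]}
    weights = {"reconstruction": 1.0, "adversarial": 3.0}
    print(balance_gradients(grads, weights))
```

In this toy case the adversarial term ends up three times larger than the reconstruction term, even though its raw gradient is far smaller, which is the decoupling the abstract describes.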

3. [RO] PlanT: Explainable Planning Transformers via Object-Level Representations

K Renz, K Chitta, O Mercea, A. S Koepke, Z Akata, A Geiger
[University of Tübingen]
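PlanT: explainable planning Transformers via object-level representations. Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. Whereas human drivers prioritize important objects and ignore details irrelevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. This paper proposes PlanT, a new approach to planning for self-driving that uses a standard Transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the expert's driving score) while running 5.3x faster at inference than equivalent pixel-based planning baselines. Combining PlanT with an off-the-shelf perception module yields a sensor-based driving system whose driving score is more than 10 points above the existing state of the art. The paper also proposes an evaluation protocol that quantifies a planner's ability to identify relevant objects, giving insight into its decision-making. The results indicate that PlanT can focus on the most relevant object in the scene even when that object is geometrically distant.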

Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant.

https://arxiv.org/abs/2210.14222

4. [RO] DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality

A Handa, A Allshire, V Makoviychuk, A Petrenko, R Singh...
[NVIDIA & University of Toronto & University of Southern California]
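DeXtreme: sim-to-real transfer of agile in-hand manipulation. Recent work has shown that deep reinforcement learning (RL) algorithms can learn complex robotic behaviors in simulation, including multi-fingered manipulation. However, such models are hard to transfer to the real world because of the gap between simulation and reality. This paper presents techniques for training (a) a policy that performs robust dexterous manipulation on an anthropomorphic robot hand and (b) a robust pose estimator that provides reliable real-time information about the state of the manipulated object. The policies are trained to adapt to a wide range of conditions in simulation. The vision-based policies significantly outperform the best vision policies in the literature on the same reorientation task and are competitive with policies that receive privileged state information from motion capture systems. The work reaffirms that sim-to-real transfer for dexterous manipulation is possible across different hardware and simulator setups, in this case the Allegro Hand and Isaac Gym GPU-based simulation. It also opens up the possibility for researchers to achieve such results with commonly available, affordable robot hands and cameras.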

Recent work has demonstrated the ability of deep reinforcement learning (RL) algorithms to learn complex robotic behaviours in simulation, including in the domain of multi-fingered manipulation. However, such models can be challenging to transfer to the real world due to the gap between simulation and reality. In this paper, we present our techniques to train a) a policy that can perform robust dexterous manipulation on an anthropomorphic robot hand and b) a robust pose estimator suitable for providing reliable real-time information on the state of the object being manipulated. Our policies are trained to adapt to a wide range of conditions in simulation. Consequently, our vision-based policies significantly outperform the best vision policies in the literature on the same reorientation task and are competitive with policies that are given privileged state information via motion capture systems. Our work reaffirms the possibilities of sim-to-real transfer for dexterous manipulation in diverse kinds of hardware and simulator setups, and in our case, with the Allegro Hand and Isaac Gym GPU-based simulation. Furthermore, it opens up possibilities for researchers to achieve such results with commonly-available, affordable robot hands and cameras. Videos of the resulting policy and supplementary information, including experiments and demos, are available online.

https://arxiv.org/abs/2210.13702

5. [CV] On the Versatile Uses of Partial Distance Correlation in Deep Learning

X Zhen, Z Meng, R Chakraborty, V Singh
[University of Wisconsin-Madison & Butlr]
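On the versatile uses of partial distance correlation in deep learning. Comparing the functional behavior of neural network models, whether of a single network over time or of two or more networks during or after training, is an essential step in understanding what they are learning (and what they are not) and in identifying strategies for regularization or efficiency improvements. Despite recent progress, such as comparisons of vision transformers with CNNs, systematic comparison of function, especially across different networks, remains difficult and is often carried out layer by layer. Approaches such as canonical correlation analysis (CCA) are applicable in principle but have been used sparingly so far. This paper revisits a (less widely known) method from statistics called distance correlation (and its partial variant), designed to evaluate correlation between feature spaces of different dimensions. It describes the steps needed to deploy the method on large-scale models, which opens the door to a surprising array of applications, including conditioning one deep model with respect to another, learning disentangled representations, and optimizing diverse models that are directly more robust to adversarial attacks. The experiments suggest a versatile regularizer (or constraint) with many advantages that avoids some of the common difficulties faced in such analyses.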

Comparing the functional behavior of neural network models, whether it is a single network over time or two (or more) networks during or post-training, is an essential step in understanding what they are learning (and what they are not), and for identifying strategies for regularization or efficiency improvements. Despite recent progress, e.g., comparing vision transformers to CNNs, systematic comparison of function, especially across different networks, remains difficult and is often carried out layer by layer. Approaches such as canonical correlation analysis (CCA) are applicable in principle, but have been sparingly used so far. In this paper, we revisit a (less widely known) method from statistics, called distance correlation (and its partial variant), designed to evaluate correlation between feature spaces of different dimensions. We describe the steps necessary to carry out its deployment for large scale models -- this opens the door to a surprising array of applications ranging from conditioning one deep model w.r.t. another, learning disentangled representations as well as optimizing diverse models that would directly be more robust to adversarial attacks. Our experiments suggest a versatile regularizer (or constraint) with many advantages, which avoids some of the common difficulties one faces in such analyses. Code is available online.

https://arxiv.org/abs/2207.09684
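
To make concrete the claim that distance correlation can compare feature spaces of different dimensions, here is a minimal NumPy sketch of the standard sample distance correlation (the plain statistic built from double-centered pairwise distance matrices). It is not the paper's code and does not implement the partial variant; the function names are hypothetical, and it assumes the two feature matrices have one row per shared input sample.

```python
import numpy as np


def _centered_distances(x):
    """Pairwise Euclidean distance matrix of the rows of x, double-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x (n, p) and y (n, q).

    p and q may differ: only the n-by-n distance matrices are compared.
    """
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()          # squared sample distance covariance
    dvar_x = (a * a).mean()         # squared sample distance variances
    dvar_y = (b * b).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(64, 8))     # stand-in for features of one network
    feats_b = rng.normal(size=(64, 32))    # stand-in for another network's features
    print(distance_correlation(feats_a, feats_b))
```

Because the statistic depends only on pairwise distances between samples, no projection to a common dimension is needed. The pairwise step costs memory quadratic in the batch size, so in practice it would be applied per mini-batch.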

Other papers worth noting:

[LG] Learning General World Models in a Handful of Reward-Free Deployments

Learning general world models from a handful of reward-free deployments.
Y Xu, J Parker-Holder, A Pacchiano, P J. Ball, O Rybkin...
[UCL & University of Oxford & Microsoft Research & UPenn]
https://arxiv.org/abs/2210.12719