LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics
Reposted from 爱可可爱生活
Summary: pre-training speech-to-text encoder-decoder models using pseudo languages; learning realistic and diverse agents for autonomous driving simulation; CLIP-guided collage and photomontage; learning intrinsic mappings of arbitrary meshes; a white-box deep network from the principle of maximizing rate reduction; bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation; language-guided curriculum learning for visual question answering on remote sensing data; the unreliability of explanations in few-shot in-context learning; an integrative and composable multimodal learning framework
1. [CL] Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
F Wu, K Kim, S Watanabe, K Han, R McDonald, K Q Weinberger, Y Artzi
[ASAPP Inc. & CMU]
Wav2Seq: pre-training speech-to-text encoder-decoder models using pseudo languages. This paper presents Wav2Seq, the first self-supervised method to pre-train both parts of an encoder-decoder model for speech data. Wav2Seq requires only raw audio and pre-trains the encoder and decoder parameters simultaneously, so both main components of the common encoder-decoder architecture benefit from pre-training. It induces a pseudo language as a compact discrete representation and formulates a self-supervised pseudo speech recognition task: transcribing audio inputs into pseudo subword sequences. This process can stand on its own or be applied as a low-cost second stage of pre-training. Wav2Seq closes the performance gap between encoder-decoder models and CTC models under low-resource ASR conditions. Experiments cover automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. Wav2Seq sets new state-of-the-art results for end-to-end spoken named entity recognition and shows consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. On ASR, the proposed approach lets encoder-decoder methods benefit from pre-training of all parts of the network and shows performance comparable to highly optimized recent methods.
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task — transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. On ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
https://arxiv.org/abs/2205.01086
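The pseudo speech recognition idea above — quantizing audio features into discrete units, then collapsing consecutive repeats into a compact pseudo-subword sequence — can be sketched as follows. This is a minimal sketch: the toy 2-D "features" and hand-picked centroids stand in for the paper's self-supervised representations and learned clusters, and the subsequent BPE step over the deduplicated units is omitted.

```python
import numpy as np

def pseudo_transcribe(features, centroids):
    """Assign each feature frame to its nearest centroid (a discrete pseudo
    unit), then collapse consecutive repeats into a compact sequence."""
    # (T, K) matrix of frame-to-centroid distances
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    units = dists.argmin(axis=1)
    # deduplicate runs of identical units, as in pseudo-language construction
    collapsed = [int(units[0])]
    for u in units[1:]:
        if u != collapsed[-1]:
            collapsed.append(int(u))
    return collapsed

# toy example: six 2-D "feature frames" and three cluster centroids
feats = np.array([[0.1, 0.0], [0.0, 0.1], [1.0, 1.1],
                  [1.1, 1.0], [2.0, 2.0], [2.1, 1.9]])
cents = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
print(pseudo_transcribe(feats, cents))  # → [0, 1, 2]
```

The collapsed unit sequence is what the decoder is trained to emit during self-supervised pre-training, in place of real transcripts.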
2. [LG] Symphony: Learning Realistic and Diverse Agents for Autonomous Driving Simulation
M Igl, D Kim, A Kuefler, P Mougin, P Shah, K Shiarlis, D Anguelov, M Palatucci, B White, S Whiteson
[Waymo Research]
Symphony: learning realistic and diverse agents for autonomous driving simulation. Simulation is a crucial tool for accelerating the development of autonomous vehicles. Making simulation realistic requires models of the human road users who interact with these cars. Such models can be obtained by applying learning from demonstration (LfD) to trajectories observed by cars already on the road. However, existing LfD methods are typically insufficient, yielding policies that frequently collide or drive off the road. To address this, the paper proposes Symphony, which greatly improves realism by combining conventional policies with a parallel beam search. The beam search refines these policies on the fly by pruning branches the discriminator evaluates as unfavourable. However, it can also harm diversity, i.e., how well the agents cover the entire distribution of realistic behaviour, since pruning encourages mode collapse. Symphony addresses this with a hierarchical approach, factoring agent behaviour into goal generation and goal conditioning. The use of such goals ensures that agent diversity neither disappears during adversarial training nor is pruned away by the beam search. Experiments on both proprietary and open Waymo datasets confirm that Symphony agents learn more realistic and diverse behaviour than several baselines.
Simulation is a crucial tool for accelerating the development of autonomous vehicles. Making simulation realistic requires models of the human road users who interact with such cars. Such models can be obtained by applying learning from demonstration (LfD) to trajectories observed by cars already on the road. However, existing LfD methods are typically insufficient, yielding policies that frequently collide or drive off the road. To address this problem, we propose Symphony, which greatly improves realism by combining conventional policies with a parallel beam search. The beam search refines these policies on the fly by pruning branches that are unfavourably evaluated by a discriminator. However, it can also harm diversity, i.e., how well the agents cover the entire distribution of realistic behaviour, as pruning can encourage mode collapse. Symphony addresses this issue with a hierarchical approach, factoring agent behaviour into goal generation and goal conditioning. The use of such goals ensures that agent diversity neither disappears during adversarial training nor is pruned away by the beam search. Experiments on both proprietary and open Waymo datasets confirm that Symphony agents learn more realistic and diverse behaviour than several baselines.
https://arxiv.org/abs/2205.03195
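A minimal sketch of the pruning idea behind Symphony's parallel beam search: roll candidate trajectories forward, and at each step keep only the branches a discriminator scores as most realistic. The `step`, `policies`, and `disc` callables below are toy stand-ins, not Waymo's simulator or learned models.

```python
def beam_search_rollout(policies, discriminator, step, init_state, horizon, beam_width):
    """Roll candidate trajectories forward in parallel; at each step keep only
    the beam_width branches the discriminator scores as most realistic."""
    beams = [[init_state]]
    for _ in range(horizon):
        candidates = [traj + [step(traj[-1], policy(traj[-1]))]
                      for traj in beams for policy in policies]
        candidates.sort(key=discriminator, reverse=True)  # prune unrealistic branches
        beams = candidates[:beam_width]
    return beams[0]  # highest-scoring surviving trajectory

# toy setting: the state is a 1-D position, two policies move +1 or +3 per step,
# and the "discriminator" prefers trajectories that advance about 1 unit per step
policies = [lambda s: 1, lambda s: 3]
step = lambda s, a: s + a
disc = lambda traj: -abs(traj[-1] - (len(traj) - 1))
best = beam_search_rollout(policies, disc, step, 0, horizon=4, beam_width=2)
print(best)  # → [0, 1, 2, 3, 4]
```

Note the failure mode the paper guards against: with a narrow beam, the discriminator's preference can collapse all surviving branches onto one behaviour mode, which is why Symphony additionally conditions agents on generated goals.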
3. [CV] CLIP-CLOP: CLIP-Guided Collage and Photomontage
P Mirowski, D Banarse, M Malinowski, S Osindero, C Fernando
[DeepMind]
CLIP-CLOP: CLIP-guided collage and photomontage. The unabated mystique of large-scale neural networks, such as the CLIP dual image-and-text encoder, has popularized automatically generated art. Increasingly sophisticated generators have enhanced the realism and visual appearance of artworks, and creative prompt engineering has enabled stylistic expression. Guided by an artist-in-the-loop ideal, this paper designs a gradient-based generator that produces collages, combining it with popular dual image-and-text encoders such as CLIP. It requires the human artist to curate libraries of image patches and to describe (via prompts) the whole image composition, with the option to manually adjust patch positions during generation, thereby letting humans reclaim some control of the process and achieve greater creative freedom.
The unabated mystique of large-scale neural networks, such as the CLIP dual image-and-text encoder, popularized automatically generated art. Increasingly more sophisticated generators enhanced the artworks’ realism and visual appearance, and creative prompt engineering enabled stylistic expression. Guided by an artist-in-the-loop ideal, we design a gradient-based generator to produce collages. It requires the human artist to curate libraries of image patches and to describe (with prompts) the whole image composition, with the option to manually adjust the patches’ positions during generation, thereby allowing humans to reclaim some control of the process and achieve greater creative freedom. We explore the aesthetic potentials of high-resolution collages, and provide an open-source Google Colab as an artistic tool.
https://arxiv.org/abs/2205.03146
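The outer loop — adjust patch placements so a similarity score between the rendered collage and the text prompt goes up — can be sketched as follows. This is a heavily simplified stand-in: the real system backpropagates through CLIP's image-text similarity, whereas here a gradient-free hill climb optimizes against a toy score function that merely pulls patches toward the canvas centre.

```python
import random

def optimize_collage(patch_positions, score_fn, steps=300, step_size=0.05, seed=0):
    """Greedy stand-in for CLIP-CLOP's gradient ascent: jitter one patch
    position at a time and keep moves that raise the similarity score."""
    rng = random.Random(seed)
    positions = list(patch_positions)
    best = score_fn(positions)
    for _ in range(steps):
        i = rng.randrange(len(positions))
        x, y = positions[i]
        trial = positions[:]
        trial[i] = (x + rng.uniform(-step_size, step_size),
                    y + rng.uniform(-step_size, step_size))
        s = score_fn(trial)
        if s > best:  # accept only improving placements
            positions, best = trial, s
    return positions, best

# toy score: patches should gather near the canvas centre (0.5, 0.5);
# in the real system this would be CLIP similarity to the artist's prompt
score = lambda ps: -sum((x - 0.5) ** 2 + (y - 0.5) ** 2 for x, y in ps)
final, s = optimize_collage([(0.0, 0.0), (1.0, 1.0)], score)
```

The artist-in-the-loop aspect corresponds to pausing this loop and overwriting entries of `positions` by hand before resuming.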
4. [CV] Neural Jacobian Fields: Learning Intrinsic Mappings of Arbitrary Meshes
N Aigerman, K Gupta, V G. Kim, S Chaudhuri, J Saito, T Groueix
[Adobe Research & University of California San Diego]
Neural Jacobian Fields: learning intrinsic mappings of arbitrary meshes. This paper presents a framework designed to accurately predict piecewise linear mappings of arbitrary meshes via a neural network, enabling training and evaluation over heterogeneous collections of meshes that do not share a triangulation, and producing highly detail-preserving maps whose accuracy exceeds the current state of the art. The framework is based on reducing the neural aspect to the prediction of a matrix for a single given point, conditioned on a global shape descriptor. The field of matrices is then projected onto the tangent bundle of the given mesh and used as candidate Jacobians for the predicted map. The map is computed by a standard Poisson solve, implemented as a differentiable layer with cached pre-factorization for efficient training. This construction is agnostic to the triangulation of the input and can therefore be applied to datasets with varying triangulations. At the same time, by operating in the intrinsic gradient domain of each individual mesh, it allows the framework to predict highly accurate mappings. These properties are validated through experiments over a broad range of scenarios, from semantic ones such as morphing, registration, and deformation transfer, to optimization-based ones such as emulating elastic deformations and contact correction; this is also the first work to tackle the task of learning to compute UV parameterizations of arbitrary meshes. The results show the high accuracy of the method as well as its versatility, as it is readily applied to the above scenarios without any changes to the framework.
This paper introduces a framework designed to accurately predict piecewise linear mappings of arbitrary meshes via a neural network, enabling training and evaluating over heterogeneous collections of meshes that do not share a triangulation, as well as producing highly detail-preserving maps whose accuracy exceeds current state of the art. The framework is based on reducing the neural aspect to a prediction of a matrix for a single given point, conditioned on a global shape descriptor. The field of matrices is then projected onto the tangent bundle of the given mesh, and used as candidate jacobians for the predicted map. The map is computed by a standard Poisson solve, implemented as a differentiable layer with cached pre-factorization for efficient training. This construction is agnostic to the triangulation of the input, thereby enabling applications on datasets with varying triangulations. At the same time, by operating in the intrinsic gradient domain of each individual mesh, it allows the framework to predict highly-accurate mappings. We validate these properties by conducting experiments over a broad range of scenarios, from semantic ones such as morphing, registration, and deformation transfer, to optimization-based ones, such as emulating elastic deformations and contact correction, as well as being the first work, to our knowledge, to tackle the task of learning to compute UV parameterizations of arbitrary meshes. The results exhibit the high accuracy of the method as well as its versatility, as it is readily applied to the above scenarios without any changes to the framework.
https://arxiv.org/abs/2205.02904
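The Poisson step at the heart of the pipeline — recovering a map whose gradients best match the predicted candidate Jacobians — reduces to a linear least-squares solve. A minimal sketch on a toy 1-D "mesh" (a chain of four vertices), assuming a simple edge-difference operator in place of the paper's per-face gradient operators and cached pre-factorization:

```python
import numpy as np

def poisson_solve(G, J, anchor=0):
    """Recover per-vertex values x from target gradients J by solving the
    least-squares (Poisson) system G x ≈ J, pinning one vertex to remove
    the translational null space of the gradient operator."""
    n = G.shape[1]
    keep = [i for i in range(n) if i != anchor]  # pin vertex `anchor` to 0
    x_free, *_ = np.linalg.lstsq(G[:, keep], J, rcond=None)
    x = np.zeros(n)
    x[keep] = x_free
    return x

# toy "mesh": a chain of 4 vertices; G takes per-edge differences
G = np.array([[-1, 1, 0, 0],
              [0, -1, 1, 0],
              [0, 0, -1, 1]], dtype=float)
J = np.array([1.0, 2.0, 3.0])  # target per-edge gradients (candidate Jacobians)
print(poisson_solve(G, J))     # → [0. 1. 3. 6.]
```

When the predicted Jacobians are not exactly integrable (the usual case), the same solve returns the map whose gradients are closest in the least-squares sense, which is what makes the layer well-defined and differentiable.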
5. [LG] ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction
K H R Chan, Y Yu, C You, H Qi, J Wright, Y Ma
[UC Berkeley & Columbia University]
ReduNet: a white-box deep network from the principle of maximizing rate reduction. This work attempts to provide a plausible theoretical framework that interprets modern deep (convolutional) networks from the principles of data compression and discriminative representation. For high-dimensional multi-class data, the optimal linear discriminative representation maximizes the coding-rate difference between the whole dataset and the average over all the subsets. The basic iterative gradient-ascent scheme for optimizing the rate-reduction objective naturally leads to a multi-layer deep network, ReduNet, which shares common characteristics of modern deep networks. The deep layered architecture, linear and nonlinear operators, and even the parameters of the network are all explicitly constructed layer by layer via forward propagation, although they can be fine-tuned via back propagation. All components of the resulting "white-box" network have precise optimization, statistical, and geometric interpretations. Moreover, when classification is enforced to be rigorously shift-invariant, all linear operators of the derived network naturally become multi-channel convolutions. The derivation in the invariant setting suggests a trade-off between sparsity and invariance, and also indicates that such a deep convolutional network is significantly more efficient to construct and learn in the spectral domain. Preliminary simulations and experiments clearly verify the effectiveness of both the rate-reduction objective and the associated ReduNet.
This work attempts to provide a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation. We argue that for high-dimensional multi-class data, the optimal linear discriminative representation maximizes the coding rate difference between the whole dataset and the average of all the subsets. We show that the basic iterative gradient ascent scheme for optimizing the rate reduction objective naturally leads to a multi-layer deep network, named ReduNet, which shares common characteristics of modern deep networks. The deep layered architectures, linear and nonlinear operators, and even parameters of the network are all explicitly constructed layer-by-layer via forward propagation, although they are amenable to fine-tuning via back propagation. All components of so-obtained “white-box” network have precise optimization, statistical, and geometric interpretation. Moreover, all linear operators of the so-derived network naturally become multi-channel convolutions when we enforce classification to be rigorously shift-invariant. The derivation in the invariant setting suggests a trade-off between sparsity and invariance, and also indicates that such a deep convolution network is significantly more efficient to construct and learn in the spectral domain. Our preliminary simulations and experiments clearly verify the effectiveness of both the rate reduction objective and the associated ReduNet. All code and data are available at https://github.com/Ma-Lab-Berkeley.
https://jmlr.org/papers/v23/21-0631.html
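The rate-reduction objective ΔR = R(Z) − Σ_j (n_j/n) R(Z_j), with the log-det rate estimate R(Z, ε) = ½ log det(I + d/(nε²) Z Zᵀ), is straightforward to compute directly. A minimal numpy sketch; the toy two-class data below is an assumption for illustration, chosen so the two classes occupy orthogonal subspaces and ΔR comes out positive.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z, eps): log-det estimate of the bits needed to encode the columns
    of Z (one sample per column) up to distortion eps."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps=0.5):
    """Delta R: rate of the whole dataset minus the sample-weighted average
    rate of the per-class subsets."""
    n = Z.shape[1]
    avg = sum((np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
              for c in np.unique(labels))
    return coding_rate(Z, eps) - avg

# two classes lying on orthogonal axes: a maximally discriminative arrangement
Z = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(rate_reduction(Z, labels) > 0)  # orthogonal classes → positive rate reduction
```

ReduNet's layers are derived as gradient-ascent steps on exactly this quantity, which is why every operator in the network admits a closed-form, interpretable construction.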
A few more papers worth noting:
[RO] How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation
A X Lee, C Devin, J T Springenberg, Y Zhou, T Lampe, A Abdolmaleki... [DeepMind] (2022) https://arxiv.org/abs/2205.03353
[CV] From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data
Z Yuan, L Mou, Q Wang, X X Zhu
https://arxiv.org/abs/2205.03147
[CL] The Unreliability of Explanations in Few-Shot In-Context Learning
X Ye, G Durrett
[The University of Texas at Austin]
https://arxiv.org/abs/2205.03401
[LG] i-Code: An Integrative and Composable Multimodal Learning Framework
Z Yang, Y Fang, C Zhu, R Pryzant, D Chen, Y Shi, Y Xu, Y Qian, M Gao, Y Chen, L Lu, Y Xie, R Gmyr...
[Microsoft Azure Cognitive Services Research]
https://arxiv.org/abs/2205.01818