LG - Machine Learning · CV - Computer Vision · CL - Computation and Language · AS - Audio and Speech · RO - Robotics

Reposted from 爱可可爱生活

 

1、[LG] CoMPS: Continual Meta Policy Search

G Berseth, Z Zhang, G Zhang, C Finn, S Levine

[Berkeley AI Research & Stanford]

CoMPS: Continual Meta-Policy Search. A continual meta-learning method for sequential multi-task RL, where an agent must quickly achieve high reward over an arbitrary sequence of tasks. Prior meta-RL algorithms accelerate the acquisition of new tasks but require access to all tasks during training; CoMPS removes that limitation by meta-training incrementally over each task in the sequence, without revisiting prior tasks. It repeats two subroutines: learn the new task with RL, then use that experience for fully offline meta-learning to prepare for subsequent tasks. CoMPS outperforms prior continual-learning and off-policy meta-RL methods on several sequences of challenging continuous-control tasks.

We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent’s goal is to achieve high reward over any sequence of tasks quickly. Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks. However, they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training in an incremental fashion, over each task in a sequence, without revisiting prior tasks. CoMPS continuously repeats two subroutines: learning a new task using RL and using the experience from RL to perform completely offline meta-learning to prepare for subsequent task learning. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement methods on several sequences of challenging continuous control tasks.
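The two-subroutine alternation described above can be illustrated with a toy sketch (purely illustrative scaffolding, not the authors' algorithm): each "task" is a target scalar, "RL" is gradient descent on a squared error, and "offline meta-learning" refits the initialization from the replayed experience of all tasks seen so far.

```python
# Toy sketch of the CoMPS loop. All names and the scalar "tasks" are
# illustrative assumptions; the real method meta-trains a policy.

def run_rl(theta, target, steps=50, lr=0.2):
    """Toy RL on one task: gradient steps on (theta - target)^2,
    logging every iterate as 'experience'."""
    experience = []
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)
        experience.append(theta)
    return theta, experience

def comps(task_sequence):
    meta_init = 0.0              # meta-learned initialization
    replay = []                  # experience kept from all previous tasks
    for target in task_sequence:
        # subroutine 1: learn the new task with RL from the meta-learned init
        _, experience = run_rl(meta_init, target)
        replay.extend(experience)
        # subroutine 2: fully offline meta-learning on accumulated experience
        # (here: move the init toward the mean of the replayed iterates)
        meta_init = sum(replay) / len(replay)
    return meta_init

print(comps([1.0, 2.0, 3.0]))
```

The key structural point the sketch preserves is that step 2 consumes only logged experience — no environment interaction and no revisiting of earlier tasks.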

 

 

2、[CV] I M Avatar: Implicit Morphable Head Avatars from Videos

Y Zheng, V F Abrevaya, X Chen, M C. Bühler, M J. Black, O Hilliges

[ETH Zurich & Max Planck Institute for Intelligent Systems]

IMavatar: implicit morphable head avatars from videos. Traditional morphable face models give fine-grained control over expression but cannot easily capture geometric and appearance detail, while neural volumetric representations approach photo-realism but are hard to animate and generalize poorly to unseen expressions. IMavatar (Implicit Morphable avatar) learns implicit head avatars from monocular video: inspired by the fine-grained control of conventional 3DMMs, it represents expression- and pose-related deformations via learned blendshapes and skinning fields. These pose-independent attributes morph the canonical geometry and texture fields given novel expression and pose parameters. Ray tracing with iterative root-finding locates the canonical surface intersection for each pixel, and a key contribution — a novel analytical gradient formulation — enables end-to-end training from video. Quantitatively and qualitatively, the method improves geometry and covers a more complete expression space than state-of-the-art methods.

Traditional morphable face models provide fine-grained control over expression but cannot easily capture geometric and appearance details. Neural volumetric representations approach photo-realism but are hard to animate and do not generalize well to unseen expressions. To tackle this problem, we propose IMavatar (Implicit Morphable avatar), a novel method for learning implicit head avatars from monocular videos. Inspired by the fine-grained control mechanisms afforded by conventional 3DMMs, we represent the expression- and pose-related deformations via learned blendshapes and skinning fields. These attributes are pose-independent and can be used to morph the canonical geometry and texture fields given novel expression and pose parameters. We employ ray tracing and iterative root-finding to locate the canonical surface intersection for each pixel. A key contribution is our novel analytical gradient formulation that enables end-to-end training of IMavatars from videos. We show quantitatively and qualitatively that our method improves geometry and covers a more complete expression space compared to state-of-the-art methods.
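The "ray tracing and iterative root-finding" step can be illustrated with generic sphere tracing against an implicit signed-distance function — here a hard-coded sphere, standing in for the paper's learned canonical fields (a minimal sketch, not the authors' implementation).

```python
# Sphere tracing: march along a ray in steps equal to the current signed
# distance until the implicit surface (the root of the SDF) is reached.
# The sphere SDF is an illustrative stand-in for a learned implicit surface.

def sdf_sphere(p, center=(0.0, 0.0, 3.0), radius=1.0):
    dx, dy, dz = (p[i] - center[i] for i in range(3))
    return (dx * dx + dy * dy + dz * dz) ** 0.5 - radius

def ray_march(origin, direction, max_steps=100, eps=1e-5):
    """Iterative root-finding for the surface intersection along a ray."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(origin[i] + t * direction[i] for i in range(3))
        d = sdf_sphere(p)
        if d < eps:
            return t          # intersection parameter along the ray
        t += d                # safe step: cannot overshoot the surface
    return None               # ray missed the surface

t = ray_march((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(t)  # 2.0: a unit sphere centered at z = 3 is first hit at z = 2
```

In IMavatar this intersection is found in the canonical space after undoing the learned deformation, which is what makes the analytical gradient formulation necessary for end-to-end training.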

 

3、[CL] Step-unrolled Denoising Autoencoders for Text Generation

N Savinov, J Chung, M Binkowski, E Elsen, A v d Oord

[DeepMind]

SUNDAE: Step-unrolled Denoising Autoencoders for text generation, a new generative model of text that does not rely on autoregression. As in denoising diffusion, SUNDAE is applied repeatedly to a sequence of tokens, starting from random inputs and improving them each iteration until convergence. A simple new improvement operator converges in fewer iterations than diffusion methods while producing qualitatively better samples on natural-language datasets. SUNDAE achieves state-of-the-art results among non-autoregressive methods on WMT'14 English-to-German translation, and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a GitHub Python-code dataset. Its non-autoregressive nature opens up possibilities beyond left-to-right prompted generation, such as filling in arbitrary blank patterns in a template.

In this paper we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT’14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template.
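The refinement loop above — random tokens repeatedly improved until convergence — can be sketched with a toy stand-in for the trained denoiser (illustrative only; a real SUNDAE step resamples every position from the model's conditionals given the current sequence, and stops when the sequence no longer changes).

```python
import random

# Toy SUNDAE-style iterative refinement. TARGET and the 0.5 correction
# probability are illustrative assumptions replacing a trained model.

TARGET = "sundae"

def denoise_step(tokens):
    """Toy improvement operator: each position is corrected with prob. 0.5."""
    return [TARGET[i] if random.random() < 0.5 else t
            for i, t in enumerate(tokens)]

def generate(length, iters=200, seed=0):
    random.seed(seed)
    # start from random tokens, not a left-to-right prefix
    tokens = [random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length)]
    for _ in range(iters):        # a real model stops once tokens converge
        tokens = denoise_step(tokens)
    return "".join(tokens)

print(generate(6))
```

Because every position is revised in parallel at each step, the same loop supports filling arbitrary blank patterns: fix the known positions and let the operator update only the blanks.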

 

 

4、[CL] Sparse Interventions in Language Models with Differentiable Masking

N D Cao, L Schmid, D Hupkes, I Titov

[University of Amsterdam & University of Osnabrück & Facebook AI Research]

Sparse interventions in language models with differentiable masking. There is long-standing interest in what information the hidden representations of language models (LMs) capture, yet interpretation methods typically (i) do not guarantee that the model actually uses the encoded information and (ii) do not isolate the small subsets of neurons responsible for the phenomenon under study. Inspired by causal mediation analysis, this paper proposes a method that discovers, within a neural LM, a small subset of neurons responsible for a particular linguistic phenomenon, i.e. a subset whose intervention changes the corresponding token emission probabilities. The combinatorial search is approximated with a differentiable relaxation, and an L0 regularization term drives it toward discrete, sparse solutions. Applied to subject-verb number agreement and gender bias detection in LSTMs, the method is fast and finds better solutions than the alternative (REINFORCE); the experiments confirm that each phenomenon is mediated by a small set of neurons that play no other discernible role.

There has been a lot of interest in understanding what information is captured by hidden representations of language models (LMs). Typically, interpretation methods i) do not guarantee that the model actually uses the encoded information, and ii) do not discover small subsets of neurons responsible for a considered phenomenon. Inspired by causal mediation analysis, we propose a method that discovers within a neural LM a small subset of neurons responsible for a particular linguistic phenomenon, i.e., subsets causing a change in the corresponding token emission probabilities. We use a differentiable relaxation to approximately search through the combinatorial space. An L0 regularization term ensures that the search converges to discrete and sparse solutions. We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs. We observe that it is fast and finds better solutions than the alternative (REINFORCE). Our experiments confirm that each of these phenomena is mediated through a small subset of neurons that do not play any other discernible role.
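The differentiable relaxation plus L0 penalty can be sketched on a toy probe (a minimal sketch under strong assumptions — a linear "model", a sigmoid relaxation in place of the hard-concrete gates, and a mean penalty standing in for the L0 term; none of this is the paper's code):

```python
import math

# Find the smallest set of "neurons" whose ablation drives a linear probe's
# output to its counterfactual value (zero). Only neurons 0 and 5 matter.

W = [3.0, 0, 0, 0, 0, 2.0, 0, 0, 0, 0]   # probe weights
H = [1.0] * 10                            # observed activations
LAM, LR, STEPS = 0.1, 0.5, 2000           # L0 weight, step size, iterations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def learn_mask():
    s = [0.0] * len(W)                    # mask logits, m_i = sigmoid(s_i)
    for _ in range(STEPS):
        m = [sigmoid(si) for si in s]
        # intervened output: masked neurons replaced by counterfactual zeros
        r = sum(w * (1 - mi) * h for w, mi, h in zip(W, m, H))
        for i in range(len(s)):
            dm_ds = sigmoid(s[i]) * (1 - sigmoid(s[i]))
            # gradient of r^2 + LAM * sum(m) with respect to s_i
            g = (2 * r * (-W[i] * H[i]) + LAM) * dm_ds
            s[i] -= LR * g
    return [sigmoid(si) for si in s]

mask = learn_mask()
print([round(mi, 2) for mi in mask])
```

The squared term rewards masks that flip the output; the penalty prices every open gate, so only the two causally responsible neurons stay masked-in — the same trade-off the paper optimizes at LM scale.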

 

5、[CV] Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

Q Li, B Gong, Y Cui, D Kondratyuk, X Du, M Yang, M Brown

[Google Research]

Toward a unified foundation model: jointly pre-training Transformers on unpaired images and text. The paper explores building a single foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, it designs a unified Transformer consisting of modality-specific tokenizers, a shared Transformer encoder, and task-specific output heads. To efficiently pre-train the model jointly on unpaired images and text, two techniques are proposed: (i) knowledge distillation, with separately trained BERT and ViT models as teachers, providing additional accurate supervision signals for the joint training; and (ii) a novel gradient-masking strategy that balances the parameter updates coming from the image and text pre-training losses. Evaluated by fine-tuning on image classification and natural-language understanding tasks, the resulting unified foundation Transformer works surprisingly well on both vision-only and text-only tasks, and the distillation and gradient masking effectively lift performance toward the level of separately trained models.

In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads. To efficiently pre-train the proposed model jointly on unpaired images and text, we propose two novel techniques: (i) We employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals for the joint training; (ii) We propose a novel gradient masking strategy to balance the parameter updates from the image and text pre-training losses. We evaluate the jointly pre-trained transformer by fine-tuning it on image classification tasks and natural language understanding tasks, respectively. The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
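One way to picture a gradient-masking update is below. The specific criterion used here — per parameter, apply only the modality whose gradient has larger magnitude — is our illustrative assumption, not necessarily the authors' rule; the point is only the mechanism of masking one loss's update per parameter.

```python
# Hypothetical gradient-masking step: for each shared parameter, keep the
# update from exactly one of the two pre-training losses, so neither the
# image loss nor the text loss dominates every parameter.
# The magnitude-based selection rule is an assumption for illustration.

def masked_update(params, grad_image, grad_text, lr=0.1):
    out = []
    for p, gi, gt in zip(params, grad_image, grad_text):
        g = gi if abs(gi) >= abs(gt) else gt   # mask out the other modality
        out.append(p - lr * g)
    return out

params = masked_update([0.0, 0.0, 0.0],
                       grad_image=[1.0, 0.1, -0.5],
                       grad_text=[0.2, -2.0, 0.5])
print(params)  # [-0.1, 0.2, 0.05]
```

Contrast this with naively summing the two losses, where a consistently larger gradient from one modality would pull every shared parameter toward that modality alone.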

 

 

A few more papers worth noting:

 

[AI] Frontiers in Collective Intelligence: A Workshop Report

A workshop report on the frontiers of collective intelligence

T Millhouse, M Moses, M Mitchell

[Santa Fe Institute & University of New Mexico]

 

 

[RO] Active Inference in Robotics and Artificial Agents: Survey and Challenges

A survey of active inference in robotics and artificial agents, with open challenges

P Lanillos, C Meo, C Pezzato, A A Meera, M Baioumy, W Ohata, A Tschantz, B Millidge, M Wisse, C L. Buckley, J Tani

[Radboud University & Delft University of Technology & Oxford University]

 

 

[LG] Score-Based Generative Modeling with Critically-Damped Langevin Diffusion

Score-based generative modeling with critically-damped Langevin diffusion

T Dockhorn, A Vahdat, K Kreis

[NVIDIA]

 

[CV] CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

CLIP-Lite: information-efficient visual representation learning from textual annotations

A Shrivastava, R R. Selvaraju, N Naik, V Ordonez

[University of Virginia & Salesforce Research & Rice University]

 
