LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics

Reposted from 爱可可爱生活

 

1、[CL] DeepNet: Scaling Transformers to 1,000 Layers

H Wang, S Ma, L Dong, S Huang, D Zhang, F Wei

[Microsoft Research]

DeepNet: scaling Transformers to 1,000 layers. This paper proposes a simple yet effective method for stabilizing extremely deep Transformers: a new normalization function (DEEPNORM) that modifies the Transformer's residual connections, paired with a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The method combines the best of both worlds, the good performance of Post-LN and the stable training of Pre-LN, making DEEPNORM a preferred alternative. The authors successfully scale Transformers to 1,000 layers (i.e., 2,500 attention and feed-forward sublayers), an order of magnitude deeper than previous deep Transformers. Notably, on a multilingual benchmark with 7,482 translation directions, a 200-layer model with 3.2B parameters outperforms the 48-layer, 12B-parameter state-of-the-art model by 5 BLEU points, pointing to a promising direction for scaling.

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DEEPNORM a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
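The abstract states the recipe (DEEPNORM plus a derived initialization) without the formula. As a minimal PyTorch sketch, assuming the encoder-only constants reported in the paper, α = (2N)^(1/4) and β = (8N)^(-1/4) for an N-layer stack, a DEEPNORM residual sublayer looks roughly like this; note that the paper applies the β-scaled initialization only to selected projections, which the sketch simplifies:

```python
import torch
import torch.nn as nn

class DeepNormSublayer(nn.Module):
    """DEEPNORM residual update: x_{l+1} = LayerNorm(alpha * x_l + G(x_l)).

    Sketch only: the paper applies the beta-scaled Xavier initialization
    just to the feed-forward weights and the attention value/output
    projections; scaling every weight matrix here is a simplification,
    and encoder-decoder models use different alpha/beta schedules.
    """
    def __init__(self, d_model: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25      # residual scale
        beta = (8 * num_layers) ** -0.25           # init gain
        self.sublayer = sublayer                   # attention or FFN block
        self.norm = nn.LayerNorm(d_model)
        for p in sublayer.parameters():
            if p.dim() > 1:                        # weight matrices only
                nn.init.xavier_normal_(p, gain=beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))
```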

 

 

2、[CV] Generative Adversarial Networks

G Cohen, R Giryes

[Tel Aviv University]

The principle mechanism of, and main challenges for, generative adversarial networks. Generative adversarial networks (GANs) are popular frameworks for generating high-quality data, widely used across many domains in both academia and industry. Arguably, their most substantial impact has been in computer vision, where they achieve state-of-the-art image generation. This chapter gives an overview of GANs, discussing their principle mechanism and presenting some of their inherent problems in training and evaluation, with a focus on three issues: (1) mode collapse, (2) vanishing gradients, and (3) generation of low-quality images. It then lists some architecture-variant and loss-variant GANs that remedy these challenges, and presents two real-world applications of GANs: data augmentation and face image generation.

Generative Adversarial Networks (GANs) are very popular frameworks for generating high-quality data, and are widely used in many domains in both academia and industry. Arguably, their most substantial impact has been in the area of computer vision, where they achieve state-of-the-art image generation. This chapter gives an introduction to GANs, by discussing their principle mechanism and presenting some of their inherent problems during training and evaluation. We focus on three issues: (1) mode collapse, (2) vanishing gradients, and (3) generation of low-quality images. We then list some architecture-variant and loss-variant GANs that remedy the above challenges. Lastly, we present two utilization examples of GANs for real-world applications: data augmentation and face image generation.
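For reference, the "principle mechanism" the chapter surveys is the two-player minimax game of Goodfellow et al. (2014), and the vanishing-gradient issue in item (2) is classically tied to the saturating generator loss, for which the non-saturating variant below is the standard remedy:

```latex
% GAN minimax objective (Goodfellow et al., 2014):
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

% Non-saturating generator loss, commonly used to mitigate
% vanishing generator gradients early in training:
\max_G \; \mathbb{E}_{z \sim p_z}\bigl[\log D(G(z))\bigr]
```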

 

 

3、[LG] Graph Attention Retrospective

K Fountoulakis, A Levi, S Yang, A Baranwal, A Jagannath

[University of Waterloo]

Graph attention retrospective. Graph-based learning is a rapidly growing subfield of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular types of models is the graph attention network, introduced to let a node aggregate information from its neighbors' features non-uniformly, in contrast to simple graph convolution, which does not distinguish among a node's neighbors. This paper studies this intended behavior of graph attention networks theoretically, proving multiple results on the performance of the graph attention mechanism for node classification under a contextual stochastic block model: node features are drawn from a mixture of Gaussians, edges come from a stochastic block model, and features and edges are coupled in a natural way. In an "easy" regime, where the distance between the Gaussian means is large enough, graph attention maintains the weights of intra-class edges and substantially reduces the weights of inter-class edges, implying perfect node classification independent of the inter-class edge weights. However, a classical argument shows that in the "easy" regime the graph is not needed at all to classify the data with high probability. In a "hard" regime, every attention mechanism fails to distinguish intra-class from inter-class edges. The theoretical results are evaluated on synthetic and real-world data.

Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular types of models is graph attention networks. These models were introduced to allow a node to aggregate information from the features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution, which does not distinguish the neighbors of a node. In this paper, we theoretically study this expected behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here the features of the nodes are obtained from a mixture of Gaussians and the edges from a stochastic block model, where the features and the edges are coupled in a natural way. First, we show that in an “easy” regime, where the distance between the means of the Gaussians is large enough, graph attention maintains the weights of intra-class edges and significantly reduces the weights of the inter-class edges. As a corollary, we show that this implies perfect node classification independent of the weights of inter-class edges. However, a classical argument shows that in the “easy” regime, the graph is not needed at all to classify the data with high probability. In the “hard” regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. We evaluate our theoretical results on synthetic and real-world data.
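To make the object of study concrete, here is a small NumPy sketch that samples a two-class contextual stochastic block model (Gaussian features coupled with SBM edges) and computes GAT-style attention coefficients e_ij = LeakyReLU(a^T [W h_i || W h_j]) normalized over each neighborhood. All dimensions, probabilities, and the single-head scoring are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 16                       # nodes per class, feature dim
p_intra, q_inter = 0.5, 0.1          # SBM edge probabilities
mu = rng.normal(size=d)              # class means are +mu and -mu

# Contextual SBM: labels, Gaussian features, SBM edges.
y = np.repeat([0, 1], n)
X = rng.normal(size=(2 * n, d)) + np.where((y == 0)[:, None], mu, -mu)
same = y[:, None] == y[None, :]
A = (rng.random((2 * n, 2 * n)) < np.where(same, p_intra, q_inter)).astype(float)
np.fill_diagonal(A, 1.0)             # self-loops so every row has an edge

# GAT-style scores: e_ij = LeakyReLU(a^T [W h_i || W h_j]).
W = rng.normal(size=(d, d)) / np.sqrt(d)
a_vec = rng.normal(size=2 * d)
H = X @ W
scores = (H @ a_vec[:d])[:, None] + (H @ a_vec[d:])[None, :]
scores = np.where(scores > 0, scores, 0.2 * scores)      # LeakyReLU(0.2)
scores = np.where(A > 0, scores, -np.inf)                # mask non-edges
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# With trained parameters in the "easy" regime, the paper predicts the
# inter-class mass below shrinks; random W and a give roughly uniform mass.
intra = (alpha * (A * same)).sum() / (A * same).sum()
inter = (alpha * (A * ~same)).sum() / (A * ~same).sum()
print(f"mean attention weight: intra={intra:.4f}, inter={inter:.4f}")
```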

 

 

4、[LG] MLDemon: Deployment Monitoring for Machine Learning Systems

A Ginart, M Zhang, J Zou

[Stanford University & Harvard University]

MLDemon: post-deployment monitoring for machine learning systems. Post-deployment monitoring of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. This paper proposes MLDemon, a novel approach to ML deployment monitoring. MLDemon integrates unlabeled data with a small amount of on-demand labels to produce a real-time estimate of an ML model's current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert-supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon outperforms existing approaches; theoretical analysis shows it is minimax rate optimal for a broad class of distribution drifts.

Post-deployment monitoring of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML Deployment monitoring. MLDemon integrates both unlabeled data and a small amount of on-demand labels to produce a real-time estimate of the ML model’s current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon outperforms existing approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal for a broad class of distribution drifts.
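The abstract does not reveal MLDemon's estimator, so the following is only a schematic of the monitoring pattern it describes, not the paper's algorithm: track a cheap drift statistic on unlabeled inputs, and spend expert labels from a fixed budget when the performance estimate may be stale. The mean-shift statistic, threshold, smoothing, and batch interface are all placeholder assumptions:

```python
import numpy as np

def monitor_stream(stream, model, label_oracle, budget,
                   drift_threshold=0.5, ema=0.9):
    """Schematic deployment monitor (NOT MLDemon's actual estimator).

    Tracks a cheap drift statistic on unlabeled feature batches and
    spends expensive expert labels from a fixed budget only when the
    statistic suggests the performance estimate may be stale.
    """
    ref_mean, perf_est, spent = None, None, 0
    for t, X in enumerate(stream):                  # X: (batch, d), unlabeled
        mu = X.mean(axis=0)
        drift = 0.0 if ref_mean is None else float(np.linalg.norm(mu - ref_mean))
        if (perf_est is None or drift > drift_threshold) and spent < budget:
            y = label_oracle(X)                     # costly expert labels
            acc = float((model(X) == y).mean())     # model returns predictions
            perf_est = acc if perf_est is None else ema * perf_est + (1 - ema) * acc
            spent += len(y)
            ref_mean = mu                           # reset drift reference
        yield t, perf_est, spent
```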

 

 

5、[LG] Combining Modular Skills in Multitask Learning

E M. Ponti, A Sordoni, Y Bengio, S Reddy

[Mila & Microsoft Research Montréal]

Combining modular skills in multitask learning. A modular design encourages neural models to disentangle and recombine different facets of knowledge, so as to generalize more systematically to new tasks. This paper assumes each task is associated with a subset of latent discrete skills from a (potentially small) inventory; skills, in turn, correspond to parameter-efficient (sparse / low-rank) model parameterizations. By jointly learning these parameterizations and a task-skill allocation matrix, the network for each task is instantiated as the average of the parameters of its active skills. To favor non-trivial soft partitions of skills across tasks, the authors experiment with a series of inductive biases, such as an Indian Buffet Process prior and a two-speed learning rate. The latent-skill model is evaluated in two main settings: 1) multitask reinforcement learning for grounded instruction following on 8 levels of the BabyAI platform; and 2) few-shot adaptation of pre-trained text-to-text generative models on CrossFit, a benchmark comprising 160 NLP tasks. Compared with baselines whose parameters are fully shared, task-specific, or conditionally generated, i.e., where knowledge is entangled across tasks, the network's modular design substantially increases sample efficiency in reinforcement learning and few-shot generalization in supervised learning. The paper also shows how discrete skills aid interpretability, as they yield an explicit hierarchy of tasks.

A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks. In this work, we assume that each task is associated with a subset of latent discrete skills from a (potentially small) inventory. In turn, skills correspond to parameter-efficient (sparse / low-rank) model parameterisations. By jointly learning these and a task–skill allocation matrix, the network for each task is instantiated as the average of the parameters of active skills. To favour non-trivial soft partitions of skills across tasks, we experiment with a series of inductive biases, such as an Indian Buffet Process prior and a two-speed learning rate. We evaluate our latent-skill model on two main settings: 1) multitask reinforcement learning for grounded instruction following on 8 levels of the BabyAI platform; and 2) few-shot adaptation of pre-trained text-to-text generative models on CrossFit, a benchmark comprising 160 NLP tasks. We find that the modular design of a network significantly increases sample-efficiency in reinforcement learning and few-shot generalisation in supervised learning, compared to baselines with fully shared, task-specific, or conditionally generated parameters where knowledge is entangled across tasks. In addition, we show how discrete skills help interpretability, as they yield an explicit hierarchy of tasks.
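A minimal PyTorch sketch of the parameterization described above: a bank of S skill modules, a learnable task-skill logit matrix, and per-task weights formed as the soft average of selected skills. The sigmoid relaxation stands in for the paper's discrete allocation (trained with inductive biases such as the IBP prior), and the low-rank skill shape is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SkillAllocatedLinear(nn.Module):
    """Per-task linear layer whose weight is the average of its active
    skills' parameters (soft relaxation of the paper's discrete
    task-skill allocation matrix)."""
    def __init__(self, n_tasks: int, n_skills: int,
                 d_in: int, d_out: int, rank: int = 4):
        super().__init__()
        # Skill inventory: each skill is a low-rank parameterization U_s V_s.
        self.U = nn.Parameter(torch.randn(n_skills, d_out, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(n_skills, rank, d_in) * 0.02)
        # Learnable task-skill allocation logits (binary in the paper).
        self.alloc = nn.Parameter(torch.zeros(n_tasks, n_skills))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        z = torch.sigmoid(self.alloc[task_id])                  # soft skill mask
        skills = torch.einsum('sor,sri->soi', self.U, self.V)   # (S, d_out, d_in)
        W = torch.einsum('s,soi->oi', z, skills) / (z.sum() + 1e-8)
        return x @ W.T                                          # (batch, d_out)
```

For task t, calling the layer as layer(x, task_id=t) behaves like an ordinary linear map whose weight is tied to the skills that task selects, so a skill shared by several tasks receives gradients from all of them.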

 

 

A few more papers worth noting:

 

[LG] Counterfactual Phenotyping with Censored Time-to-Events

Counterfactual phenotyping with censored time-to-event outcomes

C Nagpal, M Goswami, K Dufendach, A Dubrawski

[CMU]

 

 

[CL] Controllable Natural Language Generation with Contrastive Prefixes

Controllable natural language generation with contrastive prefixes

J Qian, L Dong, Y Shen, F Wei, W Chen

[University of California, Santa Barbara & Microsoft Corporation]

 

 

[RO] A Collision-Free MPC for Whole-Body Dynamic Locomotion and Manipulation

A collision-free MPC for whole-body dynamic locomotion and manipulation

J Chiu, J Sleiman, M Mittal, F Farshidian, M Hutter

[ETH Zurich]

 

 

[CV] CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

CLIP-GEN: language-free training of a text-to-image generator with CLIP (no paired text-image data required)

Z Wang, W Liu, Q He, X Wu, Z Yi

[ByteDance Inc]

 

 
