LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics
Reposted from 爱可可爱生活
1、[LG] Shaking the foundations: delusions in sequence models for interaction and control
P A. Ortega, M Kunesch, G Delétang, T Genewein, J Grau-Moya, J Veness, J Buchli, J Degrave, B Piot, J Perolat, T Everitt, C Tallec, E Parisotto, T Erez, Y Chen, S Reed, M Hutter, N d Freitas, S Legg
[DeepMind]
The recent striking success of language models has reinvigorated machine learning research, and large sequence models such as Transformers are being applied across many domains. However, one important class of problems has remained relatively elusive: purposeful adaptive behavior. There is currently a widespread perception that sequence models "lack the understanding of the cause and effect of their actions", which leads them to draw incorrect inferences due to auto-suggestive delusions. This report explains where the mismatch originates and shows that it can be resolved by treating actions as causal interventions; imposing the correct causal constraints ensures that an agent learns about the task only through the effects of its actions. The report also shows that, in supervised learning, a system can be taught to condition on or intervene in data by training with factual and counterfactual error signals, respectively.
The recent phenomenal success of language models has reinvigorated machine learning research, and large sequence models such as transformers are being applied to a variety of domains. One important problem class that has remained relatively elusive however is purposeful adaptive behavior. Currently there is a common perception that sequence models “lack the understanding of the cause and effect of their actions” leading them to draw incorrect inferences due to auto-suggestive delusions. In this report we explain where this mismatch originates, and show that it can be resolved by treating actions as causal interventions. Finally, we show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals respectively.
https://weibo.com/1402400261/KEUl509CL
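The distinction between conditioning on an observed action and intervening with a self-generated one can be made concrete with a toy example. The sketch below is our illustration, not code from the report: data are modeled as coming from one of two hypothetical "expert" policies, and a Bayes update on the agent's own action shifts the belief over the latent expert (the auto-suggestive delusion), whereas treating that action as an intervention leaves the belief unchanged. All names (`experts`, `condition_on_action`, `intervene_with_action`) are ours.

```python
# A minimal sketch (not the paper's code) of conditioning vs. intervening,
# using a toy Bayesian mixture over two hypothetical "expert" policies.
import numpy as np

# Two experts, each a fixed distribution over 3 actions.
experts = np.array([
    [0.8, 0.1, 0.1],   # expert 0 strongly prefers action 0
    [0.1, 0.1, 0.8],   # expert 1 strongly prefers action 2
])
prior = np.array([0.5, 0.5])  # belief over which expert produced the data

def condition_on_action(belief, action):
    """Bayes update: valid when the action was OBSERVED in the data."""
    posterior = belief * experts[:, action]
    return posterior / posterior.sum()

def intervene_with_action(belief, action):
    """do(action): the agent chose the action itself, so it carries no
    evidence about the latent expert -- the belief must stay unchanged."""
    return belief

belief = prior.copy()
# The agent samples its own action from its predictive distribution ...
predictive = belief @ experts
own_action = int(np.argmax(predictive))

# Delusional update: the agent "talks itself into" believing it is expert 0.
deluded = condition_on_action(belief, own_action)
# Correct update: self-generated actions are causal interventions.
correct = intervene_with_action(belief, own_action)
print("after conditioning on own action :", deluded)   # belief shifts
print("after intervening with own action:", correct)   # belief unchanged
```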
2、[LG] Why Machine Learning Cannot Ignore Maximum Likelihood Estimation
M J. v d Laan, S Rose
[UC Berkeley & Stanford University]
Machine learning has been growing at an accelerating pace, with rising interest and publication counts across fields, including statistics, but predominantly in computer science. How can this vast literature be parsed for developments that exemplify the necessary rigor? How much of it incorporates foundational theory that allows for statistical inference? Which advances have the greatest potential for impact in practice? Many answers could be given to these questions. This paper argues that one essential idea is for machine learning to integrate maximum likelihood for estimating functional parameters such as prediction functions and conditional densities.
The growth of machine learning as a field has been accelerating with increasing interest and publications across fields, including statistics, but predominantly in computer science. How can we parse this vast literature for developments that exemplify the necessary rigor? How many of these manuscripts incorporate foundational theory to allow for statistical inference? Which advances have the greatest potential for impact in practice? One could posit many answers to these queries. Here, we assert that one essential idea is for machine learning to integrate maximum likelihood for estimation of functional parameters, such as prediction functions and conditional densities.
https://weibo.com/1402400261/KEUqF1pSj
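As a concrete reminder of what "maximum likelihood for estimation of functional parameters" looks like in code, here is a minimal, generic sketch, not taken from the paper: a conditional probability P(Y=1|X) is estimated by minimizing the negative Bernoulli log-likelihood with gradient descent. The data-generating weights and the optimizer settings are illustrative assumptions.

```python
# A minimal sketch of estimating a functional parameter -- here P(Y=1 | X) --
# by maximum likelihood, i.e. minimizing the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])      # illustrative ground truth
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = rng.binomial(1, p)

def neg_log_likelihood(w, X, y):
    # Bernoulli likelihood with a logit link
    z = X @ w
    return -np.sum(y * z - np.log1p(np.exp(z)))

w = np.zeros(d)
lr = 0.5
for _ in range(2000):
    z = X @ w
    grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y)   # gradient of the NLL
    w -= lr * grad / n                            # gradient descent step

print("true weights :", true_w)
print("MLE estimate :", np.round(w, 2))
print("final NLL/n  :", neg_log_likelihood(w, X, y) / n)
```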
3、[LG] VQ-GNN: A Universal Framework to Scale up Graph Neural Networks using Vector Quantization
M Ding, K Kong, J Li, C Zhu, J P Dickerson, F Huang, T Goldstein
[University of Maryland]
Most state-of-the-art graph neural networks (GNNs) can be defined as a form of graph convolution, realized by message passing between direct neighbors or beyond. To scale such GNNs to large graphs, various neighbor-, layer-, or subgraph-sampling techniques have been proposed to alleviate the "neighbor explosion" problem by considering only a small subset of the messages passed to the nodes in a mini-batch. However, sampling-based methods are hard to apply to GNNs that use many-hops-away or global context in each layer, show unstable performance across tasks and datasets, and do not speed up model inference. This paper proposes VQ-GNN, a principled and fundamentally different approach: a universal framework that scales up any convolution-based GNN with vector quantization (VQ) without compromising performance. In contrast to sampling-based techniques, the method effectively preserves all messages passed to a mini-batch of nodes by learning and updating a small number of quantized reference vectors of global node representations, applying VQ within each GNN layer. The framework avoids the GNN "neighbor explosion" problem by combining the quantized representations with a low-rank version of the graph convolution matrix; such a compact low-rank version of the gigantic convolution matrix is shown to be sufficient both theoretically and experimentally. A new approximated message-passing algorithm and a nontrivial back-propagation rule are designed for the framework. Experiments with various GNN backbones demonstrate the framework's scalability and competitive performance on large-graph node classification and link prediction benchmarks.
Most state-of-the-art Graph Neural Networks (GNNs) can be defined as a form of graph convolution which can be realized by message passing between direct neighbors or beyond. To scale such GNNs to large graphs, various neighbor-, layer-, or subgraph-sampling techniques are proposed to alleviate the “neighbor explosion” problem by considering only a small subset of messages passed to the nodes in a mini-batch. However, sampling-based methods are difficult to apply to GNNs that utilize many-hops-away or global context each layer, show unstable performance for different tasks and datasets, and do not speed up model inference. We propose a principled and fundamentally different approach, VQ-GNN, a universal framework to scale up any convolution-based GNNs using Vector Quantization (VQ) without compromising the performance. In contrast to sampling-based techniques, our approach can effectively preserve all the messages passed to a mini-batch of nodes by learning and updating a small number of quantized reference vectors of global node representations, using VQ within each GNN layer. Our framework avoids the “neighbor explosion” problem of GNNs using quantized representations combined with a low-rank version of the graph convolution matrix. We show that such a compact low-rank version of the gigantic convolution matrix is sufficient both theoretically and experimentally. In company with VQ, we design a novel approximated message passing algorithm and a nontrivial back-propagation rule for our framework. Experiments on various types of GNN backbones demonstrate the scalability and competitive performance of our framework on large-graph node classification and link prediction benchmarks.
https://weibo.com/1402400261/KEUtpaetg
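A rough sketch of the central mechanism, as we read the abstract: every node keeps an assignment to one of a small number of quantized reference vectors, and a mini-batch aggregates messages from out-of-batch neighbors through those codewords rather than their full features. This is our simplified illustration (mean aggregation, a random toy graph, no codebook/assignment learning loop), not the authors' implementation.

```python
# A minimal sketch (ours) of VQ-based message passing for a node mini-batch:
# out-of-batch neighbors contribute their quantized reference vectors.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim, num_codes = 1000, 16, 32
H = rng.normal(size=(num_nodes, dim))          # current node representations
codebook = rng.normal(size=(num_codes, dim))   # learned reference vectors
# code assignment: nearest codeword for every node (kept fresh during training)
assign = np.argmin(((H[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)

def aggregate(batch_nodes, neighbors, H, codebook, assign):
    """Mean-aggregate messages for each node in the batch. In-batch neighbors
    use exact features; out-of-batch neighbors use their codewords."""
    batch = set(batch_nodes)
    out = np.zeros((len(batch_nodes), H.shape[1]))
    for i, v in enumerate(batch_nodes):
        msgs = [H[u] if u in batch else codebook[assign[u]] for u in neighbors[v]]
        out[i] = np.mean(msgs, axis=0)
    return out

# toy graph: each node gets 5 random neighbors
neighbors = {v: rng.integers(0, num_nodes, size=5) for v in range(num_nodes)}
batch_nodes = list(range(64))                  # a mini-batch of nodes
agg = aggregate(batch_nodes, neighbors, H, codebook, assign)
print(agg.shape)  # (64, 16): messages gathered without touching all of H
```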
4、[CL] Hierarchical Transformers Are More Efficient Language Models
P Nawrot, S Tworkowski, M Tyrolski, Ł Kaiser, Y Wu, C Szegedy, H Michalewski
[University of Warsaw & OpenAI & Google Research]
Transformer models yield impressive results on many NLP and sequence-modeling tasks. Notably, Transformers can handle long sequences, which allows them to produce long coherent outputs: full paragraphs generated by GPT-3 or well-structured images generated by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. This paper postulates that an explicit hierarchical architecture is the key to Transformers that handle long sequences efficiently. To verify this claim, the authors first study different ways of downsampling and upsampling activations in Transformers so as to make them hierarchical, and use the best-performing upsampling and downsampling layers to build Hourglass, a hierarchical Transformer language model. Given the same amount of computation, Hourglass improves on the Transformer baseline and can match Transformer results more efficiently. In particular, Hourglass sets a new state of the art for Transformer models on the ImageNet32 generation task and improves language-modeling efficiency on the widely studied enwik8 benchmark.
Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass, a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. In particular, Hourglass sets new state-of-the-art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.
https://weibo.com/1402400261/KEUxsE3dw
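To make the hourglass idea concrete, here is a rough PyTorch sketch (ours, not the paper's code): a full-resolution Transformer layer, average-pool downsampling of the activations by a fixed shortening factor, a layer on the shortened sequence, then upsampling by repetition and a residual merge before a final full-resolution layer. The layer counts, the pooling and upsampling choices, and the omission of autoregressive masking are simplifications.

```python
# A rough sketch of the hourglass idea: shorten the activation sequence,
# process it cheaply, then upsample and merge back at full resolution.
import torch
import torch.nn as nn

class ToyHourglass(nn.Module):
    def __init__(self, d_model=64, nhead=4, shorten=4):
        super().__init__()
        self.shorten = shorten
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.pre = layer()      # full-resolution layer
        self.mid = layer()      # operates on the shortened sequence
        self.post = layer()     # full-resolution layer after upsampling

    def forward(self, x):                       # x: (batch, seq, d_model)
        x = self.pre(x)
        skip = x
        # downsample: average-pool groups of `shorten` consecutive positions
        b, s, d = x.shape
        short = x.reshape(b, s // self.shorten, self.shorten, d).mean(dim=2)
        short = self.mid(short)                 # cheaper: the sequence is 4x shorter
        # upsample: repeat each pooled position back to full length
        up = short.repeat_interleave(self.shorten, dim=1)
        # note: causal masking for language modeling is omitted in this sketch
        return self.post(up + skip)             # merge with the residual

x = torch.randn(2, 64, 64)                      # seq length divisible by `shorten`
print(ToyHourglass()(x).shape)                  # torch.Size([2, 64, 64])
```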
5、[LG] FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling
B Zhang, Y Wang, W Hou, H Wu, J Wang, M Okumura, T Shinozaki
[Tokyo Institute of Technology & Microsoft & Microsoft Research Asia]
The recently proposed FixMatch achieved state-of-the-art results on most semi-supervised learning (SSL) benchmarks. However, like other modern SSL algorithms, FixMatch uses a predefined constant threshold for all classes to select the unlabeled data that contribute to training, and therefore fails to account for the different learning statuses and difficulties of different classes. To address this, the paper proposes Curriculum Pseudo Labeling (CPL), a curriculum-learning approach that leverages unlabeled data according to the model's learning status. The core of CPL is to flexibly adjust the per-class thresholds at each time step so that informative unlabeled data and their pseudo labels can pass through; CPL introduces no extra parameters or computation (forward or backward propagation). Applying CPL to FixMatch yields the improved algorithm FlexMatch. FlexMatch achieves state-of-the-art performance on a variety of SSL benchmarks, and is especially strong when labeled data are extremely limited or the task is challenging. For example, with only 4 labels per class, FlexMatch outperforms FixMatch by 14.32% and 24.55% on CIFAR-100 and STL-10, respectively. CPL also greatly speeds up convergence: FlexMatch can reach even better performance using only 1/5 of FixMatch's training time. Moreover, CPL can be easily adapted to other SSL algorithms and significantly improves their performance.
The recently proposed FixMatch achieved state-of-the-art results on most semi-supervised learning (SSL) benchmarks. However, like other modern SSL algorithms, FixMatch uses a pre-defined constant threshold for all classes to select unlabeled data that contribute to the training, thus failing to consider different learning status and learning difficulties of different classes. To address this issue, we propose Curriculum Pseudo Labeling (CPL), a curriculum learning approach to leverage unlabeled data according to the model’s learning status. The core of CPL is to flexibly adjust thresholds for different classes at each time step to let pass informative unlabeled data and their pseudo labels. CPL does not introduce additional parameters or computations (forward or backward propagation). We apply CPL to FixMatch and call our improved algorithm FlexMatch. FlexMatch achieves state-of-the-art performance on a variety of SSL benchmarks, with especially strong performances when the labeled data are extremely limited or when the task is challenging. For example, FlexMatch outperforms FixMatch by 14.32% and 24.55% on CIFAR-100 and STL-10 datasets respectively, when there are only 4 labels per class. CPL also significantly boosts the convergence speed, e.g., FlexMatch can use only 1/5 training time of FixMatch to achieve even better performance. Furthermore, we show that CPL can be easily adapted to other SSL algorithms and remarkably improve their performances. We open source our code at https://github.com/TorchSSL/TorchSSL.
https://weibo.com/1402400261/KEUA3rdTg
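A minimal sketch of curriculum pseudo labeling as described above (our simplification, not the released TorchSSL code): the number of unlabeled samples each class already passes at the fixed threshold is used as a proxy for that class's learning status, and the per-class threshold is scaled accordingly before masking pseudo-labels. The linear mapping from status to threshold is an illustrative choice.

```python
# A minimal sketch of per-class flexible thresholds: well-learned classes keep
# a high bar while harder classes admit more pseudo-labels.
import numpy as np

rng = np.random.default_rng(0)
num_classes, base_threshold = 5, 0.95
# model predictions on a batch of unlabeled data: (batch, num_classes)
probs = rng.dirichlet(alpha=np.ones(num_classes) * 0.3, size=256)
conf = probs.max(axis=1)            # confidence of the hard pseudo-label
pseudo = probs.argmax(axis=1)       # the pseudo-label itself

# learning status: how many samples each class already passes at the fixed bar
counts = np.array([np.sum((pseudo == c) & (conf > base_threshold))
                   for c in range(num_classes)])
status = counts / max(counts.max(), 1)          # normalized to [0, 1]

# flexible per-class thresholds (a simple linear mapping of the status)
flex_threshold = base_threshold * status
mask = conf > flex_threshold[pseudo]            # which samples get used

print("per-class thresholds  :", np.round(flex_threshold, 2))
print("selected pseudo-labels:", int(mask.sum()), "of", len(mask))
```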
A few more papers worth noting:
[LG] Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations
W Xu, R T.Q. Chen, X Li, D Duvenaud
[University of Toronto & Stanford University]
https://weibo.com/1402400261/KEUDpETDM
[CL] SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
A Bapna, Y Chung, N Wu, A Gulati, Y Jia, J H. Clark, M Johnson, J Riesa, A Conneau, Y Zhang
[Google Research]
https://weibo.com/1402400261/KEUFb5vU9
[AS] Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations
H Choi, J Lee, W Kim, J H Lee, H Heo, K Lee
[Seoul National University & Supertone Inc]
https://weibo.com/1402400261/KEUGTnzKP
[CV] NeRV: Neural Representations for Videos
H Chen, B He, H Wang, Y Ren, S Lim, A Shrivastava
[University of Maryland & Facebook AI]
https://weibo.com/1402400261/KEUIM6Rxk