LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics
Reposted from 爱可可爱生活
1. [LG] Message passing all the way up
P Veličković
[DeepMind]
Augmented message passing for GNNs. The message passing framework underpins the immense success enjoyed by graph neural networks (GNNs) in recent years. Despite its elegance, there are many problems it provably cannot solve over a given input graph. This has spurred a wave of follow-up research on going "beyond message passing", building GNNs free of these limitations, a phrase that has become ubiquitous in everyday discussion. But have these methods truly moved beyond message passing? This position paper argues that the term is dangerous, especially when teaching graph representation learning to newcomers. Any function of interest we might want to compute over graphs can be expressed with pairwise message passing, just over a potentially modified graph, and most practical implementations subtly perform exactly this trick anyway. Hoping to initiate a productive discussion, the paper proposes replacing "beyond message passing" with the milder term "augmented message passing".
The message passing framework is the foundation of the immense success enjoyed by graph neural networks (GNNs) in recent years. In spite of its elegance, there exist many problems it provably cannot solve over given input graphs. This has led to a surge of research on going "beyond message passing", building GNNs which do not suffer from those limitations, a term which has become ubiquitous in regular discourse. However, have those methods truly moved beyond message passing? In this position paper, I argue about the dangers of using this term, especially when teaching graph representation learning to newcomers. I show that any function of interest we want to compute over graphs can, in all likelihood, be expressed using pairwise message passing, just over a potentially modified graph, and argue how most practical implementations subtly do this kind of trick anyway. Hoping to initiate a productive discussion, I propose replacing "beyond message passing" with a more tame term, "augmented message passing".
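To make the argument concrete, here is a minimal sketch, not taken from the paper, of pairwise message passing together with one popular "augmentation" (a global master node) expressed purely as a modification of the input graph. All function and variable names are illustrative:

```python
# A minimal sketch, not from the paper: pairwise message passing plus one
# popular "augmentation" (a global master node) expressed purely as a
# modification of the input graph. All names here are illustrative.
import numpy as np

def message_passing_layer(h, edges, W_msg, W_upd):
    """One round of sum-aggregated pairwise message passing.
    h: (n, d) node features; edges: list of (src, dst) pairs."""
    agg = np.zeros_like(h)
    for s, t in edges:                  # message m_{s->t} = h_s @ W_msg
        agg[t] += h[s] @ W_msg
    return np.tanh(h @ W_upd + agg)     # update combines self-features and messages

def add_master_node(h, edges):
    """Graph modification: connect every node to one new global node, so
    "global" computation is still ordinary pairwise message passing."""
    n = h.shape[0]
    h_aug = np.vstack([h, h.mean(axis=0, keepdims=True)])  # init master node
    edges_aug = edges + [(i, n) for i in range(n)] + [(n, i) for i in range(n)]
    return h_aug, edges_aug
```

Running `message_passing_layer` on the augmented graph gives every node access to a global summary within two rounds, yet nothing beyond pairwise messages is ever computed.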
2. [CV] Hierarchical Perceiver
J Carreira, S Koppula, D Zoran, A Recasens, C Ionescu, O Henaff, E Shelhamer, R Arandjelovic, M Botvinick, O Vinyals, K Simonyan, A Zisserman, A Jaegle
[DeepMind]
Hierarchical Perceiver. General perception systems such as Perceivers can process arbitrary modalities in any combination and handle up to a few hundred thousand inputs, achieving this generality by relying exclusively on global attention operations. This, however, prevents them from scaling to the input sizes required to process raw high-resolution images or video. This paper shows that some degree of locality can be reintroduced into these models, greatly improving their efficiency while preserving their generality. To scale the models further, a self-supervised method is introduced that learns dense low-dimensional positional embeddings for very large signals. The resulting model is called the Hierarchical Perceiver (HiP). HiP retains the ability to process arbitrary modalities, but now at higher resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on the ImageNet, AudioSet and PASCAL VOC datasets.
General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by exclusively using global attention operations. This however hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). HiP retains the ability to process arbitrary modalities, but now at higher resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on the ImageNet, Audioset and PASCAL VOC datasets.
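A rough sketch of the locality idea follows, under my own assumptions about the mechanism (the real HiP block uses multi-head attention, MLPs and learned latents; the group-then-cross-attend structure is the point being illustrated):

```python
# A rough sketch, under my own assumptions about the mechanism, of the
# locality idea: split the input into groups, let each group cross-attend
# into its own small latent array, and pass the concatenated (much shorter)
# latents to the next level. Single-head attention, no projections.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(latents, inputs):
    """(m, d) latents attend to (k, d) inputs and absorb their content."""
    attn = softmax(latents @ inputs.T / np.sqrt(inputs.shape[-1]))
    return latents + attn @ inputs

def hip_block(x, num_groups, latents_per_group, rng):
    """x: (n, d) flattened input, n divisible by num_groups. Attention is
    local to each group, so cost scales with group size, not with n."""
    d = x.shape[1]
    groups = np.split(x, num_groups)            # locality: disjoint chunks
    out = [cross_attend(rng.standard_normal((latents_per_group, d)), g)
           for g in groups]
    return np.concatenate(out)                  # next level sees fewer tokens
```

Stacking such blocks with shrinking group counts yields the hierarchy: each level compresses its input, so global attention only ever happens over short latent arrays.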
3. [LG] Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces
A Terenin
[Imperial College London]
Gaussian processes and statistical decision-making in non-Euclidean spaces. Bayesian learning with Gaussian processes provides a foundational framework for making decisions that balance what is known against what could be learned by gathering data. This dissertation develops techniques that broaden the applicability of Gaussian processes, in two main ways. First, it develops pathwise conditioning techniques for Gaussian processes, which allow a posterior random function to be expressed as a prior random function plus a dependent update term. From this viewpoint it introduces a large class of efficient approximations that can be sampled once in advance and then evaluated at arbitrary locations without any subsequent randomness. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Second, it develops a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs, deriving fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on both. Building on these ideas, it describes a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained with standard computational methods. Taken together, these contributions make Gaussian processes easier to work with and allow them to be used effectively and in a principled way across a wider class of domains, which in turn opens the possibility of applying Gaussian processes to novel decision-making settings.
Bayesian learning using Gaussian processes provides a foundational framework for making decisions in a manner that balances what is known with what could be learned by gathering data. In this dissertation, we develop techniques for broadening the applicability of Gaussian processes. This is done in two ways. Firstly, we develop pathwise conditioning techniques for Gaussian processes, which allow one to express posterior random functions as prior random functions plus a dependent update term. We introduce a wide class of efficient approximations built from this viewpoint, which can be randomly sampled once in advance, and evaluated at arbitrary locations without any subsequent stochasticity. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Secondly, we develop a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs. We derive fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on Riemannian manifolds and graphs. Building on these ideas, we describe a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained using standard computational methods. In total, these contributions make Gaussian processes easier to work with and allow them to be used within a wider class of domains in an effective and principled manner. This, in turn, makes it possible to potentially apply Gaussian processes to novel decision-making settings.
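The "prior random function plus a dependent update term" decomposition is pathwise conditioning, also known as Matheron's rule: (f | y)(x) = f(x) + K(x, X)(K(X, X) + σ²I)⁻¹(y − f(X) − ε), where f is a prior draw and ε is a noise draw. A minimal NumPy sketch for an RBF kernel follows; for simplicity the prior is drawn jointly on a fixed set of points rather than via the efficient approximations the dissertation develops:

```python
# A minimal sketch of pathwise conditioning (Matheron's rule) for an RBF
# kernel. For simplicity the prior is drawn jointly on a fixed set of points
# rather than via the efficient approximations the dissertation develops.
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def posterior_sample(x_star, X, y, noise=0.1, seed=0):
    """One posterior draw at x_star: prior draw + dependent update term."""
    rng = np.random.default_rng(seed)
    pts = np.concatenate([x_star, X])
    K = rbf(pts, pts) + 1e-9 * np.eye(len(pts))
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(pts))  # prior draw
    f_star, f_X = f[:len(x_star)], f[len(x_star):]
    eps = noise * rng.standard_normal(len(X))                  # noise draw
    update = rbf(x_star, X) @ np.linalg.solve(
        rbf(X, X) + noise ** 2 * np.eye(len(X)), y - f_X - eps)
    return f_star + update
```

With an efficiently sampled prior (as in the dissertation), the randomness is fixed once up front and the resulting function can be evaluated anywhere with no further sampling, which is what makes the approach attractive for decision-making loops.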
4. [AS] It's Raw! Audio Generation with State-Space Models
K Goel, A Gu, C Donahue, C Ré
[Stanford University]
Audio generation with state-space models. Developing architectures suitable for modeling raw audio is a challenging problem because of the high sampling rates of audio waveforms. Standard sequence modeling approaches such as RNNs and CNNs have previously been tailored to the demands of audio, but the resulting architectures make undesirable computational trade-offs and struggle to model waveforms effectively. This paper proposes SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long-sequence modeling. The authors find that S4 can be unstable during autoregressive generation and give a simple improvement to its parameterization by drawing a connection to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting, and also improves non-autoregressive generation when used as the backbone architecture of a diffusion model. Compared with prior autoregressive architectures, SaShiMi generates piano and speech waveforms that humans judge to be more musical and coherent, achieving mean opinion scores 2× better than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and on both training and inference speed while using 3× fewer parameters.
Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone architecture for a diffusion model. Compared to prior architectures in the autoregressive generation setting, SaShiMi generates piano and speech waveforms which humans find more musical and coherent respectively, e.g. 2× better mean opinion scores than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and speed at both training and inference even when using 3× fewer parameters.
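A Hurwitz matrix is one whose eigenvalues all have negative real part. The sketch below, my own illustration rather than SaShiMi's actual code, shows why constraining a diagonal state matrix to be Hurwitz keeps autoregressive state-space generation stable: after discretization, |exp(λΔt)| < 1 whenever Re(λ) < 0, so the recurrence contracts and the state cannot blow up:

```python
# My own illustration, not SaShiMi's code: a diagonal state-space model
# whose state matrix is Hurwitz by construction (Re(lambda) < 0), so the
# discretized recurrence contracts and generation cannot blow up.
import numpy as np

def hurwitz_diag(log_neg_re, im):
    """Diagonal A with Re(lambda) = -exp(log_neg_re) < 0 for any
    unconstrained real parameters, i.e. A is always Hurwitz."""
    return -np.exp(log_neg_re) + 1j * im

def run_ssm(u, log_neg_re, im, B, C, dt=1e-2):
    """x' = A x + B u, y = C x, zero-order-hold discretization of a
    diagonal A; |exp(lambda * dt)| < 1 whenever Re(lambda) < 0."""
    A = hurwitz_diag(log_neg_re, im)
    Ad = np.exp(A * dt)                 # elementwise = diagonal matrix exp
    Bd = (Ad - 1.0) / A * B             # ZOH-discretized input map
    x = np.zeros_like(A)
    ys = []
    for u_k in u:                       # the recurrent (generation-time) view
        x = Ad * x + Bd * u_k
        ys.append((C * x).sum().real)
    return np.array(ys)
```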
5. [LG] Mixture-of-Experts with Expert Choice Routing
Y Zhou, T Lei, H Liu, N Du, Y Huang, V Zhao, A Dai, Z Chen, Q Le, J Laudon
[Google]
Expert choice routing for Mixture-of-Experts. This paper proposes a new routing method for sparsely activated Mixture-of-Experts (MoE) models that addresses the load imbalance and expert under-utilization of conventional MoE approaches, and allows each token to be handled by a variable number of experts. Sparsely activated MoE models let the parameter count grow substantially while keeping the computation per token or per sample unchanged. However, a poor expert routing strategy (e.g. one that causes load imbalance) can leave certain experts under-trained, making them under- or over-specialized. Prior work uses a top-k function to assign a fixed number of experts to each token, regardless of the relative importance of different tokens. To address this, the paper proposes a heterogeneous mixture-of-experts with an expert choice method: instead of letting tokens select the top-k experts, experts select the top-k tokens. Each token can then be routed to a variable number of experts, while each expert has a fixed bucket size. Studying pre-training speed under the same computational resources as the Switch Transformer top-1 and GShard top-2 gating of prior work, the proposed method shortens training convergence time by more than 2×. At the same computational cost, it achieves higher performance when fine-tuned on 11 selected tasks from the GLUE and SuperGLUE benchmarks, and at a smaller activation cost it outperforms the dense T5 model on 7 of the 11 tasks.
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
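A minimal sketch of expert choice routing as the abstract describes it; the names, the softmax axis, and the bucket-size formula k = n × capacity_factor / num_experts are my reading of the setup, not verbatim from the paper:

```python
# A minimal sketch of expert choice routing as described in the abstract.
# Names, the softmax axis and the bucket-size formula are my reading of the
# setup, not verbatim from the paper.
import numpy as np

def expert_choice_route(X, W_g, capacity_factor=2.0):
    """X: (n, d) tokens, W_g: (d, e) gating weights. Each expert (not each
    token) performs the top-k selection, so load balance holds by design."""
    n, e = X.shape[0], W_g.shape[1]
    logits = X @ W_g
    S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # affinities
    k = int(n * capacity_factor / e)        # fixed bucket size per expert
    routes = []
    for j in range(e):                      # expert j picks its top-k tokens
        idx = np.argsort(-S[:, j])[:k]
        routes.append((idx, S[idx, j]))
    return routes  # a token may land in zero, one, or several buckets
```

The variable number of experts per token falls out for free: a token the gate scores highly lands in many buckets, an unimportant one in none.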
Several other papers worth noting:
[LG] Explicit Regularization via Regularizer Mirror Descent
N Azizan, S Lale, B Hassibi
[MIT & California Institute of Technology]
[LG] StickyLand: Breaking the Linear Presentation of Computational Notebooks
Z J. Wang, K Dai, W. K Edwards
[Georgia Tech]
[LG] Generating Synthetic Mobility Networks with Generative Adversarial Networks
G Mauro, M Luca, A Longa, B Lepri, L Pappalardo
[National Research Council (ISTI-CNR) & Free University of Bolzano & University of Trento]
[CV] Movies2Scenes: Learning Scene Representations Using Movie Similarities
S Chen, X Hao, X Nie, R Hamid
[Amazon Prime Video]