LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: brain imaging generation with latent diffusion models; do residual neural networks discretize neural ordinary differential equations; Hydra Attention, efficient attention with many heads; autoregressive latent video prediction with a high-fidelity image generator; when do models "understand" language; Langevin autoencoders for learning deep latent variable models; extended intelligence; test-time prompt tuning for zero-shot generalization in vision-language models; batch Bayesian optimization via particle gradient flows

1、[CV] Brain Imaging Generation with Latent Diffusion Models

W H. L. Pinaya, P Tudosiu, J Dafflon...
[King’s College London & National Institute of Mental Health & University College London]
Brain imaging generation with latent diffusion models. Deep neural networks have brought remarkable breakthroughs to medical image analysis, but because they are data-hungry, the modest dataset sizes typical of medical imaging projects may keep them from reaching their full potential. Generating synthetic data offers a promising alternative: it can complement training datasets and enable medical image research at a larger scale. Diffusion models have recently caught the attention of the computer vision community by producing photorealistic synthetic images. This paper explores using latent diffusion models to generate synthetic images from high-resolution 3D brain images. T1w MRI images from the UK Biobank dataset (N=31,740) are used to train the models to learn the probabilistic distribution of brain images, conditioned on covariates such as age, sex, and brain-structure volumes. The trained models produce realistic data, and the conditioning variables can be used to control the generation effectively. In addition, the authors release a synthetic dataset of 100,000 brain images openly to the scientific community.

Deep neural networks have brought remarkable breakthroughs in medical image analysis. However, due to their data-hungry nature, the modest dataset sizes in medical imaging projects might be hindering their full potential. Generating synthetic data provides a promising alternative, allowing us to complement training datasets and conduct medical image research at a larger scale. Diffusion models have recently caught the attention of the computer vision community by producing photorealistic synthetic images. In this study, we explore using Latent Diffusion Models to generate synthetic images from high-resolution 3D brain images. We used T1w MRI images from the UK Biobank dataset (N=31,740) to train our models to learn about the probabilistic distribution of brain images, conditioned on covariables, such as age, sex, and brain structure volumes. We found that our models created realistic data, and we could use the conditioning variables to control the data generation effectively. Besides that, we created a synthetic dataset with 100,000 brain images and made it openly available to the scientific community.

https://arxiv.org/abs/2209.07162
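
For intuition, generation in such a model boils down to repeated conditional denoising steps. A minimal sketch of one reverse step, assuming a DDPM-style noise-prediction interface (the `eps_model` signature, schedule tensors, and variance choice below are illustrative assumptions, not the authors' code; the paper actually runs this in the latent space of a compression model):

```python
import torch

def ddpm_reverse_step(eps_model, x_t, t, cond, alphas, alphas_bar):
    """One reverse (denoising) step of a conditional diffusion model.

    eps_model predicts the injected noise from the current latent x_t,
    the integer timestep t, and a conditioning vector cond (e.g. age,
    sex, brain-structure volumes). Hypothetical interface.
    """
    a_t, ab_t = alphas[t], alphas_bar[t]
    eps = eps_model(x_t, t, cond)
    mean = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(a_t)
    if t == 0:
        return mean                        # last step: return the mean
    sigma = torch.sqrt(1.0 - a_t)          # one common variance choice
    return mean + sigma * torch.randn_like(x_t)
```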

 

2、[LG] Do Residual Neural Networks discretize Neural Ordinary Differential Equations?

M E. Sander, P Ablin, G Peyré
[ENS & Université Paris-Dauphine]
Do residual neural networks discretize neural ordinary differential equations? Neural ODEs are the continuous analog of residual neural networks (ResNets). This paper investigates whether the discrete dynamics defined by a ResNet are close to the continuous dynamics of a Neural ODE. It first quantifies the distance between the ResNet's hidden-state trajectory and the solution of its corresponding Neural ODE. The bound is tight and, on the negative side, does not go to 0 as the depth N grows if the residual functions are not smooth with depth. On the positive side, the paper shows that this smoothness is preserved by gradient descent for ResNets with linear residual functions and a small enough initial loss, which guarantees an implicit regularization towards a limit Neural ODE at rate 1/N, uniformly over depth and optimization time. As a byproduct of the analysis, the paper considers training a ResNet with a memory-free discrete adjoint method that recovers activations on the fly through a backward pass of the network, and shows that this method provably succeeds at large depth if the residual functions are Lipschitz in the input. Heun's method, a second-order ODE integration scheme, yields better gradient estimates for the adjoint method when the residual functions are smooth with depth. Experiments confirm that the adjoint method succeeds at large depth and that Heun's method needs fewer layers to succeed. Finally, the adjoint method is used successfully to fine-tune very deep ResNets without memory consumption in the residual layers.

Neural Ordinary Differential Equations (Neural ODEs) are the continuous analog of Residual Neural Networks (ResNets). We investigate whether the discrete dynamics defined by a ResNet are close to the continuous one of a Neural ODE. We first quantify the distance between the ResNet’s hidden state trajectory and the solution of its corresponding Neural ODE. Our bound is tight and, on the negative side, does not go to 0 with depth N if the residual functions are not smooth with depth. On the positive side, we show that this smoothness is preserved by gradient descent for a ResNet with linear residual functions and small enough initial loss. It ensures an implicit regularization towards a limit Neural ODE at rate 1/N, uniformly with depth and optimization time. As a byproduct of our analysis, we consider the use of a memory-free discrete adjoint method to train a ResNet by recovering the activations on the fly through a backward pass of the network, and show that this method theoretically succeeds at large depth if the residual functions are Lipschitz with the input. We then show that Heun’s method, a second order ODE integration scheme, allows for better gradient estimation with the adjoint method when the residual functions are smooth with depth. We experimentally validate that our adjoint method succeeds at large depth, and that Heun’s method needs fewer layers to succeed. We finally use the adjoint method successfully for fine-tuning very deep ResNets without memory consumption in the residual layers.

https://arxiv.org/abs/2205.14612
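
The correspondence the paper studies is concrete: a residual block x + h·f(x) is one explicit Euler step of the ODE ẋ = f(x). A short sketch of the three ingredients named above, under the assumption that h·f is a contraction so the fixed-point inversion converges (matching the Lipschitz condition in the abstract); this is an illustration, not the authors' implementation:

```python
def euler_step(f, x, h):
    # Explicit Euler: exactly the update x + h*f(x) a residual block applies.
    return x + h * f(x)

def heun_step(f, x, h):
    # Heun's method (2nd order): average the slopes at both ends of the step.
    x_pred = x + h * f(x)
    return x + 0.5 * h * (f(x) + f(x_pred))

def reconstruct_activation(f, x_next, h, n_iter=5):
    # Memory-free backward pass: recover x from x_next = x + h*f(x) by
    # fixed-point iteration, so activations need not be stored forward.
    x = x_next
    for _ in range(n_iter):
        x = x_next - h * f(x)
    return x
```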

 

3、[CV] Hydra Attention: Efficient Attention with Many Heads

D Bolya, C Fu, X Dai, P Zhang, J Hoffman
[Georgia Tech & Meta AI]
Hydra Attention: efficient attention with many heads. While Transformers have begun to dominate many vision tasks, applying them to large images remains computationally difficult. A large reason is that self-attention scales quadratically with the number of tokens, which in turn scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the network's total computation is spent solely on creating and applying attention matrices. This paper takes a step toward solving the problem with Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, the efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features, with no hidden constants, making it faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.

While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn, scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.

https://arxiv.org/abs/2209.07484
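
One way to read the construction: with one head per feature and a decomposable cosine-similarity kernel, the N×N attention matrix is never formed; the keys and values collapse into a single global vector that gates each query. A sketch under that reading (tensor layout and normalization details are assumptions, not the reference implementation):

```python
import torch

def hydra_attention(q, k, v):
    """Hydra-style attention, linear in both tokens and features.

    q, k, v: (batch, tokens, features). With as many heads as features,
    attention reduces to an O(N*d) gating against one global summary.
    """
    q = q / q.norm(dim=-1, keepdim=True)   # cosine kernel: normalize queries
    k = k / k.norm(dim=-1, keepdim=True)   # ...and keys, per token
    kv = (k * v).sum(dim=1, keepdim=True)  # global key-value summary (B,1,d)
    return q * kv                          # broadcast gate over all tokens
```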

 

4、[CV] HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

Y Seo, K Lee, F Liu, S James, P Abbeel
[KAIST & UC Berkeley]
HARP: autoregressive latent video prediction with a high-fidelity image generator. Video prediction is an important yet challenging problem, burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be powerful video predictors by separating the problem into two sub-problems: pre-training an image generator model, then learning an autoregressive prediction model in the image generator's latent space. However, successfully generating high-fidelity, high-resolution videos had yet to be demonstrated. This paper investigates how to train an autoregressive latent video prediction model that predicts high-fidelity future frames with minimal modification to existing models and produces high-resolution (256x256) videos. It scales up prior models by pairing a high-fidelity image generator (VQ-GAN) with a causal Transformer, and introduces the additional techniques of top-k sampling and data augmentation to further improve prediction quality. Despite its simplicity, the proposed method achieves performance competitive with state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex, large-scale datasets.

Video prediction is an important yet challenging problem; burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool, by separating the video prediction into two sub-problems: pre-training an image generator model, followed by learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos has yet to be seen. In this work, we investigate how to train an autoregressive latent video prediction model capable of predicting high-fidelity future frames with minimal modification to existing models, and produce high-resolution (256x256) videos. Specifically, we scale up prior models by employing a high-fidelity image generator (VQ-GAN) with a causal transformer model, and introduce additional techniques of top-k sampling and data augmentation to further improve video prediction quality. Despite the simplicity, the proposed method achieves competitive performance to state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex and large-scale datasets. Videos are available at https://sites.google.com/view/harp-videos/home.

https://arxiv.org/abs/2209.07143
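
Of the extra techniques mentioned, top-k sampling is the most self-contained: when the causal transformer emits a distribution over the VQ-GAN codebook for the next latent code, only the k most likely entries are kept before sampling. A generic sketch (the default k and temperature are placeholders, not the paper's settings):

```python
import torch

def sample_top_k(logits, k=100, temperature=1.0):
    """Sample next latent codes from the k most likely codebook entries.

    logits: (..., vocab_size) scores over the image generator's codebook.
    Returns integer code indices of shape logits.shape[:-1].
    """
    vals, idx = logits.topk(k, dim=-1)              # keep top-k logits
    probs = torch.softmax(vals / temperature, -1)   # renormalize over them
    choice = torch.multinomial(probs.reshape(-1, k), 1)
    picked = idx.reshape(-1, k).gather(-1, choice)
    return picked.reshape(logits.shape[:-1])
```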

 

5、[CL] Machine Reading, Fast and Slow: When Do Models "Understand" Language?

S R Choudhury, A Rogers, I Augenstein
[University of Michigan & University of Copenhagen]
Machine reading, fast and slow: when do models "understand" language? Two of the most fundamental challenges in natural language understanding (NLU) at present are: (a) how to establish whether deep-learning-based models score highly on NLU benchmarks for the "right" reasons; and (b) understanding what those reasons would even be. This paper studies the behavior of reading comprehension models with respect to two linguistic "skills": coreference resolution and comparison. It proposes a definition of the reasoning steps expected from a system that "reads slowly", and compares them with the behavior of five BERT-family models of various sizes, observed through saliency scores and counterfactual explanations. For comparison (but not coreference), systems based on larger encoders are more likely to rely on the "right" information, but even they struggle to generalize, suggesting that they still learn specific lexical patterns rather than general principles of comparison.

Two of the most fundamental challenges in Natural Language Understanding (NLU) at present are: (a) how to establish whether deep learning-based models score highly on NLU benchmarks for the ‘right’ reasons; and (b) to understand what those reasons would even be. We investigate the behavior of reading comprehension models with respect to two linguistic ‘skills’: coreference resolution and comparison. We propose a definition for the reasoning steps expected from a system that would be ‘reading slowly’, and compare that with the behavior of five models of the BERT family of various sizes, observed through saliency scores and counterfactual explanations. We find that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the ‘right’ information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than the general principles of comparison.

https://arxiv.org/abs/2209.07430
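
The saliency scores referred to here can be as simple as gradient-times-input over token embeddings. A sketch assuming a HuggingFace-style model that accepts `inputs_embeds` and a user-supplied `target_fn` reducing the output to a scalar (e.g. the predicted answer-span logit); these interface choices are assumptions, not the paper's exact setup:

```python
import torch

def saliency_scores(model, embeddings, target_fn):
    """Gradient-x-input token saliency.

    embeddings: (batch, tokens, dim) input embeddings. target_fn maps
    the model output to a scalar; the gradient of that scalar is
    attributed back to tokens (higher score = more influential token).
    """
    embeddings = embeddings.detach().clone().requires_grad_(True)
    score = target_fn(model(inputs_embeds=embeddings))
    score.backward()
    return (embeddings.grad * embeddings).sum(dim=-1).abs()
```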

 

A few more papers worth noting:

[LG] Langevin Autoencoders for Learning Deep Latent Variable Models

S Taniguchi, Y Iwasawa, W Kumagai, Y Matsuo
[The University of Tokyo]
https://arxiv.org/abs/2209.07036

 

[LG] Extended Intelligence

D L Barack, A Jaegle
[University of Pennsylvania & DeepMind]
https://arxiv.org/abs/2209.07449

 

[CV] Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models

M Shu, W Nie, D Huang, Z Yu, T Goldstein, A Anandkumar, C Xiao
[University of Maryland & NVIDIA]
https://arxiv.org/abs/2209.07511

 

[LG] Batch Bayesian Optimization via Particle Gradient Flows

E Crovini, S L. Cotter, K Zygalakis, A B. Duncan
[Imperial College London & University of Manchester & University of Edinburgh]
https://arxiv.org/abs/2209.04722

 

 
