LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics

Reposted from 爱可可爱生活

 

1. [CV] Vision Transformer with Deformable Attention

Z Xia, X Pan, S Song, L E Li, G Huang

[Tsinghua University & AWS AI]

A Vision Transformer with deformable attention. Transformers have recently shown superior performance on various vision tasks, with large, sometimes even global, receptive fields giving them higher representation power than their CNN counterparts. Simply enlarging the receptive field, however, raises concerns: dense attention, as in ViT, incurs excessive memory and computational cost and lets irrelevant regions beyond the area of interest influence the features, while the sparse attention adopted in PVT or Swin Transformer is data-agnostic and may limit the ability to model long-range relations. To mitigate these issues, the paper proposes a novel deformable self-attention module in which the positions of the key-value pairs are selected in a data-dependent way, letting self-attention focus on relevant regions and capture more informative features. On this basis, the authors present the Deformable Attention Transformer, a general backbone with deformable attention for both image classification and dense prediction, which achieves consistently improved results on comprehensive benchmarks.

Transformers have recently shown superior performance on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also gives rise to several concerns. On the one hand, using dense attention, e.g., in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts beyond the region of interest. On the other hand, the sparse attention adopted in PVT or Swin Transformer is data-agnostic and may limit the ability to model long-range relations. To mitigate these issues, we propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way. This flexible scheme enables the self-attention module to focus on relevant regions and capture more informative features. On this basis, we present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks. Extensive experiments show that our models achieve consistently improved results on comprehensive benchmarks. Code is available at https://github.com/LeapLabTHU/DAT.
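Below is a minimal PyTorch sketch of the deformable attention idea: a small offset network predicts data-dependent 2D shifts for a coarse grid of reference points, keys and values are bilinearly sampled at the shifted locations, and all queries attend to this compact, deformed set. This is a simplified illustration, not the authors' implementation (see https://github.com/LeapLabTHU/DAT); all module and parameter names are made up, and details such as per-head offset groups and relative position bias are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Illustrative single-scale deformable attention (not the paper's code)."""
    def __init__(self, dim, n_heads=4, grid_size=8):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.scale = self.head_dim ** -0.5
        self.grid_size = grid_size
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # predicts one 2D offset per reference point from pooled query features
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim), nn.GELU(),
            nn.Conv2d(dim, 2, 1))

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        g = self.grid_size
        # uniform reference points on a coarse g x g grid, in [-1, 1], as (x, y)
        lin = torch.linspace(-1, 1, g, device=x.device)
        ref = torch.stack(torch.meshgrid(lin, lin, indexing='ij'), -1).flip(-1)
        # data-dependent offsets deform the reference points
        off = self.offset_net(F.adaptive_avg_pool2d(x, g))  # (B, 2, g, g)
        grid = (ref + off.permute(0, 2, 3, 1).tanh()).clamp(-1, 1)
        # bilinearly sample features at deformed points -> compact keys/values
        kv = F.grid_sample(x, grid, align_corners=True)     # (B, C, g, g)
        k, v = self.kv(kv.flatten(2).transpose(1, 2)).chunk(2, -1)
        q = self.q(x.flatten(2).transpose(1, 2))            # (B, HW, C)
        split = lambda t: t.view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # every query attends to the small, data-dependent key/value set
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(-1)
        out = (attn @ v).transpose(1, 2).reshape(B, H * W, C)
        return self.proj(out).transpose(1, 2).reshape(B, C, H, W)
```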

 

2. [LG] On the Role of Neural Collapse in Transfer Learning

T Galanti, A György, M Hutter

[DeepMind]

On the role of neural collapse in transfer learning. The paper studies the ability of foundation models to learn classification representations that transfer to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with those learned by special-purpose few-shot algorithms. The paper explains this behavior via a recently observed phenomenon: the features learned by overparameterized classification networks exhibit an interesting clustering property called neural collapse. It is shown, both theoretically and empirically, that neural collapse generalizes to new samples from the training classes and, more importantly, to new classes as well, allowing foundation models to provide feature maps that work well in transfer learning, particularly in the few-shot setting.

We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. In this paper we provide an explanation for this behavior based on the recently observed phenomenon that the features learned by overparameterized classification networks show an interesting clustering property, called neural collapse. We demonstrate both theoretically and empirically that neural collapse generalizes to new samples from the training classes, and – more importantly – to new classes as well, allowing foundation models to provide feature maps that work well in transfer learning and, specifically, in the few-shot setting.
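To make the claim concrete, here is a minimal NumPy sketch (with assumed feature inputs, not the paper's code) of the quantity at the heart of the argument: within-class feature variability relative to the spread of class means, which neural collapse drives toward zero, together with the nearest-class-mean classifier that such collapsed features make effective in few-shot transfer.

```python
import numpy as np

def collapse_ratio(features, labels):
    """Within-class variance over between-class variance (-> 0 under collapse)."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    within = np.mean([((features[labels == c] - means[i]) ** 2).sum(axis=1).mean()
                      for i, c in enumerate(classes)])
    between = ((means - means.mean(axis=0)) ** 2).sum(axis=1).mean()
    return within / between

def nearest_class_mean(support_x, support_y, query_x):
    """Few-shot classifier: assign each query to its closest class mean."""
    classes = np.unique(support_y)
    means = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    dists = ((query_x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]
```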


3. [CV] Robust Contrastive Learning Using Negative Samples with Diminished Semantics

S Ge, S Mishra, H Wang, C Li, D Jacobs

[University of Maryland & CMU & Google Cloud AI]

Robust contrastive learning using negative samples with diminished semantics. Unsupervised learning has recently made exceptional progress thanks to more effective contrastive learning methods, yet CNNs are prone to relying on low-level features that humans deem non-semantic, a dependency conjectured to cause a lack of robustness to image perturbations and domain shift. The paper shows that contrastive learning with carefully designed negative samples yields more robust representations with less dependence on such features. Whereas positive pairs preserve semantic information while perturbing superficial features of the training images, the proposed negatives are generated the opposite way, preserving only the superfluous rather than the semantic features. Two generation methods are developed, texture-based and patch-based augmentation; the resulting samples improve generalization, especially under out-of-domain settings. Analysis of the method and the generated texture-based samples shows that texture features are indispensable for classifying particular, especially finer-grained, ImageNet classes, and that model bias toward texture versus shape features differs across test settings.

Unsupervised learning has recently made exceptional progress because of the development of more effective contrastive learning methods. However, CNNs are prone to depend on low-level features that humans deem non-semantic. This dependency has been conjectured to induce a lack of robustness to image perturbations or domain shift. In this paper, we show that by generating carefully designed negative samples, contrastive learning can learn more robust representations with less dependence on such features. Contrastive learning utilizes positive pairs that preserve semantic information while perturbing superficial features in the training images. Similarly, we propose to generate negative samples in a reversed way, where only the superfluous instead of the semantic features are preserved. We develop two methods, texture-based and patch-based augmentations, to generate negative samples. These samples achieve better generalization, especially under out-of-domain settings. We also analyze our method and the generated texture-based samples, showing that texture features are indispensable in classifying particular ImageNet classes and especially finer classes. We also show that model bias favors texture and shape features differently under different test settings.
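A minimal PyTorch sketch of the patch-based idea (illustrative, not the authors' code): retiling an image from shuffled patches destroys object-level semantics while largely preserving low-level texture statistics, and the result can be added to a standard contrastive loss as an extra, semantics-diminished negative.

```python
import torch

def patch_shuffle(img, patch=16):
    """Shuffle non-overlapping patches of a (C, H, W) image tensor."""
    C, H, W = img.shape
    gh, gw = H // patch, W // patch
    patches = (img[:, :gh * patch, :gw * patch]
               .unfold(1, patch, patch)                # window along H
               .unfold(2, patch, patch)                # (C, gh, gw, patch, patch)
               .reshape(C, gh * gw, patch, patch))
    patches = patches[:, torch.randperm(gh * gw)]      # destroy global layout
    return (patches.reshape(C, gh, gw, patch, patch)
            .permute(0, 1, 3, 2, 4)                    # re-tile rows of patches
            .reshape(C, gh * patch, gw * patch))

# e.g.: negative = patch_shuffle(image)  # texture kept, semantics scrambled
```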

 

4. [CV] Splicing ViT Features for Semantic Appearance Transfer

N Tumanyan, O Bar-Tal, S Bagon, T Dekel

[The Weizmann Inst. of Science]

Semantic appearance transfer by splicing ViT features. The paper presents a method for semantically transferring the visual appearance of one natural image to another: the goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. The method trains a generator given only a single structure/appearance image pair as input. To integrate semantic information, a pivotal component of this task, the key idea is to leverage a pre-trained, fixed Vision Transformer (ViT) as an external semantic prior: novel representations of structure and appearance are derived from deep ViT features, disentangled from the learned self-attention modules. An objective function then splices the desired structure and appearance representations, interweaving them in ViT feature space. The resulting framework, termed "Splice", involves no adversarial training, requires no additional inputs such as semantic segmentation or correspondences, and produces high-resolution (e.g., HD) results, with high quality demonstrated on a variety of in-the-wild image pairs under significant variations in object number, pose, and appearance.

We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are “painted” with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework—a pivotal component in tackling this task—our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model which serves as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, untwisting them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Our framework, which we term “Splice”, does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and can generate high resolution results, e.g., work in HD. We demonstrate high quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose and appearance.
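A minimal PyTorch sketch of the splicing objective, written against precomputed ViT tokens; obtaining the [CLS] and spatial key tokens from a frozen DINO-ViT (e.g., via forward hooks) is assumed and not shown. The structure term matches the cosine self-similarity of deep keys, the appearance term matches the [CLS] token, and `lam` is an illustrative weight, not the paper's value.

```python
import torch
import torch.nn.functional as F

def self_similarity(keys):                     # keys: (N, D) spatial key tokens
    k = F.normalize(keys, dim=-1)
    return k @ k.t()                           # (N, N) cosine self-similarity

def splice_loss(gen_cls, gen_keys, app_cls, struct_keys, lam=0.1):
    """Appearance from the [CLS] token, structure from key self-similarity."""
    l_app = F.mse_loss(gen_cls, app_cls)                       # appearance term
    l_struct = F.mse_loss(self_similarity(gen_keys),
                          self_similarity(struct_keys))        # structure term
    return l_app + lam * l_struct
```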


5. [AS] Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme

V Popov, I Vovk, V Gogoryan, T Sadekova, M Kudinov, J Wei

[Huawei Noah’s Ark Lab]

Diffusion-based voice conversion with a fast maximum-likelihood sampling scheme. Voice conversion is a common speech synthesis task that can be solved in different ways depending on the real-world scenario. The most challenging setting, often referred to as one-shot many-to-many voice conversion, copies the target voice from a single reference utterance in the most general case where neither the source nor the target speaker belongs to the training dataset. The paper presents a scalable, high-quality solution based on diffusion probabilistic modeling and demonstrates its superior quality over state-of-the-art one-shot voice conversion approaches. Moreover, with real-time applications in mind, it investigates general principles that make diffusion models faster while keeping synthesis quality high, and develops a novel stochastic differential equation solver applicable to various diffusion model types and generative tasks, validated through empirical studies and justified by theoretical analysis.

Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario. The most challenging one often referred to as one-shot many-to-many voice conversion consists in copying the target voice from only one reference utterance in the most general case when both source and target speakers do not belong to the training dataset. We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on real-time applications, we investigate general principles which can make diffusion models faster while keeping synthesis quality at a high level. As a result, we develop a novel Stochastic Differential Equations solver suitable for various diffusion model types and generative tasks as shown through empirical studies and justify it by theoretical analysis.
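For context, here is a minimal sketch of the baseline that such work accelerates: plain Euler-Maruyama integration of the reverse-time SDE of a VP-type score-based diffusion model. The paper's contribution is a different, maximum-likelihood-oriented solver that reaches comparable quality in far fewer steps; `score_fn` (a trained network approximating the score ∇ log p_t(x)) and `beta` (the noise schedule) are assumed inputs.

```python
import torch

def reverse_sde_sample(score_fn, x_T, beta, n_steps=100):
    """Euler-Maruyama for dx = [-beta/2 * x - beta * score] dt + sqrt(beta) dW,
    integrated backwards from t=1 to t=0."""
    x = x_T
    dt = 1.0 / n_steps
    for i in reversed(range(n_steps)):
        t = torch.full((x.shape[0],), (i + 1) * dt, device=x.device)
        b = beta(t).view(-1, *([1] * (x.dim() - 1)))   # broadcast over features
        drift = -0.5 * b * x - b * score_fn(x, t)      # reverse-time drift
        # (a practical sampler would skip the noise on the final step)
        x = x - drift * dt + (b * dt).sqrt() * torch.randn_like(x)
    return x
```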


A few more papers worth noting:

 

[LG] Smooth Bilevel Programming for Sparse Regularization


C Poon, G Peyré

[University of Bath & PSL University]


[CV] CausalX: Causal Explanations and Block Multilinear Factor Analysis


M. A O. Vasilescu, E Kim, X S. Zeng

[Tensor Vision Technologies & University of California, Los Angeles]


[CV] Fast and High-Quality Image Denoising via Malleable Convolutions


Y Jiang, B Wronski, B Mildenhall, J Barron, Z Wang, T Xue

[Google Research & University of Texas at Austin]


[LG] A Unifying and Canonical Description of Measure-Preserving Diffusions


A Barp, S Takao, M Betancourt, A Arnaudon, M Girolami

