LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics

Reposted from 爱可可爱生活

 

1、[CV] CoAtNet: Marrying Convolution and Attention for All Data Sizes

Z Dai, H Liu, Q V. Le, M Tan

[Google Research]

CoAtNet: marrying convolution and attention for all data sizes. Transformers have attracted growing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. This paper shows that while Transformers tend to have larger model capacity, their generalization can be worse than that of convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths of both architectures, the authors propose CoAtNets, a family of hybrid models built from two key insights: (1) depthwise convolution and self-attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective at improving generalization, capacity, and efficiency. Experiments show that CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets. Without extra data, CoAtNet reaches 86.0% top-1 accuracy on ImageNet; when pre-trained on the 13M images of ImageNet-21K, it reaches 88.56% top-1 accuracy, matching ViT-huge pre-trained on the 300M images of JFT-300M while using 23x less data; notably, further scaling CoAtNet up with JFT-3B yields 90.88% top-1 accuracy on ImageNet, setting a new state of the art.

Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets (pronounced “coat” nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.
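
As a rough illustration of insight (1) above, the sketch below adds a learned, translation-invariant bias w[i-j] (playing the role of a depthwise-convolution kernel) to the content-based attention logits before the softmax, which is the flavour of relative attention the abstract describes. The 1-D sequence setting, tensor shapes, and module name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention1D(nn.Module):
    """Illustrative 1-D relative self-attention: softmax(q·k/sqrt(d) + w[i-j]) · v.

    The learned bias w[i-j] depends only on the relative offset i-j, like a
    depthwise-convolution kernel, while q·k supplies the input-adaptive part.
    """
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        # one bias per possible relative offset in [-(max_len-1), max_len-1]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                       # x: (batch, length, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = torch.einsum("bid,bjd->bij", q, k) * self.scale
        # gather w[i-j] into an (n, n) matrix and add it to the content logits
        offsets = torch.arange(n)[:, None] - torch.arange(n)[None, :]
        logits = logits + self.rel_bias[offsets + self.max_len - 1]
        attn = F.softmax(logits, dim=-1)
        return torch.einsum("bij,bjd->bid", attn, v)

# usage: y = RelativeSelfAttention1D(dim=64, max_len=128)(torch.randn(2, 128, 64))
```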

 

 

2、[CV] One Shot Face Swapping on Megapixels

Y Zhu, Q Li, J Wang, C Xu, Z Sun

[Center for Research on Intelligent Perception and Computing & University of Macau]

One-shot face swapping at megapixel resolution. Face swapping has both positive applications, such as entertainment and human-computer interaction, and negative applications, such as DeepFake threats to politics and economics. It is therefore necessary to understand how advanced high-quality face swapping methods work, and to generate enough representative face-swapped images to train DeepFake detection algorithms. This paper proposes MegaFS, the first megapixel-level one-shot face swapping method. The proposed Hierarchical Representation Face Encoder (HieRFE) organizes face representations hierarchically in an extended latent space to preserve more facial details, instead of the compressed representations used in previous face swapping methods. A carefully designed Face Transfer Module (FTM) transfers the identity from the source image to the target along a non-linear trajectory, without explicit feature disentanglement. The swapped face is then synthesized by StyleGAN2, benefiting from its stable training and strong generative capability. Each part of MegaFS can be trained separately, so the model's GPU memory requirements remain manageable for megapixel face swapping. In summary, complete face representation, stable training, and limited memory usage are the three main contributions behind the method's success. Extensive experiments demonstrate the superiority of MegaFS, and the first megapixel-level face swapping dataset is released for research on DeepFake detection and face image editing in the public domain.

Face swapping has both positive applications such as entertainment, human-computer interaction, etc., and negative applications such as DeepFake threats to politics, economics, etc. Nevertheless, it is necessary to understand the scheme of advanced methods for high-quality face swapping and generate enough and representative face swapping images to train DeepFake detection algorithms. This paper proposes the first Megapixel level method for one shot Face Swapping (or MegaFS for short). Firstly, MegaFS organizes face representation hierarchically by the proposed Hierarchical Representation Face Encoder (HieRFE) in an extended latent space to maintain more facial details, rather than compressed representation in previous face swapping methods. Secondly, a carefully designed Face Transfer Module (FTM) is proposed to transfer the identity from a source image to the target by a non-linear trajectory without explicit feature disentanglement. Finally, the swapped faces can be synthesized by StyleGAN2 with the benefits of its training stability and powerful generative capability. Each part of MegaFS can be trained separately so the requirement of our model for GPU memory can be satisfied for megapixel face swapping. In summary, complete face representation, stable training, and limited memory usage are the three novel contributions to the success of our method. Extensive experiments demonstrate the superiority of MegaFS and the first megapixel level face swapping database is released for research on DeepFake detection and face image editing in the public domain. The dataset is at this link.
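
For orientation, a very rough sketch of the three-stage pipeline described above (hierarchical encoding with HieRFE, identity transfer with the FTM, synthesis with StyleGAN2). All modules below are untrained placeholders with assumed shapes; the real components are learned networks with different interfaces.

```python
import torch
import torch.nn as nn

# Placeholder dimensions: a typical StyleGAN2 W+ layout is assumed, not taken from the paper.
NUM_LEVELS, LATENT_DIM = 18, 512

# Stand-ins for the three learned stages, so the pipeline structure can be read end to end.
hierfe = nn.Sequential(nn.Flatten(), nn.LazyLinear(NUM_LEVELS * LATENT_DIM))  # image -> hierarchical latents
ftm = nn.LazyLinear(NUM_LEVELS * LATENT_DIM)                                  # identity transfer placeholder

def stylegan2(latents):
    # placeholder "generator": returns a dummy tensor standing in for the rendered megapixel face
    return torch.tanh(latents).view(-1, NUM_LEVELS, LATENT_DIM).mean(dim=1)

def swap_face(source_img, target_img):
    """Sketch of the MegaFS pipeline: encode -> transfer identity -> synthesize."""
    src = hierfe(source_img)                          # hierarchical latents of the source face
    tgt = hierfe(target_img)                          # hierarchical latents of the target face
    swapped = ftm(torch.cat([src, tgt], dim=-1))      # inject source identity into target latents
    return stylegan2(swapped)                         # render the swapped face

out = swap_face(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```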

 

 

3、[LG] Representing Long-Range Context for Graph Neural Networks with Global Attention

P Jain, Z Wu, M Wright, A Mirhoseini, JE Gonzalez…

[UC Berkeley & Google Brain]

Representing long-range context for graph neural networks with global attention. Graph neural networks (GNNs) are powerful architectures for structured datasets. However, current methods struggle to represent long-range dependencies. Scaling the depth or width of a GNN is not enough to broaden its receptive field, because larger GNNs run into optimization instabilities such as vanishing gradients and representation over-smoothing, while pooling-based approaches have yet to become as universally useful as they are in computer vision. This paper proposes using Transformer-based self-attention to learn long-range pairwise relationships, with a novel "readout" mechanism to obtain a global graph embedding. Inspired by recent computer vision results showing that position-invariant attention performs well at learning long-range relationships, the proposed method, GraphTrans, applies a permutation-invariant Transformer module after a standard GNN module. This simple architecture leads to state-of-the-art results on several graph classification tasks, outperforming methods that explicitly encode graph structure. The results suggest that purely learning-based approaches without graph structure may be well suited to learning high-level, long-range relationships on graphs.

Graph neural networks are powerful architectures for structured datasets. However, current methods struggle to represent long-range dependencies. Scaling the depth or width of GNNs is insufficient to broaden receptive fields as larger GNNs encounter optimization instabilities such as vanishing gradients and representation oversmoothing, while pooling-based approaches have yet to become as universally useful as in computer vision. In this work, we propose the use of Transformer-based self-attention to learn long-range pairwise relationships, with a novel “readout” mechanism to obtain a global graph embedding. Inspired by recent computer vision results that find position-invariant attention performant in learning long-range relationships, our method, which we call GraphTrans, applies a permutation-invariant Transformer module after a standard GNN module. This simple architecture leads to state-of-the-art results on several graph classification tasks, outperforming methods that explicitly encode graph structure. Our results suggest that purely-learning-based approaches without graph structure may be suitable for learning high-level, long-range relationships on graphs.
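
A minimal sketch of the GraphTrans recipe as described above: a standard GNN produces per-node embeddings, a learned <CLS>-style token is prepended as the "readout", a Transformer encoder without positional encodings (hence permutation-invariant) mixes them, and the <CLS> output serves as the global graph embedding. The simple mean-aggregation GNN layers and all hyperparameters below are placeholder assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GraphTransSketch(nn.Module):
    """Per-node GNN -> learned [CLS] token -> permutation-invariant Transformer readout."""
    def __init__(self, in_dim: int, hid_dim: int = 64, heads: int = 4):
        super().__init__()
        self.gnn1 = nn.Linear(in_dim, hid_dim)   # simple mean-aggregation, GCN-style layers
        self.gnn2 = nn.Linear(hid_dim, hid_dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, hid_dim))  # learned "readout" token
        layer = nn.TransformerEncoderLayer(hid_dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x, adj):
        # x: (batch, nodes, in_dim); adj: (batch, nodes, nodes), with self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        h = torch.relu(self.gnn1(adj @ x / deg))          # local message passing
        h = torch.relu(self.gnn2(adj @ h / deg))
        cls = self.cls.expand(h.size(0), -1, -1)
        h = torch.cat([cls, h], dim=1)                    # prepend readout token
        h = self.transformer(h)                           # no positional encodings
        return h[:, 0]                                    # global graph embedding

# usage (toy shapes): GraphTransSketch(16)(torch.randn(2, 10, 16), torch.ones(2, 10, 10))
```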

 

4、[CV] Efficient Visual Tracking with Exemplar Transformers

P Blatter, M Kanakis, M Danelljan, L V Gool

[ETH Zurich]

Efficient visual tracking with Exemplar Transformers. The design of more complex and powerful neural network models has significantly advanced the state of the art in visual object tracking. These advances can be attributed to deeper networks or to the introduction of new building blocks such as Transformers. However, in the pursuit of higher tracking performance, efficient tracking architectures have received surprisingly little attention. This paper proposes the Exemplar Transformer, an efficient Transformer for real-time visual object tracking. E.T.Track, a visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU, up to 8x faster than other Transformer-based models, making it the only real-time Transformer-based tracker. Compared with lightweight trackers that can run in real time on standard CPUs, E.T.Track consistently outperforms all other methods on the LaSOT, OTB-100, NFS, TrackingNet, and VOT-ST2020 datasets.

The design of more complex and powerful neural network models has significantly advanced the state-of-the-art in visual object tracking. These advances can be attributed to deeper networks, or to the introduction of new building blocks, such as transformers. However, in the pursuit of increased tracking performance, efficient tracking architectures have received surprisingly little attention. In this paper, we introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking. E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU. This is up to 8× faster than other transformer-based models, making it the only real-time transformer-based tracker. When compared to lightweight trackers that can operate in real-time on standard CPUs, E.T.Track consistently outperforms all other methods on the LaSOT [12], OTB-100 [32], NFS [18], TrackingNet [24] and VOT-ST2020 [19] datasets.
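
The sketch below illustrates one plausible reading of the efficiency idea behind an exemplar-style attention layer: a single globally pooled query attends to a small set of learned exemplar keys/values, so the attention cost does not grow with the spatial size of the feature map. How the exemplar context is folded back into the features is simplified here; consult the paper for the actual layer design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExemplarAttentionSketch(nn.Module):
    """Single pooled query attending to a few learned exemplars (cost independent of H*W)."""
    def __init__(self, dim: int, num_exemplars: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        # learned exemplar keys/values, shared across all inputs
        self.keys = nn.Parameter(torch.randn(num_exemplars, dim) * dim ** -0.5)
        self.values = nn.Parameter(torch.zeros(num_exemplars, dim))
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (batch, channels, H, W)
        b, c, h, w = x.shape
        q = self.q_proj(x.mean(dim=(2, 3)))     # one global query per image
        attn = F.softmax(q @ self.keys.t() * self.scale, dim=-1)   # (b, num_exemplars)
        ctx = attn @ self.values                                    # (b, channels)
        # broadcast the exemplar context back over the spatial map (simplified residual)
        return x + ctx.view(b, c, 1, 1)

# usage: ExemplarAttentionSketch(64)(torch.randn(1, 64, 16, 16))
```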

 

5、[CL] skweak: Weak Supervision Made Easy for NLP

P Lison, J Barnes, A Hubin

[Norwegian Computing Center Oslo & University of Oslo]

skweak: weak supervision made easy for NLP. This paper presents skweak, a versatile, Python-based software toolkit that enables NLP developers to apply weak supervision to a wide range of NLP tasks. Weak supervision is an emerging machine learning paradigm based on a simple idea: instead of labelling data points by hand, labelling functions derived from domain knowledge are used to automatically obtain annotations for a given dataset. The resulting labels are then aggregated with a generative model that estimates the accuracy (and possible confusions) of each labelling function. The skweak toolkit makes it easy to implement a wide range of labelling functions (such as heuristics, gazetteers, neural models, or linguistic constraints) on text data, apply them to a corpus, and aggregate the results in a fully unsupervised fashion.

We present skweak, a versatile, Python-based software toolkit enabling NLP developers to apply weak supervision to a wide range of NLP tasks. Weak supervision is an emerging machine learning paradigm based on a simple idea: instead of labelling data points by hand, we use labelling functions derived from domain knowledge to automatically obtain annotations for a given dataset. The resulting labels are then aggregated with a generative model that estimates the accuracy (and possible confusions) of each labelling function. The skweak toolkit makes it easy to implement a large spectrum of labelling functions (such as heuristics, gazetteers, neural models or linguistic constraints) on text data, apply them on a corpus, and aggregate their results in a fully unsupervised fashion. skweak is especially designed to facilitate the use of weak supervision for NLP tasks such as text classification and sequence labelling. We illustrate the use of skweak for NER and sentiment analysis.
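
To make the labelling-function idea concrete, here is a tiny from-scratch illustration that deliberately does not use skweak's actual API: two hand-written labelling functions vote on tokens and are combined by naive majority vote, whereas skweak aggregates the (spaCy-based) labelling functions with a generative model that estimates each function's accuracy.

```python
from collections import Counter

# Two toy labelling functions for a hypothetical NER-style task: each maps a
# token to a label or None ("abstain"). skweak's real labelling functions work
# on spaCy Doc objects and can be heuristics, gazetteers, models, etc.
CITIES = {"Oslo", "Paris"}

def lf_capitalized(token: str) -> str | None:
    return "ENT" if token[:1].isupper() else None

def lf_gazetteer(token: str) -> str | None:
    return "ENT" if token in CITIES else None

LABELLING_FUNCTIONS = [lf_capitalized, lf_gazetteer]

def aggregate(tokens: list[str]) -> list[str]:
    """Naive majority-vote aggregation; skweak instead fits an unsupervised
    generative model that weighs labelling functions by estimated accuracy."""
    labels = []
    for tok in tokens:
        votes = [v for v in (lf(tok) for lf in LABELLING_FUNCTIONS) if v is not None]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else "O")
    return labels

print(aggregate("Maria moved to Oslo last year".split()))
# -> ['ENT', 'O', 'O', 'ENT', 'O', 'O']
```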

 

Other papers worth noting:

 

[CV] JoJoGAN: One Shot Face Stylization

JoJoGAN: one-shot face stylization

M J Chong, D Forsyth

[University of Illinois at Urbana-Champaign]

 

[LG] Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters

Persia: an open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters

X Lian, B Yuan, X Zhu...

[Kwai Inc & Kuaishou Technology & ETH Zürich]

 

[CV] Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

Prototypical cross-attention networks for multiple object tracking and segmentation

L Ke, X Li, M Danelljan, Y Tai, C Tang, F Yu

[ETH Zürich & HKUST & Kuaishou Technology]

 

 

[CV] DSP-SLAM: Object Oriented SLAM with Deep Shape Priors

DSP-SLAM: object-oriented SLAM with deep shape priors

J Wang, M Rünz, L Agapito

[University College London]

 
