LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics
1、[CV] Attention Mechanisms in Computer Vision: A Survey
M Guo, T Xu, J Liu, Z Liu, P Jiang, T Mu, S Zhang, R R. Martin, M Cheng, S Hu
[Tsinghua University & Nankai University & Cardiff University]
A survey of attention mechanisms in computer vision. Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision to imitate this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight-adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. This paper provides a comprehensive review of attention mechanisms in computer vision, categorizes them by approach (e.g., channel attention, spatial attention, temporal attention, and branch attention), and proposes future directions for attention-mechanism research.
Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention and branch attention; a related repository https://github.com/MenghaoGuo/Awesome-Vision-Attentions is dedicated to collecting related work. We also suggest future directions for attention mechanism research.
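To make the "dynamic weight adjustment" idea concrete, below is a minimal PyTorch sketch of channel attention in the squeeze-and-excitation style, one of the families the survey covers. The layer sizes and reduction ratio are illustrative assumptions, not reference code from the survey.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: reweight channels based on the input."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze: global average pooling summarizes each channel.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: a bottleneck MLP produces per-channel weights in [0, 1].
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # dynamic, input-dependent reweighting of channels

x = torch.randn(2, 64, 32, 32)
print(ChannelAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

Spatial, temporal, and branch attention follow the same pattern, differing only in which axis of the feature tensor the learned weights modulate.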
https://weibo.com/1402400261/L1Qaf8d8w
2、[CV] Are we ready for a new paradigm shift? A Survey on Visual Deep MLP
R Liu, Y Li, D Liang, L Tao, S Hu, H Zheng
[Tsinghua University]
A survey on visual deep MLPs. The multilayer perceptron (MLP), the first neural network architecture to appear, was long constrained by hardware computing power and dataset sizes and sank into obscurity for decades. During that period, we witnessed a paradigm shift from manual feature extraction to CNNs with local receptive fields, and then to Transformers with global receptive fields based on self-attention. This year (2021), with the introduction of MLP-Mixer, the MLP re-entered the limelight and attracted extensive research from the computer vision community. Compared with the conventional MLP, it is deeper but changes the input from full flattening to patch flattening. Given its high performance and reduced need for vision-specific inductive biases, the community cannot help but ask: will the deep MLP, the simplest structure with a global receptive field but no attention, become a new computer vision paradigm? To answer this question, this paper provides a comprehensive overview of recent deep MLP models in vision, reviewing them in detail from subtle sub-module design to overall network structure. It compares the receptive fields, computational complexity, and other properties of different network designs to clarify the development path of MLPs. The resolution sensitivity and computational density of MLPs remain unresolved, and pure MLPs are gradually evolving to resemble CNNs. The current data volume and computing power are not yet ready to embrace pure MLPs, and artificial visual guidance remains important. The paper concludes with views on open research directions and potential future work.
Multilayer perceptron (MLP), the first neural network structure to appear, was a big hit, but, constrained by hardware computing power and dataset sizes, it sank into obscurity for decades. During this period, we have witnessed a paradigm shift from manual feature extraction to the CNN with local receptive field, and further to the Transformer with global receptive field based on the self-attention mechanism. This year (2021), with the introduction of MLP-Mixer, the MLP has re-entered the limelight and attracted extensive research from the computer vision community. Compared to the conventional MLP, it gets deeper but changes the input from full flattening to patch flattening. Given its high performance and lesser need for vision-specific inductive bias, the community can't help but wonder: will deep MLP, the simplest structure with a global receptive field but no attention, become a new computer vision paradigm? To answer this question, this survey aims to provide a comprehensive overview of the recent development of deep MLP models in vision. Specifically, we review these MLPs in detail, from subtle sub-module design to global network structure. We compare the receptive field, computational complexity, and other properties of different network designs in order to understand the development path of MLPs clearly. The investigation shows that MLPs' resolution sensitivity and computational density remain unresolved, and pure MLPs are gradually evolving towards CNN-like designs. We suggest that the current data volume and computational power are not yet ready to embrace pure MLPs, and that artificial visual guidance remains important. Finally, we offer our viewpoint on open research directions and potential future work. We hope this effort will ignite further interest in the community and encourage better vision-tailored designs for neural networks in the future.
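A minimal PyTorch sketch of one MLP-Mixer block, assuming pre-norm residual connections and GELU as in the original MLP-Mixer paper; the hidden sizes are illustrative. Note how the token-mixing MLP's weights are tied to a fixed number of patches, which is exactly the resolution sensitivity the survey highlights.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches: int, dim: int,
                 token_hidden: int, channel_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixing: an MLP across the patch axis (global receptive field,
        # but input size is fixed to num_patches).
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel mixing: an ordinary per-patch MLP across features.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

x = torch.randn(2, 196, 512)  # 14x14 patches from a 224x224 image
print(MixerBlock(196, 512, 256, 2048)(x).shape)  # torch.Size([2, 196, 512])
```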
https://weibo.com/1402400261/L1QcRijSx
3、[CL] Textless Speech Emotion Conversion using Decomposed and Discrete Representations
F Kreuk, A Polyak, J Copet, E Kharitonov, T Nguyen, M Rivière, W Hsu, A Mohamed, E Dupoux, Y Adi
[Bar-Ilan University & Facebook AI Research]
Textless speech emotion conversion using decomposed and discrete representations. Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving its lexical content and speaker identity. This paper casts emotion conversion as a spoken-language translation task. Speech is decomposed into discrete, disentangled learned representations consisting of content units, F0, speaker, and emotion. First, the speech content is modified by translating the content units to a target emotion; prosodic features are then predicted from these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. This paradigm goes beyond spectral and parametric changes to the signal and can model non-verbal vocalizations, such as inserting laughter or removing yawns. The proposed method is shown, both objectively and subjectively, to outperform the baselines in perceived emotion and audio quality. All components of this complex system are rigorously evaluated, and an extensive model analysis and ablation study highlight the architectural choices, strengths, and weaknesses of the proposed approach.
Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.
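A skeletal sketch of the decompose-translate-resynthesize data flow the abstract describes. Every component below is a hypothetical stub standing in for the paper's actual models (a self-supervised unit extractor, a unit-translation model, a prosody predictor, and a neural vocoder); only the overall flow follows the paper.

```python
from dataclasses import dataclass

@dataclass
class Decomposed:
    content_units: list[int]   # discrete pseudo-phonetic units
    f0: list[float]            # pitch contour
    speaker: int               # speaker identity
    emotion: str               # source emotion label

def decompose(waveform: list[float]) -> Decomposed:
    # Stub: a real system runs self-supervised unit discovery here.
    return Decomposed(content_units=[3, 3, 7, 12], f0=[110.0] * 4,
                      speaker=0, emotion="neutral")

def translate_units(units: list[int], target_emotion: str) -> list[int]:
    # Stub: a seq2seq model may insert or delete units (e.g. add laughter
    # units), which mere spectral transforms cannot do.
    return units + [42] if target_emotion == "amused" else units

def predict_prosody(units: list[int], target_emotion: str) -> list[float]:
    # Stub: the real predictor conditions F0/duration on units and emotion.
    return [150.0] * len(units)

def vocode(units, f0, speaker) -> list[float]:
    # Stub: a neural vocoder renders the final waveform.
    return [0.0] * (len(units) * 160)

src = decompose([0.0] * 16000)
units = translate_units(src.content_units, "amused")
wav = vocode(units, predict_prosody(units, "amused"), src.speaker)
print(len(units), len(wav))
```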
https://weibo.com/1402400261/L1QgriJf1
4、[LG] DriverGym: Democratising Reinforcement Learning for Autonomous Driving
P Kothari, C Perone, L Bergamini, A Alahi, P Ondruska
[Woven Planet & EPFL]
DriverGym: democratizing reinforcement learning for autonomous driving. Despite promising progress in reinforcement learning (RL), developing algorithms for autonomous driving (AD) remains challenging; a key issue is the lack of an open-source platform for training and effectively validating RL policies on real-world data. This paper presents DriverGym, an open-source, OpenAI Gym-compatible environment tailored for developing RL algorithms for autonomous driving. DriverGym provides access to more than 1,000 hours of expert-logged data and supports both reactive and data-driven agent behavior. Using an extensive and flexible closed-loop evaluation protocol, the performance of an RL policy can easily be validated on real-world data. The paper also provides behavior-cloning baselines, trained in DriverGym using supervised learning and RL.
Despite promising progress in reinforcement learning (RL), developing algorithms for autonomous driving (AD) remains challenging: one of the critical issues being the absence of an open-source platform capable of training and effectively validating the RL policies on real-world data. We propose DriverGym, an open-source OpenAI Gym-compatible environment specifically tailored for developing RL algorithms for autonomous driving. DriverGym provides access to more than 1000 hours of expert logged data and also supports reactive and data-driven agent behavior. The performance of an RL policy can be easily validated on real-world data using our extensive and flexible closed-loop evaluation protocol. In this work, we also provide behavior cloning baselines using supervised learning and RL, trained in DriverGym. Code and videos are available on the L5Kit repository.
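A hypothetical usage sketch following the standard OpenAI Gym interface the abstract says DriverGym is compatible with. The environment id "L5-CLE-v0" and the classic (2021-era) Gym step API are assumptions; consult the L5Kit repository for the actual registration and configuration.

```python
import gym  # plus l5kit installed so the DriverGym env is registered

env = gym.make("L5-CLE-v0")  # assumed DriverGym environment id
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # replace with a trained RL policy
    obs, reward, done, info = env.step(action)  # classic 4-tuple Gym API
    total_reward += reward
print("episode return:", total_reward)
```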
https://weibo.com/1402400261/L1QkViT0a
5、[CV] LiT: Zero-Shot Transfer with Locked-image Text Tuning
X Zhai, X Wang, B Mustafa, A Steiner, D Keysers, A Kolesnikov, L Beyer
[Google Research]
LiT: zero-shot transfer with locked-image text tuning. This paper introduces contrastive tuning, a simple method that uses contrastive training to align image and text models while still taking advantage of their pre-training. An empirical study finds that a locked pre-trained image model paired with an unlocked text model works best. This instance of contrastive tuning is called "Locked-image Text tuning" (LiT-tuning): it simply teaches the text model to read out good representations from a pre-trained image model for new tasks. A LiT-tuned model gains the ability to transfer zero-shot to new vision tasks such as image classification or retrieval. LiT-tuning is broadly applicable: it works reliably across multiple pre-training methods (supervised and unsupervised) and diverse architectures (ResNet, Vision Transformers, and MLP-Mixer) on three different image-text datasets. With a transformer-based pre-trained ViT-g/14 model, the LiT-tuned model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set and 81.1% on the challenging out-of-distribution ObjectNet test set.
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Text tuning" (LiT-tuning), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT-tuning is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT-tuned model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.
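A minimal PyTorch sketch of the locked-image contrastive-tuning idea: freeze a pre-trained image tower and train only the text tower with a CLIP-style symmetric contrastive loss. The encoders here are tiny linear stand-ins and the temperature is an assumed constant, not the ViT-g/14 setup from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_tower = nn.Linear(2048, 512)   # stand-in for a pre-trained image model
text_tower = nn.Linear(768, 512)     # stand-in for the text model being tuned

for p in image_tower.parameters():   # "locked": image weights stay frozen
    p.requires_grad = False

opt = torch.optim.Adam(text_tower.parameters(), lr=1e-4)
temperature = 0.07  # assumed value

img_feats, txt_feats = torch.randn(8, 2048), torch.randn(8, 768)
img = F.normalize(image_tower(img_feats), dim=-1)
txt = F.normalize(text_tower(txt_feats), dim=-1)

logits = img @ txt.t() / temperature              # pairwise similarities
labels = torch.arange(logits.size(0))             # matched pairs on the diagonal
loss = (F.cross_entropy(logits, labels) +         # image-to-text direction
        F.cross_entropy(logits.t(), labels)) / 2  # text-to-image direction
loss.backward()                                   # only the text tower gets gradients
opt.step()
print(float(loss))
```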
https://weibo.com/1402400261/L1Qo4lsvq
Several other papers worth noting:
[CV] AnimeCeleb: Large-Scale Animation CelebFaces Dataset via Controllable 3D Synthetic Models
K Kim, S Park, J Lee, S Chung, J Lee, J Choo
[KAIST & DGIST & Korea University & Naver Webtoon]
https://weibo.com/1402400261/L1QrLBeG5
[CV] Category-orthogonal object features guide information processing in recurrent neural networks trained for object categorization
S Thorat, G Aldegheri, T C. Kietzmann
[Donders Institute for Brain, Cognition and Behaviour]
https://weibo.com/1402400261/L1Qu22ltL
[CV] iBOT: Image BERT Pre-Training with Online Tokenizer
J Zhou, C Wei, H Wang, W Shen, C Xie, A Yuille, T Kong
[ByteDance & Johns Hopkins University & Shanghai Jiao Tong University & UC Santa Cruz]
https://weibo.com/1402400261/L1QwF7TfW
[CV] Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation
D Acuna, J Philion, S Fidler
[NVIDIA]
https://weibo.com/1402400261/L1QyVcL3u