爱可可 AI Frontier Picks (5.20)

LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics

Reposted from 爱可可爱生活

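Summary: masked autoencoders as spatiotemporal learners; learning full-body dense correspondence maps; overcoming few-shot prompt order sensitivity; automatic rule induction for efficient semi-supervised learning; unsupervised video and image segmentation by anticipating motion; responsibility in an algorithmic society; turning documents into dialogues; deploying self-supervised learning in the wild for hybrid automatic speech recognition; meta-learning sparse compression networks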

1. [CV] Masked Autoencoders As Spatiotemporal Learners

C Feichtenhofer, H Fan, Y Li, K He

[Facebook AI Research (FAIR)]

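Masked autoencoders as spatiotemporal learners. This paper studies a conceptually simple extension of the masked autoencoder (MAE) for learning spatiotemporal representations from video. Spacetime patches in a video are masked at random, and an autoencoder is trained to reconstruct them at the pixel level. The method learns strong representations with almost no spatiotemporal inductive bias (apart from patch and positional embeddings), and spacetime-agnostic random masking works best. The optimal masking ratio is as high as 90% (versus 75% for images), which supports the hypothesis that this ratio is tied to the information redundancy of the data. The high masking ratio yields a large speedup, for example more than 4× in wall-clock time. The paper reports competitive results on several challenging video datasets using vanilla Vision Transformers, and MAE pre-training outperforms supervised pre-training by a large margin. Results from training on real-world, uncurated Instagram data are also encouraging. The study suggests that the general framework of masked autoencoding can serve as a unified methodology for representation learning with minimal domain knowledge.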

This paper studies a conceptually simple extension of Masked Autoencoders (MAE) [31] to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (only except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images [31]), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4× in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers [18]. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT [15], MAE [31], etc.) can be a unified methodology for representation learning with minimal domain knowledge.

https://arxiv.org/abs/2205.09113
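The spacetime-agnostic random masking described above can be illustrated with a minimal sketch. It only shows how visible and masked patch indices might be chosen at a 90% ratio; the function name, the grid sizes `t`, `h`, `w`, and the flat indexing are assumptions made for illustration and are not taken from the paper's code.

```python
import random

def random_spacetime_mask(t, h, w, mask_ratio=0.9, seed=None):
    """Choose visible patches uniformly at random over the whole spacetime grid.

    t, h, w: number of patches along time, height and width (hypothetical names).
    Returns (visible, masked): two sorted lists of flat patch indices.
    """
    rng = random.Random(seed)
    num_patches = t * h * w
    num_visible = max(1, round(num_patches * (1 - mask_ratio)))
    # Sampling over the flat index range ignores any time/space structure,
    # which is what makes the masking "spacetime-agnostic".
    visible = sorted(rng.sample(range(num_patches), num_visible))
    visible_set = set(visible)
    masked = [i for i in range(num_patches) if i not in visible_set]
    return visible, masked

# Example grid: 8 temporal x 14 x 14 spatial patches (sizes chosen for illustration).
visible, masked = random_spacetime_mask(t=8, h=14, w=14, mask_ratio=0.9, seed=0)
print(len(visible), len(masked))
```

If, as in image MAE, the encoder only processes the visible patches, roughly a tenth of the tokens reach it at this ratio, which is consistent with the large wall-clock speedup the abstract reports.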

2. [CV] BodyMap: Learning Full-Body Dense Correspondence Map

A Ianina, N Sarafianos, Y Xu, I Rocco, T Tung

[Moscow Institute of Physics and Technology & Meta AI & Meta Reality Labs Research]

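BodyMap: learning full-body dense correspondence maps. Dense correspondence between humans carries strong semantic information that can be used to solve fundamental full-body understanding problems such as in-the-wild surface matching, tracking, and reconstruction. The paper presents BodyMap, a new framework for obtaining high-definition, full-body, continuous dense correspondence between in-the-wild images of clothed humans and the surface of a 3D template model. The correspondences cover fine details such as hands and hair, and also capture regions far from the body surface, such as loose clothing. Earlier methods for estimating such dense surface correspondence either (i) cut the 3D body into parts that are unwrapped into a 2D UV space, which produces discontinuities along the part seams, or (ii) represent the whole body with a single surface but do not handle body details. The paper introduces a new Vision Transformer-based network architecture that learns fine-grained features on a continuous body surface. BodyMap outperforms prior work by a large margin on a range of metrics and datasets, including DensePose-COCO. The paper also demonstrates several applications, including multi-layer dense cloth correspondence, neural rendering with novel-view synthesis, and appearance swapping.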

Dense correspondence between humans carries powerful semantic information that can be utilized to solve fundamental problems for full-body understanding such as in-the-wild surface matching, tracking and reconstruction. In this paper we present BodyMap, a new framework for obtaining high-definition full-body and continuous dense correspondence between in-the-wild images of clothed humans and the surface of a 3D template model. The correspondences cover fine details such as hands and hair, while capturing regions far from the body surface, such as loose clothing. Prior methods for estimating such dense surface correspondence i) cut a 3D body into parts which are unwrapped to a 2D UV space, producing discontinuities along part seams, or ii) use a single surface for representing the whole body, but none handled body details. Here, we introduce a novel network architecture with Vision Transformers that learn fine-level features on a continuous body surface. BodyMap outperforms prior work on various metrics and datasets, including DensePose-COCO by a large margin. Furthermore, we show various applications ranging from multi-layer dense cloth correspondence, neural rendering with novel-view synthesis and appearance swapping.

https://arxiv.org/abs/2205.09111

3. [CL] Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

Y Lu, M Bartolo, A Moore, S Riedel, P Stenetorp

[University College London & Mishcon de Reya LLP]

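Fantastically ordered prompts and where to find them: overcoming few-shot prompt order sensitivity. When given only a handful of training samples, very large pretrained language models such as GPT-3 show results competitive with fully supervised, fine-tuned large pretrained language models. The paper demonstrates that the order in which the samples are provided can make the difference between near state-of-the-art and random-guess performance: for the same prompt, some permutations are “fantastic” and others are not. A detailed analysis establishes that the phenomenon is present across model sizes (even for the largest current models), is not tied to a specific subset of samples, and that a permutation that works well for one model does not transfer to another. A development set could be used to determine which permutations perform well, but that would depart from the true few-shot setting because it requires additional annotated data. Instead, the method uses the generative nature of language models to construct an artificial development set and identifies performant prompts from entropy statistics of the candidate permutations on that set. The method yields a 13% relative improvement for GPT-family models across eleven established text classification tasks.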

When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, finetuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are “fantastic” and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.

https://arxiv.org/abs/2104.08786
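The idea of ranking candidate orderings by an entropy statistic can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: it scores each ordering by the entropy of the labels predicted on an artificial probing set, so orderings that collapse onto a single label score lowest. `predict_label` is a hypothetical callable standing in for the language model, and `toy_predict` is a made-up stub used only to make the example run.

```python
import math
from collections import Counter
from itertools import permutations

def label_entropy(predicted_labels):
    """Shannon entropy (in nats) of the empirical distribution of predicted labels."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def rank_orderings(examples, probes, predict_label):
    """Score every ordering of the few-shot examples on the probing set.

    predict_label(ordering, probe) is a hypothetical wrapper around a language
    model; it returns one predicted label for the probe.
    """
    scored = []
    for ordering in permutations(examples):
        preds = [predict_label(ordering, probe) for probe in probes]
        scored.append((label_entropy(preds), ordering))
    # Highest entropy first; orderings that always predict one label score 0.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

def toy_predict(ordering, probe):
    """Toy stand-in for a model: copies the last example's label unless it sees 'good'."""
    if "good" in probe:
        return "pos"
    return ordering[-1][1]

examples = [("great film", "pos"), ("dull plot", "neg"), ("loved it", "pos")]
probes = ["good acting", "weak script", "good pacing", "boring"]
ranking = rank_orderings(examples, probes, toy_predict)
best_entropy, best_ordering = ranking[0]
print(best_entropy, best_ordering)
```

Enumerating all permutations is only feasible for a handful of examples; with more examples, a random sample of orderings would be the natural substitute.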

4. [CL] Automatic Rule Induction for Efficient Semi-Supervised Learning

R Pryzant, Z Yang, Y Xu, C Zhu, M Zeng

[Microsoft Cognitive Services Research Group]

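Automatic rule induction for efficient semi-supervised learning. Semi-supervised learning has shown promise in allowing NLP models to generalize from small amounts of labeled data. Meanwhile, pretrained Transformer models act as black-box correlation engines that are hard to explain and sometimes behave unreliably. The paper proposes tackling both challenges with Automatic Rule Induction (ARI), a simple, general-purpose framework for automatically discovering symbolic rules and integrating them into pretrained Transformer models. Weak symbolic rules are first extracted from low-capacity machine learning models trained on small amounts of labeled data; an attention mechanism then integrates these rules into a high-capacity pretrained Transformer; finally, the rule-augmented system becomes part of a self-training framework that boosts the supervision signal on unlabeled data. These steps can be layered beneath a variety of existing weakly supervised and semi-supervised NLP algorithms to improve performance and interpretability. Experiments on nine sequence classification and relation extraction tasks suggest that ARI can improve state-of-the-art methods with no additional manual effort and only minimal computational overhead.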

Semi-supervised learning has shown promise in allowing NLP models to generalize from small amounts of labeled data. Meanwhile, pretrained transformer models act as black-box correlation engines that are difficult to explain and sometimes behave unreliably. In this paper, we propose tackling both of these challenges via Automatic Rule Induction (ARI), a simple and general-purpose framework for the automatic discovery and integration of symbolic rules into pretrained transformer models. First, we extract weak symbolic rules from low-capacity machine learning models trained on small amounts of labeled data. Next, we use an attention mechanism to integrate these rules into high-capacity pretrained transformer models. Last, the rule-augmented system becomes part of a self-training framework to boost supervision signal on unlabeled data. These steps can be layered beneath a variety of existing weak supervision and semi-supervised NLP algorithms in order to improve performance and interpretability. Experiments across nine sequence classification and relation extraction tasks suggest that ARI can improve state-of-the-art methods with no manual effort and minimal computational overhead.

https://arxiv.org/abs/2205.09067
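The first two ARI steps (inducing weak rules, then mixing them with a stronger model) can be pictured with a toy sketch. Everything concrete here is an assumption for illustration: token-to-label counting stands in for the low-capacity model, the backbone probabilities are hard-coded, and the attention logits `rule_score` and `backbone_score` are fixed constants where a real system would learn them. The self-training step is not shown.

```python
import math
from collections import Counter, defaultdict

def induce_keyword_rules(labeled, min_count=2, min_precision=1.0):
    """Keep 'token -> label' rules for tokens that reliably co-occur with one label.

    A count-based stand-in for a low-capacity model (an assumption of this sketch).
    """
    token_label_counts = defaultdict(Counter)
    for text, label in labeled:
        for token in set(text.lower().split()):
            token_label_counts[token][label] += 1
    rules = {}
    for token, counts in token_label_counts.items():
        label, hits = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_count and hits / total >= min_precision:
            rules[token] = label
    return rules

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine(text, backbone_probs, rules, labels, rule_score=1.0, backbone_score=0.0):
    """Attention-weighted mix of the backbone distribution and one-hot rule votes.

    rule_score and backbone_score are hypothetical fixed logits; rules that do
    not fire on the text abstain and take no part in the mix.
    """
    fired = [rules[tok] for tok in set(text.lower().split()) if tok in rules]
    votes = [[1.0 if lab == vote else 0.0 for lab in labels] for vote in fired]
    weights = softmax([backbone_score] + [rule_score] * len(votes))
    sources = [backbone_probs] + votes
    return [sum(w * src[i] for w, src in zip(weights, sources)) for i in range(len(labels))]

labeled = [
    ("refund my order", "complaint"),
    ("refund please", "complaint"),
    ("thanks for the help", "praise"),
    ("thanks a lot", "praise"),
]
rules = induce_keyword_rules(labeled)
labels = ["complaint", "praise"]
# Made-up backbone output that slightly prefers "praise".
mixed = combine("i want a refund", [0.4, 0.6], rules, labels)
print(rules, mixed)
```

In this toy run the `refund` rule fires and pulls the mixed distribution toward `complaint`, and the fired rule can be reported alongside the prediction, which is the kind of interpretability gain the abstract describes.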

5. [CV] Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion

S Choudhury, L Karazija, I Laina, A Vedaldi, C Rupprecht

[University of Oxford]

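Guess what moves: unsupervised video and image segmentation by anticipating motion. Motion, measured via optical flow, provides a powerful cue for discovering and learning objects in images and videos. Compared with appearance, however, it has blind spots; for example, an object that does not move becomes invisible. The paper proposes an approach that combines the strengths of motion-based and appearance-based segmentation, exploiting the synergy between motion and objects in video to segment visual objects without supervision. Motion anticipation serves as the learning signal: an image segmentation network is trained to predict regions that are likely to contain simple optical-flow patterns, since such regions have a high chance of corresponding to objects. The model can also be used for video object segmentation through test-time training, because the unsupervised loss informs the image model about motion. The network is applied in two modes. In the unsupervised video segmentation mode, it is trained on a collection of unlabeled videos, and the learning process itself acts as the algorithm that segments them. In the unsupervised image segmentation mode, the network is learned from videos and then applied to segment independent still images. This yields strong empirical results in unsupervised video and image segmentation, significantly outperforming the state of the art on benchmarks such as DAVIS, sometimes with a 5% IoU gap.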

Motion, measured via optical flow, provides a powerful cue to discover and learn objects in images and videos. However, compared to using appearance, it has some blind spots, such as the fact that objects become invisible if they do not move. In this work, we propose an approach that combines the strengths of motion-based and appearance-based segmentation. We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns, and thus likely to correspond to objects. We apply this network in two modes. In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos. In the unsupervised image segmentation mode, the network is learned using videos and applied to segment independent still images. With this, we obtain strong empirical results in unsupervised video and image segmentation, significantly outperforming the state of the art on benchmarks such as DAVIS, sometimes with a 5% IoU gap.

https://arxiv.org/abs/2205.07844
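The notion of a region that "contains a simple motion pattern" can be made concrete with a small scoring sketch. It assumes the simplest possible motion model, a single constant translation per region, and scores a fixed binary mask; the paper's network predicts masks from a single image and may use richer motion models, so the functions and the toy flow field below are illustrative only.

```python
def region_motion_residual(flow, mask):
    """Mean squared deviation of flow vectors from the region's mean flow.

    flow: grid (list of rows) of (u, v) optical-flow vectors.
    mask: same-shaped grid of 0/1 values marking the candidate region.
    A constant-translation motion model is assumed here for brevity.
    """
    vectors = [
        flow[r][c]
        for r in range(len(flow))
        for c in range(len(flow[0]))
        if mask[r][c]
    ]
    if not vectors:
        return 0.0
    mean_u = sum(u for u, _ in vectors) / len(vectors)
    mean_v = sum(v for _, v in vectors) / len(vectors)
    return sum((u - mean_u) ** 2 + (v - mean_v) ** 2 for u, v in vectors) / len(vectors)

def segmentation_cost(flow, mask):
    """Residual of the region plus that of its complement; lower means simpler motion."""
    inverse = [[1 - m for m in row] for row in mask]
    return region_motion_residual(flow, mask) + region_motion_residual(flow, inverse)

# Toy flow field: the right half moves 2 px to the right, the left half is static.
flow = [[(0.0, 0.0), (0.0, 0.0), (2.0, 0.0), (2.0, 0.0)] for _ in range(3)]
good_mask = [[0, 0, 1, 1] for _ in range(3)]  # aligned with the moving half
bad_mask = [[0, 1, 1, 0] for _ in range(3)]   # straddles moving and static pixels
good_cost = segmentation_cost(flow, good_mask)
bad_cost = segmentation_cost(flow, bad_mask)
print(good_cost, bad_cost)
```

A mask aligned with the moving object gives a lower cost than one that straddles the motion boundary, which is the kind of signal a segmentation network could be trained against without any labels.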