LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics

Reposted from 爱可可爱生活

 

1. [CV] Dense Unsupervised Learning for Video Segmentation

N Araslanov, S Schaub-Meyer, S Roth

[TU Darmstadt]

Dense unsupervised learning for video segmentation. This paper proposes a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, the method learns dense feature representations directly in a fully convolutional setting. Relying on uniform grid sampling to extract a set of anchors, the model is trained to discriminate between them at both the inter-video and intra-video level. However, a naive scheme for training such a model leads to a degenerate solution. The paper proposes to prevent this with a simple regularization scheme that accommodates the equivariance of the segmentation task to similarity transformations. The training objective admits an efficient implementation and exhibits fast training convergence. On established VOS benchmarks, the method exceeds the segmentation accuracy of previous work despite using significantly less training data and compute.

We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows us to learn dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train such a model results in a degenerate solution. We propose to prevent this with a simple regularisation scheme, accommodating the equivariance property of the segmentation task to similarity transformations. Our training objective admits efficient implementation and exhibits fast training convergence. On established VOS benchmarks, our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
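A minimal PyTorch sketch of the two ingredients described above: anchors sampled on a uniform grid from a dense, fully convolutional feature map, and a regularizer that encourages predictions to be equivariant to a similarity transform (rotation plus scaling). The function names, the transform, and the loss form are illustrative assumptions, not the authors' code.

import math
import torch
import torch.nn.functional as F

def grid_anchors(feats, stride=8):
    # Sample anchor embeddings on a uniform grid from a dense feature map.
    # feats: (B, C, H, W) fully convolutional features -> (B, N, C) anchors.
    B, C, H, W = feats.shape
    ys = torch.arange(stride // 2, H, stride, device=feats.device)
    xs = torch.arange(stride // 2, W, stride, device=feats.device)
    anchors = feats[:, :, ys][:, :, :, xs]         # (B, C, len(ys), len(xs))
    return anchors.flatten(2).transpose(1, 2)      # (B, N, C)

def equivariance_regularizer(model, frame, angle_deg=10.0, scale=1.1):
    # Predictions on a similarity-transformed frame should match the
    # transformed predictions of the original frame. Assumes `model`
    # returns per-pixel predictions at the input resolution.
    c = scale * math.cos(math.radians(angle_deg))
    s = scale * math.sin(math.radians(angle_deg))
    theta = torch.tensor([[c, -s, 0.0], [s, c, 0.0]], device=frame.device)
    theta = theta.unsqueeze(0).expand(frame.size(0), -1, -1)
    grid = F.affine_grid(theta, list(frame.shape), align_corners=False)
    warped = F.grid_sample(frame, grid, align_corners=False)
    pred_of_warped = model(warped)
    warped_pred = F.grid_sample(model(frame), grid, align_corners=False)
    return F.mse_loss(pred_of_warped, warped_pred)

In the paper's setting the anchors would additionally be contrasted against each other within and across videos; that term is omitted here for brevity.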

https://weibo.com/1402400261/L1ezaaYLh

2. [CV] Masked Autoencoders Are Scalable Vision Learners

K He, X Chen, S Xie, Y Li, P Dollár, R Girshick

[Facebook AI Research (FAIR)]

Scalable vision learning with masked autoencoders. This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. The MAE approach is simple: random patches of the input image are masked and the missing pixels are reconstructed. It rests on two core designs. First, an asymmetric encoder-decoder architecture is developed, in which the encoder operates only on the visible subset of patches (without mask tokens) and a lightweight decoder reconstructs the original image from the latent representation and mask tokens. Second, masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervised task. Coupling these two designs makes it possible to train large models efficiently: training is accelerated (by 3x or more) and accuracy improves. The proposed scalable approach allows learning high-capacity models that generalize well: for example, a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance on downstream tasks surpasses supervised pretraining and shows promising scaling behavior.

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
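The two core designs in the abstract, high-ratio random masking and an asymmetric encoder-decoder in which only visible patches enter the encoder, fit in a short PyTorch sketch. The module below is a toy stand-in (positional embeddings and per-patch normalization are omitted and the sizes are arbitrary); it is not FAIR's released implementation.

import torch
import torch.nn as nn

def random_masking(tokens, mask_ratio=0.75):
    # Keep a random subset of patch tokens; also return the inverse
    # permutation needed to restore the original patch order later.
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)            # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)      # its inverse
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore, n_keep

class TinyMAE(nn.Module):
    # Asymmetric design: the encoder sees only the ~25% visible tokens;
    # a lightweight decoder reconstructs all patches from latents + mask tokens.
    def __init__(self, patch_dim=768, dim=256, dec_dim=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, patch_dim)  # predict raw patch pixels

    def forward(self, patches, mask_ratio=0.75):
        visible, ids_restore, n_keep = random_masking(self.embed(patches), mask_ratio)
        latent = self.encoder(visible)             # encoder runs on visible tokens only
        x = self.to_dec(latent)
        B, N = ids_restore.shape
        x = torch.cat([x, self.mask_token.expand(B, N - n_keep, -1)], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return self.head(self.decoder(x))          # (B, N, patch_dim) reconstructions

The reconstruction loss would then be a mean-squared error on the masked patches, matching the goal of reconstructing the missing pixels.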

https://weibo.com/1402400261/L1eCuaFVf

 

3. [CV] The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos

R Liu, Z Wu, S X. Yu, S Lin

[Microsoft Research Asia & UC Berkeley]

The emergence of objectness: learning zero-shot segmentation from videos. Humans can easily segment moving objects without knowing what they are. That objectness can emerge from continuous visual observation motivates modeling grouping and motion concurrently from unlabeled videos. The premise is that a video contains different views of the same scene related by moving components, and that the correct region segmentation and region flow would allow mutual view synthesis, which can be checked against the data itself without any external supervision. The proposed model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images. It then binds them in a conjoint representation called "segment flow", which pools flow offsets over each region and provides a coarse characterization of moving regions for the entire scene. By training the model to minimize the view-synthesis error based on segment flow, the appearance and motion pathways learn region segmentation and flow estimation automatically, without building them up from low-level edges or optical flow respectively. The model demonstrates a surprising emergence of objectness in the appearance pathway, surpassing prior work on zero-shot object segmentation from an image, moving-object segmentation from a video with unsupervised test-time adaptation, and semantic image segmentation by supervised fine-tuning. This is the first truly end-to-end zero-shot object segmentation from videos. It not only develops generic objectness for segmentation and tracking, but also outperforms prevalent image-based contrastive learning methods without augmentation engineering.

Humans can easily segment moving objects without knowing what they are. That objectness could emerge from continuous visual observations motivates us to model grouping and movement concurrently from unlabeled videos. Our premise is that a video has different views of the same scene related by moving components, and the right region segmentation and region flow would allow mutual view synthesis which can be checked from the data itself without any external supervision. Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images. It then binds them in a conjoint representation called segment flow that pools flow offsets over each region and provides a gross characterization of moving regions for the entire scene. By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively. Our model demonstrates the surprising emergence of objectness in the appearance pathway, surpassing prior works on zero-shot object segmentation from an image, moving object segmentation from a video with unsupervised test-time adaptation, and semantic image segmentation by supervised fine-tuning. Our work is the first truly end-to-end zero-shot object segmentation from videos. It not only develops generic objectness for segmentation and tracking, but also outperforms prevalent image-based contrastive learning methods without augmentation engineering.
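The "segment flow" binding described above, pooling flow offsets over each region to get one coarse motion vector per segment, is compact to write down. The sketch below assumes soft region assignments from the appearance pathway and a dense flow field from the motion pathway; the interface and names are hypothetical.

import torch

def segment_flow(region_probs, flow):
    # region_probs: (B, K, H, W) soft assignment of pixels to K regions.
    # flow:         (B, 2, H, W) dense flow offsets between two frames.
    # Returns per-region flow (B, K, 2) and a piecewise-constant flow map (B, 2, H, W).
    B, K, H, W = region_probs.shape
    w = region_probs.flatten(2)                            # (B, K, HW)
    f = flow.flatten(2).transpose(1, 2)                    # (B, HW, 2)
    mass = w.sum(-1, keepdim=True).clamp_min(1e-6)
    per_region = torch.bmm(w, f) / mass                    # weighted mean flow per region
    coarse = torch.einsum('bkp,bkc->bcp', w, per_region)   # paste region flow back onto pixels
    return per_region, coarse.reshape(B, 2, H, W)

Warping one frame with this piecewise-constant flow and comparing the result to the other frame gives the kind of view-synthesis error the model is trained to minimize.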

https://weibo.com/1402400261/L1eFTsgib

4. [CL] Scaling ASR Improves Zero and Few Shot Learning

A Xiao, W Zheng, G Keren, D Le, F Zhang, C Fuegen, O Kalinli, Y Saraf, A Mohamed

[Facebook AI]

Scaling automatic speech recognition improves zero-shot and few-shot learning. Using 4.5 million hours of English speech from 10 different sources across 120 countries and models with up to 10 billion parameters, this paper explores the frontiers of scale for automatic speech recognition. It proposes data selection techniques that scale training data efficiently by finding the most valuable samples in massive datasets. To scale model size efficiently, various optimizations such as a sparse transducer loss and model sharding are leveraged. By training universal English ASR models with 1-10B parameters, the paper pushes the limits of speech recognition performance across many domains. The models learn powerful speech representations with zero-shot and few-shot capabilities on novel domains and speaking styles, exceeding previous results on multiple in-house and public benchmarks. For speakers with disorders due to brain damage, the best zero-shot and few-shot models achieve 22% and 60% relative improvement, respectively, on the AphasiaBank test set, while also achieving the best performance on public social media videos. Moreover, the same universal model reaches equivalent performance on the SPGISpeech financial-domain dataset with 500x less in-domain data.

With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such as sparse transducer loss and model sharding. By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains. Furthermore, our models learn powerful speech representations with zero and few-shot capabilities on novel domains and styles of speech, exceeding previous results across multiple in-house and public benchmarks. For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively, while realizing the best performance on public social media videos. Furthermore, the same universal model reaches equivalent performance with 500x less in-domain data on the SPGISpeech financial-domain dataset.
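The abstract does not spell out the selection criterion, so the snippet below is only a generic illustration of score-based data selection (keep the samples a scoring function rates as most valuable, for example by a seed model's uncertainty); select_top_k, score_fn, and the usage line are hypothetical, not the paper's documented recipe.

import heapq

def select_top_k(candidates, score_fn, k):
    # Generic data selection: keep the k samples the scoring function deems
    # most valuable (e.g. highest seed-model loss or lowest confidence).
    return heapq.nlargest(k, candidates, key=score_fn)

# hypothetical usage:
# selected = select_top_k(utterances, lambda u: 1.0 - seed_model_confidence(u), k=1_000_000)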

https://weibo.com/1402400261/L1eJgCtlF

5. [LG] Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters

X Lian, B Yuan, X Zhu...

[Kwai Inc., USA & Kuaishou Technology & ETH Zürich]

Persia: a hybrid system that scales deep-learning-based recommenders up to 100 trillion parameters. Deep-learning-based models dominate the current landscape of production recommender systems. Moreover, recent years have seen exponential growth in model scale, from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. Each jump in model capacity has brought a significant quality boost, which suggests that the era of 100 trillion parameters is around the corner. However, training such models is challenging even inside industrial-scale data centers. The difficulty stems from the staggering heterogeneity of the training computation: the model's embedding layer can account for more than 99.99% of the total model size and is extremely memory-intensive, while the rest of the neural network is increasingly computation-intensive. Supporting the training of such huge models urgently requires an efficient distributed training system. This paper addresses the challenge through careful co-design of the optimization algorithm and the distributed system architecture. To ensure both training efficiency and training accuracy, a novel hybrid training algorithm is designed, in which the embedding layer and the dense neural network are handled by different synchronization mechanisms; a system named Persia (parallel recommendation training system with hybrid acceleration) is then built to support this hybrid training algorithm. Theoretical analysis and empirical studies at up to 100 trillion parameters are conducted to justify Persia's system design and implementation.

Deep learning based models have dominated the current landscape of production recommender systems. Furthermore, recent years have witnessed an exponential growth of the model scale—from Google’s 2016 model with 1 billion parameters to the latest Facebook’s model with 12 trillion parameters. Significant quality boost has come with each jump of the model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, the training of such models is challenging even within industrial scale data centers. This difficulty is inherited from the staggering heterogeneity of the training computation—the model’s embedding layer could include more than 99.99% of the total model size, which is extremely memory-intensive; while the rest of the neural network is increasingly computation-intensive. To support the training of such huge models, an efficient distributed training system is in urgent need. In this paper, we resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Specifically, in order to ensure both the training efficiency and the training accuracy, we design a novel hybrid training algorithm, where the embedding layer and the dense neural network are handled by different synchronization mechanisms; then we build a system called Persia (short for parallel recommendation training system with hybrid acceleration) to support this hybrid training algorithm. Both theoretical demonstration and empirical study up to 100 trillion parameters have been conducted to justify the system design and implementation of Persia. We make Persia publicly available (at https://github.com/PersiaML/Persia) so that anyone can easily train a recommender model at the scale of 100 trillion parameters.
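The hybrid synchronization idea, locally applied updates for the memory-heavy embedding layer and synchronous all-reduced gradients for the compute-heavy dense network, can be outlined as a single training step. This is only an illustrative sketch of that split under an assumed training-loop interface; the actual implementation is the Persia code linked above.

import torch
import torch.distributed as dist

def hybrid_step(embedding, dense_net, batch, emb_opt, dense_opt, loss_fn):
    # ids: sparse feature ids; dense_features: dense inputs; labels: targets.
    ids, dense_features, labels = batch

    # Sparse part: look up only the embedding rows this batch touches; their
    # gradients stay on the local shard and are applied without global sync.
    emb = embedding(ids).flatten(1)

    out = dense_net(torch.cat([emb, dense_features], dim=-1))
    loss = loss_fn(out, labels)
    loss.backward()

    # Dense part: synchronous data parallelism via gradient all-reduce,
    # so every worker applies the same update to the dense network.
    world = dist.get_world_size()
    for p in dense_net.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world

    emb_opt.step()      # e.g. a sparse optimizer over the local embedding shard
    dense_opt.step()
    emb_opt.zero_grad()
    dense_opt.zero_grad()
    return loss.item()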

https://weibo.com/1402400261/L1eN0Bgwf

Several other papers worth noting:

 

[CV] Unsupervised Part Discovery from Contrastive Reconstruction

S Choudhury, I Laina, C Rupprecht, A Vedaldi

[University of Oxford]

https://weibo.com/1402400261/L1ePGnD2Q

[CV] Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis

T Y. Wang, D Ceylan, K K Singh, N J. Mitra

[Adobe Research]

https://weibo.com/1402400261/L1eSldsgd

 

[LG] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training

C Karakus, R Huilgol, F Wu, A Subramanian, C Daniel, D Cavdar, T Xu, H Chen, A Rahnama, L Quintela

[Amazon.com]

https://weibo.com/1402400261/L1eUiqTlo

 

[CV] The Curious Layperson: Fine-Grained Image Recognition without Expert Labels

S Choudhury, I Laina, C Rupprecht, A Vedaldi

[University of Oxford]

https://weibo.com/1402400261/L1eW64o1X
