LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics
Reposted from 爱可可爱生活
1、[CL] XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
A Babu, C Wang, A Tjandra, K Lakhotia, Q Xu, N Goyal, K Singh, P v Platen, Y Saraf, J Pino, A Baevski, A Conneau, M Auli
[Meta AI & Google AI & Outreach & Hugging Face]
XLS-R: self-supervised cross-lingual speech representation learning at scale. This paper presents XLS-R, a large-scale cross-lingual speech representation learning model based on wav2vec 2.0. Models with up to 2B parameters are trained on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. The evaluation covers a wide range of tasks, domains, data regimes, and languages, both high- and low-resource. On the CoVoST-2 speech translation benchmark, XLS-R improves the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, it improves over the best known prior work on BABEL, MLS, CommonVoice, and VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art for language identification on VoxLingua107. With sufficient model size, cross-lingual pretraining can match English-only pretraining when translating English speech into other languages, a setting that favors monolingual pretraining. The authors hope XLS-R will help improve speech processing for many more of the world's languages.
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can perform as well as English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world. Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/wav2vec/xlsr.
https://weibo.com/1402400261/L2rRhnSUb
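For readers who want to try the released checkpoints, here is a minimal sketch of extracting XLS-R speech features via the Hugging Face transformers port of the fairseq models. The checkpoint id facebook/wav2vec2-xls-r-300m is the smallest published variant (swap in the 1B/2B models for the scales discussed in the paper); the random waveform stands in for real 16 kHz audio.

```python
# Minimal XLS-R feature-extraction sketch using the Hugging Face port.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model_id = "facebook/wav2vec2-xls-r-300m"  # smallest released XLS-R checkpoint
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id).eval()

# One second of dummy 16 kHz audio in place of a real recording.
waveform = np.random.randn(16000).astype(np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 1024)
print(hidden.shape)
```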
2、[LG] Covariate Shift in High-Dimensional Random Feature Regression
N Tripuraneni, B Adlam, J Pennington
[UC Berkeley & Google Research]
Covariate shift in high-dimensional random feature regression. A significant obstacle to developing robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distribution stays the same. Despite its prevalence in real-world applications, a theoretical understanding in the context of modern machine learning has been lacking. This paper studies the exact high-dimensional asymptotics of random feature regression under covariate shift and gives a precise characterization of the limiting test error, bias, and variance in this setting. The results motivate a natural partial order over covariate shifts that provides a sufficient condition for determining when a shift will harm (or even help) test performance. Overparameterized models exhibit enhanced robustness to covariate shift, and the analysis offers one of the first theoretical explanations for this intriguing phenomenon. It also reveals an exact linear relationship between in-distribution and out-of-distribution generalization performance, explaining this surprising recent empirical observation.
A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same. Despite the prevalence of covariate shift in real-world applications, a theoretical understanding in the context of modern machine learning has remained lacking. In this work, we examine the exact high-dimensional asymptotics of random feature regression under covariate shift and present a precise characterization of the limiting test error, bias, and variance in this setting. Our results motivate a natural partial order over covariate shifts that provides a sufficient condition for determining when the shift will harm (or even help) test performance. We find that overparameterized models exhibit enhanced robustness to covariate shift, providing one of the first theoretical explanations for this intriguing phenomenon. Additionally, our analysis reveals an exact linear relationship between in-distribution and out-of-distribution generalization performance, offering an explanation for this surprising recent empirical observation.
https://weibo.com/1402400261/L2rWsATad
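The setting is easy to simulate. The toy numpy sketch below (our own construction, not the paper's asymptotic analysis) trains random feature ridge regression on one Gaussian input distribution and evaluates it both in-distribution and under a simple covariance rescaling, sweeping widths from under- to over-parameterized; all sizes and the shift itself are arbitrary demo choices.

```python
# Toy demo: random feature ridge regression under a covariate (scale) shift.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, ridge = 50, 500, 2000, 1e-3
beta = rng.standard_normal(d) / np.sqrt(d)          # linear teacher

def sample(n, scale):
    X = rng.standard_normal((n, d)) * scale          # isotropic covariate scale
    return X, X @ beta + 0.1 * rng.standard_normal(n)

def rf_errors(width):
    W = rng.standard_normal((d, width)) / np.sqrt(d) # fixed random features
    Xtr, ytr = sample(n_train, 1.0)
    phi = np.maximum(Xtr @ W, 0.0)                   # ReLU random features
    a = np.linalg.solve(phi.T @ phi + ridge * np.eye(width), phi.T @ ytr)
    errs = []
    for scale in (1.0, 1.5):                         # in-dist vs shifted test set
        Xte, yte = sample(n_test, scale)
        errs.append(np.mean((np.maximum(Xte @ W, 0.0) @ a - yte) ** 2))
    return errs

for width in (50, 200, 1000, 4000):                  # under- to over-parameterized
    in_err, out_err = rf_errors(width)
    print(f"width={width:5d}  in-dist={in_err:.4f}  shifted={out_err:.4f}")
```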
3、[CL] N-grammer: Augmenting Transformers with latent n-grams
2021
N-grammer: augmenting Transformers with latent n-grams. Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct there has been significant recent interest and investment in scaling them. However, the training and inference costs of these large Transformer language models are prohibitive, motivating research into more efficient variants. Inspired by the statistical language modeling literature, this paper proposes a simple yet effective modification to the Transformer architecture: augmenting the model with n-grams constructed from a discrete latent representation of the text sequence. Evaluated on language modeling on the C4 dataset, the proposed N-grammer model outperforms several strong baselines such as the Transformer and the Primer.
Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct, there has been significant recent interest and investment in scaling these models. However, the training and inference costs of these large Transformer language models are prohibitive, thus necessitating more research in identifying more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-grammer, on language modeling on the C4 dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer. We will open-source our model for reproducibility purposes upon acceptance.
https://weibo.com/1402400261/L2s0T5Dbq
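The core idea as stated in the abstract — n-grams built from discrete latent ids, embedded and fused with token embeddings — can be sketched in a few lines of PyTorch. The hash, vocabulary sizes, and concatenation-based fusion below are our own illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch of n-gram augmentation over discrete latent ids.
import torch
import torch.nn as nn

class BigramAugmenter(nn.Module):
    def __init__(self, ngram_vocab=2**17, d_model=256, d_ngram=32):
        super().__init__()
        self.ngram_vocab = ngram_vocab
        self.ngram_emb = nn.Embedding(ngram_vocab, d_ngram)
        self.proj = nn.Linear(d_model + d_ngram, d_model)

    def forward(self, token_emb, latent_ids):
        # latent_ids: (batch, seq) discrete latent codes, one per position.
        prev = torch.roll(latent_ids, shifts=1, dims=1)
        prev[:, 0] = 0                                # pad the first position
        # Cheap multiplicative hash of each (prev, current) id pair.
        bigram = (prev * 1000003 + latent_ids) % self.ngram_vocab
        fused = torch.cat([token_emb, self.ngram_emb(bigram)], dim=-1)
        return self.proj(fused)                       # project back to d_model

aug = BigramAugmenter()
emb = torch.randn(2, 16, 256)                         # token embeddings
ids = torch.randint(0, 1024, (2, 16))                 # discrete latent ids
print(aug(emb, ids).shape)                            # torch.Size([2, 16, 256])
```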
4、[CL] Transparent Human Evaluation for Image Captioning
J Kasai, K Sakaguchi, L Dunagan, J Morrison, R L Bras, Y Choi, N A. Smith
[University of Washington & Allen Institute for AI]
Transparent human evaluation for image captioning. This paper establishes a rubric-based human evaluation protocol for image captioning models. The scoring rubrics and their definitions are carefully developed from machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects of text quality (fluency, conciseness, and inclusive language). The evaluation reveals several critical problems with current practice: human-generated captions are of substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while all automatic metrics say the opposite. The rubric-based results show that CLIPScore, a recent metric that uses image features, correlates better with human judgments than conventional text-only metrics because it is more sensitive to recall. The authors hope this work promotes a more transparent evaluation protocol for image captioning and its automatic metrics.
We establish a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while all automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.
https://weibo.com/1402400261/L2s4gnxDm
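The kind of check behind "correlates better with human judgments" is a rank correlation between metric scores and rubric totals over the same captions. A minimal sketch with made-up stand-in numbers (not the paper's data):

```python
# Sketch of comparing metric-human rank correlation with made-up scores.
from scipy.stats import kendalltau

human  = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]          # rubric totals per caption
clip_s = [0.80, 0.45, 0.66, 0.88, 0.40, 0.74]    # e.g. a CLIPScore-like metric
bleu   = [0.30, 0.35, 0.20, 0.50, 0.35, 0.25]    # e.g. a text-only metric

for name, metric in [("image-feature metric", clip_s), ("text-only metric", bleu)]:
    tau, p = kendalltau(human, metric)
    print(f"{name}: Kendall tau = {tau:.2f} (p = {p:.2f})")
```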
5、[CV] Sparse Steerable Convolutions: An Efficient Learning of SE(3)-Equivariant Features for Estimation and Tracking of Object Poses in 3D Space
J Lin, H Li, K Chen, J Lu, K Jia
[South China University of Technology & SmartMore Technology Co. Ltd]
Sparse steerable convolutions: efficient learning of SE(3)-equivariant features for estimating and tracking object poses in 3D space. As a basic component of SE(3)-equivariant deep feature learning, steerable convolution has recently demonstrated its advantages for 3D semantic analysis. Those advantages, however, come at the cost of expensive computation on dense volumetric data, which prevents practical use for efficiently processing 3D data that are inherently sparse. This paper proposes a novel Sparse Steerable Convolution (SS-Conv) to address this shortcoming: SS-Conv greatly accelerates steerable convolution with sparse tensors while strictly preserving SE(3)-equivariance. Built on SS-Conv, the paper presents a general pipeline for precise object pose estimation, in which a key design is a feature-steering module that takes full advantage of SE(3)-equivariance and enables efficient pose refinement. To verify the designs, thorough experiments are conducted on three tasks of 3D object semantic analysis: instance-level 6D pose estimation, category-level 6D pose and size estimation, and category-level 6D pose tracking. The proposed SS-Conv-based pipeline outperforms existing methods on almost all metrics across the three tasks, and ablation studies show the superiority of SS-Conv over alternative convolutions in both accuracy and efficiency.
As a basic component of SE(3)-equivariant deep feature learning, steerable convolution has recently demonstrated its advantages for 3D semantic analysis. The advantages are, however, brought by expensive computations on dense, volumetric data, which prevent its practical use for efficient processing of 3D data that are inherently sparse. In this paper, we propose a novel design of Sparse Steerable Convolution (SS-Conv) to address the shortcoming; SS-Conv greatly accelerates steerable convolution with sparse tensors, while strictly preserving the property of SE(3)-equivariance. Based on SS-Conv, we propose a general pipeline for precise estimation of object poses, wherein a key design is a Feature-Steering module that takes the full advantage of SE(3)-equivariance and is able to conduct an efficient pose refinement. To verify our designs, we conduct thorough experiments on three tasks of 3D object semantic analysis, including instance-level 6D pose estimation, category-level 6D pose and size estimation, and category-level 6D pose tracking. Our proposed pipeline based on SS-Conv outperforms existing methods on almost all the metrics evaluated by the three tasks. Ablation studies also show the superiority of our SS-Conv over alternative convolutions in terms of both accuracy and efficiency. Our code is released publicly at https://github.com/Gorilla-Lab-SCUT/SS-Conv.
https://weibo.com/1402400261/L2s7yu9ZU
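For intuition, "strictly preserving SE(3)-equivariance" means (for vector-valued, type-1 features) that rotating the input point cloud must rotate the output feature by the same rotation, and translating it must leave the feature unchanged. The numpy sketch below checks this property on a hand-rolled toy feature that stands in for a real SS-Conv layer, which we do not reimplement here.

```python
# Numeric equivariance check for a toy translation-invariant vector feature.
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)

def toy_vector_feature(points):
    # Mean offset (from the centroid) of the nearer half of the points:
    # invariant to translation, equivariant to rotation.
    centered = points - points.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1)
    return centered[norms < np.median(norms)].mean(axis=0)

pts = rng.standard_normal((128, 3))
R = Rotation.random(random_state=0).as_matrix()
t = rng.standard_normal(3)

f_then_transform = R @ toy_vector_feature(pts)        # rotate the feature
transform_then_f = toy_vector_feature(pts @ R.T + t)  # transform the input
print(np.allclose(f_then_transform, transform_then_f, atol=1e-8))  # True
```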
A few more papers worth noting:
[CV] Towards Open Vocabulary Object Detection without Human-provided Bounding Boxes
Open-vocabulary object detection without human-annotated bounding boxes
M Gao, C Xing, J C Niebles, J Li, R Xu, W Liu, C Xiong
[Salesforce Research]
https://weibo.com/1402400261/L2saI8ngC
[LG] Enhanced Membership Inference Attacks against Machine Learning Models
Enhanced membership inference attacks against machine learning models
J Ye, A Maddi, S K Murakonda, R Shokri
[National University of Singapore & Privitar]
https://weibo.com/1402400261/L2scbbggn
[CV] Multi-View Motion Synthesis via Applying Rotated Dual-Pixel Blur Kernels
Multi-view motion synthesis by applying rotated dual-pixel blur kernels
A Abuolaim, M Afifi, M S. Brown
[York University]
https://weibo.com/1402400261/L2sgYpWDc
[CL] Time Waits for No One! Analysis and Challenges of Temporal Misalignment
Time waits for no one! Analysis and challenges of temporal misalignment
K Luu, D Khashabi, S Gururangan, K Mandyam, N A. Smith
[University of Washington & Allen Institute for AI]
https://weibo.com/1402400261/L2siPC9uS