LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics
Reposted from 爱可可爱生活
1. [LG] Deconstructing Distributions: A Pointwise Framework of Learning
G Kaplun, N Ghosh, S Garg, B Barak, P Nakkiran
[Harvard & UC Berkeley & CMU & UC San Diego]
Deconstructing distributions: a pointwise (per-sample) framework of learning. In machine learning, the performance of a single model is traditionally evaluated by averaging over a collection of test inputs. This paper proposes a new approach: measuring the performance of a collection of models when evaluated on a single input point. In particular, it studies a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. Profiles can yield new insights into the structure of both models and data, in- and out-of-distribution. For example, the paper shows empirically that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with a strong correlation between pointwise and average performance. On the other hand, there are points with weak or even negative correlation: cases where improving overall model accuracy actually hurts performance on these inputs. These experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, profiles are used to construct a dataset called CIFAR-10-Neg: a subset of CINIC-10 on which, for standard models, accuracy is negatively correlated with accuracy on the CIFAR-10 test set. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line".
In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a single input point. Specifically, we study a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data, in- and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even negative correlation: cases where improving overall model accuracy actually hurts performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-Neg: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-Neg is negatively correlated with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021).
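As a rough illustration of the profile idea, the sketch below computes, for one test point, the correlation between each model's average test accuracy and its correctness on that point, across a collection of models. The helper callables `evaluate_fn` and `predict_fn` are hypothetical placeholders, not part of the paper.

```python
import numpy as np
from scipy.stats import pearsonr

def point_profile(models, test_set, x_point, y_point, evaluate_fn, predict_fn):
    """Pair each model's average accuracy on the test set with its 0/1
    correctness on a single point (x_point, y_point); the correlation of
    these two quantities across models is the point's "profile"."""
    avg_accs, point_correct = [], []
    for model in models:
        avg_accs.append(evaluate_fn(model, test_set))                       # mean test accuracy
        point_correct.append(float(predict_fn(model, x_point) == y_point))  # correct on this point?
    corr, _ = pearsonr(avg_accs, point_correct)  # strongly negative corr ~ a CIFAR-10-Neg-style point
    return np.array(avg_accs), np.array(point_correct), corr
```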
2. [CV] A Self-Supervised Descriptor for Image Copy Detection
E Pizzi, S D Roy, S N Ravindra, P Goyal, M Douze (2022)
A self-supervised descriptor for image copy detection. Image copy detection is an important task for content moderation. This paper introduces SSCD, a model that builds on a recent self-supervised contrastive training objective. The method is adapted to the copy detection task by changing the architecture and training objective, including a pooling operator from the instance-matching literature, and by adapting contrastive learning to augmentations that combine images. The approach relies on an entropy regularization term that promotes consistent separation between descriptor vectors, which is shown to significantly improve copy detection accuracy, and it produces a compact descriptor vector suitable for real-world web-scale applications. Statistical information from a background image distribution can be incorporated into the descriptor. On the recent DISC2021 benchmark, SSCD is shown to outperform both baseline copy detection models and self-supervised architectures designed for image classification by large margins, in all settings.
Image copy detection is an important task for content moderation. We introduce SSCD, a model that builds on a recent self-supervised contrastive training objective. We adapt this method to the copy detection task by changing the architecture and training objective, including a pooling operator from the instance matching literature, and adapting contrastive learning to augmentations that combine images. Our approach relies on an entropy regularization term, promoting consistent separation between descriptor vectors, and we demonstrate that this significantly improves copy detection accuracy. Our method produces a compact descriptor vector, suitable for real-world web-scale applications. Statistical information from a background image distribution can be incorporated into the descriptor. On the recent DISC2021 benchmark, SSCD is shown to outperform both baseline copy detection models and self-supervised architectures designed for image classification by huge margins, in all settings. For example, SSCD outperforms SimCLR descriptors by 48% absolute.
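To make the entropy-regularization idea concrete, here is a minimal PyTorch-style sketch of one plausible form of such a term: penalizing small nearest-neighbor distances within a batch of descriptors so that they spread apart uniformly. This is an illustrative formulation under my own assumptions, not the authors' exact loss.

```python
import torch

def separation_regularizer(descriptors: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative entropy-style regularizer: encourage uniform separation by
    penalizing small distances to each embedding's nearest neighbor in the batch.
    `descriptors` is an (N, D) batch of L2-normalized descriptor vectors."""
    d = torch.cdist(descriptors, descriptors)   # pairwise distances, shape (N, N)
    d.fill_diagonal_(float("inf"))              # ignore self-distances
    nn_dist = d.min(dim=1).values               # distance to nearest neighbor per descriptor
    return -torch.log(nn_dist + eps).mean()     # small when neighbors are far apart
```

In a full training loop, a term of this kind would typically be added to the contrastive loss with a small weight.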
3. [LG] Abstraction for Deep Reinforcement Learning
M Shanahan, M Mitchell
[DeepMind & Santa Fe Institute]
Abstraction for deep reinforcement learning. This paper characterizes the problem of abstraction in the context of deep reinforcement learning. Various well-established approaches to analogical reasoning and associative memory could be brought to bear on this problem, but they present difficulties because of the need for end-to-end differentiability. The paper reviews developments in AI and machine learning that could facilitate their adoption.
We characterise the problem of abstraction in the context of deep reinforcement learning. Various well established approaches to analogical reasoning and associative memory might be brought to bear on this issue, but they present difficulties because of the need for end-to-end differentiability. We review developments in AI and machine learning that could facilitate their adoption.
4. [LG] Auto-scaling Vision Transformers without Training
W Chen, W Huang, X Du, X Song, Z Wang, D Zhou
[University of Texas & University of Technology Sydney & Google]
Auto-scaling Vision Transformers without training. This work targets automated design and scaling of Vision Transformers (ViTs). The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViTs, much heavier than for their convolutional counterparts. To tackle these issues, the paper proposes As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. It first designs a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is enabled by a comprehensive study of ViT network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, the scaling rule for ViTs is automated by growing the widths/depths of different ViT layers, producing a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, a progressive tokenization strategy is proposed to train ViTs faster and more cheaply. As a unified framework, As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7% mAP on COCO) without any manual crafting or scaling of ViT architectures: the end-to-end model design and scaling process costs only 12 hours on one V100 GPU.
This work targets automated designing and scaling of Vision Transformers (ViTs). The motivation comes from two pain spots: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViT that is much heavier than its convolution counterpart. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is fulfilled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, we automate the scaling rule for ViTs by growing widths/depths to different ViT layers. This results in a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and cheaper. As a unified framework, As-ViT achieves strong performance on classification (83.5% top1 on ImageNet-1k) and detection (52.7% mAP on COCO) without any manual crafting nor scaling of ViT architectures: the end-to-end model design and scaling process cost only 12 hours on one V100 GPU. Our code is available at https://github.com/VITA-Group/AsViT.
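The training-free search can be pictured as ranking candidate topologies by a cheap complexity proxy and validating that proxy via Kendall-tau against ground-truth accuracies, as in the sketch below. The `proxy_score` callable stands in for the paper's complexity measures and is purely illustrative.

```python
from scipy.stats import kendalltau

def rank_by_proxy(candidates, proxy_score, true_accs=None):
    """Sort candidate ViT topologies by a training-free proxy score (no gradient
    steps needed). If ground-truth accuracies parallel to `candidates` are given,
    also report the Kendall-tau correlation between proxy scores and accuracies."""
    scores = [proxy_score(c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    ranked = [candidates[i] for i in order]
    tau = None
    if true_accs is not None:
        tau, _ = kendalltau(scores, true_accs)  # high tau: the proxy ranking tracks real accuracy
    return ranked, tau
```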
5. [CL] Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words
Z Feng, D Tang, C Zhou, J Liao, S Wu, X Feng, B Qin, Y Cao, S Shi (2022)
BERT pretraining without wordpieces: learning over a vocabulary of millions of words. Standard BERT adopts subword-based tokenization, which may break a word into two or more wordpieces (e.g., converting "lossless" into "loss" and "less"). This brings inconvenience in the following situations: (1) how to obtain the contextual vector of a word that is split into multiple wordpieces? (2) how to predict a word via a cloze test without knowing the number of wordpieces in advance? This work explores the possibility of developing a BERT-style pretrained model over a vocabulary of whole words rather than wordpieces, i.e., training on a word-based vocabulary without subword units. Such a word-level BERT model is called WordBERT. Models are trained with different vocabulary sizes, initialization configurations, and languages. Results show that, compared with standard wordpiece-based BERT, WordBERT brings significant improvements on cloze tests and machine reading comprehension. On many other natural language understanding tasks, including POS tagging, chunking, and NER, WordBERT consistently performs better than BERT. Model analysis indicates that WordBERT's main advantage over BERT lies in the understanding of low-frequency and rare words. Furthermore, since the pipeline is language-independent, WordBERT was also trained for Chinese and obtains significant gains on five natural language understanding datasets. An analysis of inference speed shows that WordBERT has a time cost comparable to BERT's on natural language understanding tasks.
The standard BERT adopts subword-based tokenization, which may break a word into two or more wordpieces (e.g., converting "lossless" to "loss" and "less"). This brings inconvenience in the following situations: (1) what is the best way to obtain the contextual vector of a word that is divided into multiple wordpieces? (2) how to predict a word via a cloze test without knowing the number of wordpieces in advance? In this work, we explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces. We call such a word-level BERT model WordBERT. We train models with different vocabulary sizes, initialization configurations and languages. Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze tests and machine reading comprehension. On many other natural language understanding tasks, including POS tagging, chunking and NER, WordBERT consistently performs better than BERT. Model analysis indicates that the major advantage of WordBERT over BERT lies in the understanding of low-frequency words and rare words. Furthermore, since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets. Lastly, the analysis of inference speed illustrates that WordBERT has a time cost comparable to BERT's in natural language understanding tasks.
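A minimal sketch of the tokenization difference described above: a word-level vocabulary maps each whole word to a single id (out-of-vocabulary words fall back to [UNK]) instead of splitting words into wordpieces. The vocabulary size, special tokens, and whitespace tokenization here are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter

def build_word_vocab(corpus_tokens, max_size=1_000_000,
                     specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Build a word-level vocabulary of the most frequent whole words."""
    counts = Counter(corpus_tokens)
    vocab = {tok: i for i, tok in enumerate(specials)}
    for word, _ in counts.most_common(max_size - len(specials)):
        vocab.setdefault(word, len(vocab))
    return vocab

def encode(words, vocab):
    """One id per whole word; unknown words map to [UNK] (no wordpiece splitting)."""
    unk = vocab["[UNK]"]
    return [vocab.get(w, unk) for w in words]

# e.g. encode(["lossless", "compression"], vocab) yields one id per word,
# rather than wordpieces such as ["loss", "##less"].
```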
A few other papers worth noting:
[LG] Information Decomposition Diagrams Applied beyond Shannon Entropy: A Generalization of Hu's Theorem
L Lang, P Baudot, R Quax, P Forré
[University of Amsterdam & Median Technologies]
[AS] Wavebender GAN: An architecture for phonetically meaningful speech manipulation
G T D Beck, U Wennberg, Z Malisz, G E Henter
[KTH Royal Institute of Technology]
[LG] Path of Destruction: Learning an Iterative Level Generator Using a Small Dataset
M Siper, A Khalifa, J Togelius
[New York University & University of Malta]
[LG] RECOVER: sequential model optimization platform for combination drug repurposing identifies novel synergistic compounds in vitro
P Bertin, J Rector-Brooks, D Sharma...
[Mila & Relation Therapeutics & The Scripps Research Institute...]