LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics
Reposted from 爱可可爱生活
1. [CV] How Do Vision Transformers Work?
N Park, S Kim
[Yonsei University]
How do Vision Transformers work? That multi-head self-attention (MSA) succeeds in computer vision is by now indisputable, yet little is understood about how MSA actually works. This paper offers fundamental explanations of the nature of MSA, demonstrating the following properties of MSA and Vision Transformers (ViT): (1) MSA improves not only accuracy but also generalization, by flattening the loss landscape; the improvement is primarily attributable to its data specificity rather than to long-range dependency. ViTs, on the other hand, suffer from non-convex losses, which large datasets and loss-landscape smoothing can alleviate. (2) MSA and convolutional networks exhibit opposite behaviors: MSAs are low-pass filters, while Convs are high-pass filters, so the two are complementary. (3) Multi-stage neural networks behave like a series connection of small individual models, and the MSA at the end of a stage plays a key role in prediction. Building on these insights, the paper proposes AlterNet, a model in which the Conv blocks at the end of each stage are replaced with MSA blocks; AlterNet outperforms CNNs not only in large-data regimes but also in small-data regimes.
The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction. Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes. The code is available at this https URL.
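To make the AlterNet recipe in (3) concrete, here is a minimal PyTorch sketch of a multi-stage network whose last block in each stage is a self-attention block. The block designs, depths, and dimensions are illustrative assumptions for this digest, not the paper's exact architecture.

```python
# Minimal sketch of the AlterNet idea: a multi-stage CNN in which the last
# Conv block of each stage is swapped for a multi-head self-attention block.
# Block designs here are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Plain residual 3x3 conv block (stand-in for a ResNet block)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU())

    def forward(self, x):
        return x + self.net(x)

class MSABlock(nn.Module):
    """Self-attention over spatial positions (acts as a low-pass filter)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        n = self.norm(t)
        t = t + self.attn(n, n, n)[0]     # residual attention update
        return t.transpose(1, 2).reshape(b, c, h, w)

def make_stage(dim, depth):
    # depth-1 Conv blocks followed by one MSA block at the end of the stage,
    # where the paper finds attention contributes most to prediction.
    return nn.Sequential(*[ConvBlock(dim) for _ in range(depth - 1)], MSABlock(dim))

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    make_stage(64, depth=3),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),  # downsample between stages
    make_stage(128, depth=3),
)
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 128, 16, 16])
```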
2. [LG] Perspectives in machine learning for wildlife conservation | Nature Communications
D Tuia, B Kellenberger, S Beery...
[EPFL & Caltech & Max Planck Institute of Animal Behavior & University of Konstanz...]
Perspectives in machine learning for wildlife conservation. Inexpensive, accessible sensors are accelerating data acquisition in animal ecology. These technologies hold great potential for large-scale ecological understanding, but are limited by current processing approaches, which distill data into relevant information inefficiently. Animal ecologists can capitalize on the large datasets generated by modern sensors by combining machine learning methods with domain knowledge. Incorporating machine learning into ecological workflows could improve the inputs to ecological models and enable integrated hybrid modeling tools. This approach requires close interdisciplinary collaboration, both to ensure the quality of the new methods and to train a new generation of data scientists in ecology and conservation. The paper presents a series of success stories at the intersection of machine learning and animal ecology, and highlights the performance gains observed when adopting solutions built on machine learning and a new generation of sensors. Though often spectacular, these gains demand closer collaboration between ecologists and machine learning experts, since recent methods are more complex than ever and require rigorous quality control and detailed design knowledge. Thanks to commercial efforts (e.g. Wildlife Insights) and research efforts (AIDE, MegaDetector, DeepLabCut), state-of-the-art machine learning already has fairly mature applications in animal ecology, but there remains ample room (and need) for genuinely new concepts driven by interdisciplinary research, particularly around hybrid models and scaling up new habitat-distribution models. Computer scientists have yet to incorporate ecological knowledge, such as underlying biological processes, into machine learning models, and the opacity of current deep learning models has so far been the main obstacle to integrating machine learning into ecological research.
Inexpensive and accessible sensors are accelerating data acquisition in animal ecology. These technologies hold great potential for large-scale ecological understanding, but are limited by current processing approaches which inefficiently distill data into relevant information. We argue that animal ecologists can capitalize on large datasets generated by modern sensors by combining machine learning approaches with domain knowledge. Incorporating machine learning into ecological workflows could improve inputs for ecological models and lead to integrated hybrid modeling tools. This approach will require close interdisciplinary collaboration to ensure the quality of novel approaches and train a new generation of data scientists in ecology and conservation.
3. [CL] Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
S Gehrmann, E Clark, T Sellam
[Google Research]
Repairing the cracked foundation: a survey of obstacles in evaluation practices for generated text. Evaluation practices in natural language generation (NLG) have many known flaws, yet improved evaluation methods are rarely widely adopted. The problem has become more urgent now that neural NLG models have improved to the point where their outputs can often no longer be distinguished by the surface-level features older metrics rely on. This paper surveys the issues, raised over the past 20 years, with human and automatic model evaluation and with the datasets commonly used in NLG; it summarizes, categorizes, and discusses how researchers have addressed them and what those findings mean for the current state of model evaluation. Building on these insights, it lays out a long-term vision for NLG evaluation and proposes concrete steps researchers can take to improve their evaluation processes. Finally, it analyzes 66 NLG papers from recent NLP conferences to see how well they already follow these recommendations, and identifies the areas that need more drastic departures from the status quo.
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences in how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.
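A toy illustration of the surface-level-features problem (a crude unigram-precision stand-in written for this digest, not any metric from the survey): word-overlap scores cannot tell a scrambled sentence from the reference, yet they punish a faithful paraphrase.

```python
# Unigram precision against a single reference: counts word overlap only,
# so it is blind to word order and to meaning-preserving rephrasing.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / max(1, sum(cand.values()))

reference = "the cat sat on the mat"
print(unigram_precision("the cat sat on the mat", reference))     # 1.0
print(unigram_precision("on the mat sat the cat", reference))     # 1.0 despite scrambling
print(unigram_precision("a cat was sitting on a mat", reference)) # ~0.43 despite same meaning
```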
4. [CL] Compositionality as Lexical Symmetry
E Akyürek, J Andreas
[MIT]
Compositionality as lexical symmetry: lexicon-based data augmentation from a symmetry constraint on data distributions. Standard deep network models lack the inductive biases needed to generalize compositionally in tasks such as semantic parsing, translation, and question answering. A large body of NLP work tries to overcome this limitation with new model architectures that enforce a compositional process of sentence interpretation. This paper instead proposes a domain-general framework for compositional modeling: a new lexicon-based data augmentation method derived from a reformulation of the principle of compositionality as a constraint on the symmetries of data distributions, together with a procedure that automatically identifies these symmetries via token-level permutations, improving the compositional generalization of neural models across several domains. For any task that can be factored into a lexicon and a composition function, there exists a family of data transformation functions that, applied to training data, are guaranteed to produce new, well-formed examples. Such transformations can be identified even when the composition function is unknown (e.g. when one does not know how to write or infer a symbolic grammar). Using these transformations for data augmentation with ordinary RNN and Transformer sequence models yields state-of-the-art results on the CLEVR-CoGenT visual question answering dataset and results comparable to specialized architectures on the COGS semantic parsing dataset. The results underscore that many of the inductive biases targeted by specialized NLP models can instead be expressed, often more flexibly, as assumptions about the structure of the dataset being modeled.
Standard deep network models lack the inductive biases needed to generalize compositionally in tasks like semantic parsing, translation, and question answering. A large body of work in natural language processing seeks to overcome this limitation with new model architectures that enforce a compositional process of sentence interpretation. In this paper, we present a domain-general framework for compositional modeling that instead formulates compositionality as a constraint on data distributions. We prove that for any task factorizable into a lexicon and a composition function, there exists a family of data transformation functions that are guaranteed to produce new, well-formed examples when applied to training data. We further show that it is possible to identify these data transformations even when the composition function is unknown (e.g. when we do not know how to write or infer a symbolic grammar). Using these transformation functions to perform data augmentation for ordinary RNN and transformer sequence models, we obtain state-of-the-art results on the CLEVR-CoGenT visual question answering dataset, and results comparable to specialized model architectures on the COGS semantic parsing dataset.
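The core mechanism is simple to illustrate: a permutation of lexicon entries, applied consistently to both sides of a training pair, yields a new well-formed pair. The two-entry lexicon and example below are hypothetical; the paper's contribution is identifying such symmetries automatically rather than by hand.

```python
# Minimal sketch of augmentation via a lexical symmetry: swapping two lexicon
# entries consistently on the input and output sides of a training pair
# produces a new well-formed pair. Lexicon and examples are hypothetical.
def swap_tokens(tokens, a, b):
    table = {a: b, b: a}
    return [table.get(t, t) for t in tokens]

def augment(pair, a, b, out_lexicon):
    """Apply the permutation (a <-> b) to the input tokens and the
    corresponding permutation to the output-side symbols."""
    src, tgt = pair
    return (swap_tokens(src, a, b),
            swap_tokens(tgt, out_lexicon[a], out_lexicon[b]))

# Hypothetical semantic-parsing pair and lexicon (word -> logical symbol).
lexicon = {"cat": "CAT", "dog": "DOG"}
pair = ("the cat slept".split(), "sleep ( CAT )".split())
print(augment(pair, "cat", "dog", lexicon))
# (['the', 'dog', 'slept'], ['sleep', '(', 'DOG', ')'])
```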
5. [CL] A Contrastive Framework for Neural Text Generation
Y Su, T Lan, Y Wang, D Yogatama, L Kong, N Collier
[University of Cambridge & Tencent AI Lab & DeepMind & The University of Hong Kong]
A contrastive framework for neural text generation. Text generation matters for many natural language processing applications, yet maximization-based decoding methods for neural language models (e.g. beam search) often lead to degenerate solutions: the generated text is unnatural and full of undesirable repetition. Existing approaches introduce stochasticity via sampling, or modify the training objective to decrease the probabilities of certain tokens (e.g. unlikelihood training), but they often yield text that lacks coherence. This paper shows that an underlying cause of model degeneration is the anisotropic distribution of token representations, and proposes a contrastive solution: (i) SimCTG, a contrastive training objective that calibrates the model's representation space, and (ii) a decoding method, contrastive search, that encourages diversity while keeping the generated text coherent. Extensive experiments and analyses on three benchmarks across two languages show, under both automatic and human evaluation, that the approach greatly reduces model degeneration and clearly outperforms current state-of-the-art text generation methods.
Text generation is of great importance to many natural language processing applications. However, maximization-based decoding methods (e.g. beam search) of neural language models often lead to degenerate solutions—the generated text is unnatural and contains undesirable repetitions. Existing approaches introduce stochasticity via sampling or modify training objectives to decrease probabilities of certain tokens (e.g., unlikelihood training). However, they often lead to solutions that lack coherence. In this work, we show that an underlying reason for model degeneration is the anisotropic distribution of token representations. We present a contrastive solution: (i) SimCTG, a contrastive training objective to calibrate the model’s representation space, and (ii) a decoding method—contrastive search—to encourage diversity while maintaining coherence in the generated text. Extensive experiments and analyses on three benchmarks from two languages demonstrate that our proposed approach outperforms state-of-the-art text generation methods as evaluated by both human and automatic metrics.
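The scoring rule behind contrastive search is compact: at each step, every candidate v in the top-k of the model distribution is scored as (1 − α) · p(v | context) − α · max_j sim(h_v, h_j), where the second term (the degeneration penalty) penalizes candidates whose representation is too similar to that of any previously generated token. Below is a schematic one-step sketch with stubbed model outputs; the stub tensors and parameter values are illustrative, and a real implementation would query a language model.

```python
# Schematic sketch of one contrastive search decoding step. Each top-k
# candidate is scored by (1 - alpha) * model confidence minus alpha * its
# maximum cosine similarity to previous tokens' hidden states.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_search_step(probs, cand_states, prev_states, k=5, alpha=0.6):
    top_k = np.argsort(probs)[-k:]  # k most probable next tokens

    def score(v):
        penalty = max(cosine(cand_states[v], h) for h in prev_states)
        return (1 - alpha) * probs[v] - alpha * penalty

    return max(top_k, key=score)

rng = np.random.default_rng(0)
vocab, dim = 100, 16
probs = rng.dirichlet(np.ones(vocab))          # stub next-token distribution
cand_states = rng.normal(size=(vocab, dim))    # stub candidate representations
prev_states = list(rng.normal(size=(4, dim)))  # stub context representations
print("next token id:", contrastive_search_step(probs, cand_states, prev_states))
```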
A few more papers worth noting:
[LG] Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data
Benign overfitting without linearity: neural network classifiers trained by gradient descent for noisy linear data.
S Frei, N S Chatterji, P L Bartlett
[UC Berkeley & Stanford University]
[CL] Deduplicating Training Data Mitigates Privacy Risks in Language Models
Deduplicating training data mitigates privacy risks in language models.
N Kandpal, E Wallace, C Raffel
[UNC Chapel Hill & UC Berkeley]
[CV] Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs
Mixing and shifting: exploiting global and local dependencies in vision MLPs.
H Zheng, P He, W Chen, M Zhou
[The University of Texas at Austin & Microsoft Azure AI]
[CL] Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments
Scaling laws under the microscope: predicting Transformer performance from small-scale experiments.
M Ivgi, Y Carmon, J Berant
[Tel-Aviv University]