LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
This post is reposted from 爱可可爱生活.
1、[CL] NormFormer: Improved Transformer Pretraining with Extra Normalization
S Shleifer, J Weston, M Ott
[Facebook AI Research]
NormFormer: improved Transformer pretraining with extra normalization. During pretraining, Pre-LayerNorm Transformers suffer from a gradient-magnitude mismatch: gradients at early layers are much larger than at later layers. The proposed NormFormer architecture alleviates this by adding three normalization operations to each layer: a LayerNorm after self-attention, head-wise scaling of the self-attention outputs, and a LayerNorm after the first fully connected layer. The extra operations add negligible compute cost (a +0.4% parameter increase) but improve both causal and masked language-model pretraining perplexity and downstream task performance for models from 125 million to 2.7 billion parameters. For example, adding NormFormer on top of the strongest 1.3B-parameter baseline reaches equal perplexity 24% faster, or converges to 0.27 better perplexity within the same compute budget, and reaches GPT3-Large (1.3B) zero-shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. A minimal sketch of the modified layer follows this entry.
During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. These issues can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 Million to 2.7 Billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq.
https://weibo.com/1402400261/KDQnHCmRv
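To make the three extra operations concrete, here is a minimal PyTorch sketch of a NormFormer-style Pre-LN block, based only on the description above: a LayerNorm after self-attention, a learnable per-head scale on the attention outputs, and a LayerNorm after the first fully connected layer. `HeadScaleSelfAttention` and `NormFormerBlock` are illustrative names and a simplified reading of the paper, not the fairseq implementation.

```python
# Minimal NormFormer-style block (sketch, not the reference code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeadScaleSelfAttention(nn.Module):
    """Multi-head self-attention with a learnable per-head output scale."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learnable scale per head, initialized to 1 (extra op #2: head-wise scaling).
        self.head_scale = nn.Parameter(torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        heads = attn @ v                                    # (b, heads, t, d_head)
        heads = heads * self.head_scale.view(1, -1, 1, 1)   # scale each head's output
        return self.out(heads.transpose(1, 2).reshape(b, t, d))


class NormFormerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln_pre_attn = nn.LayerNorm(d_model)   # standard Pre-LN
        self.attn = HeadScaleSelfAttention(d_model, n_heads)
        self.ln_post_attn = nn.LayerNorm(d_model)  # extra op #1: LN after self-attention
        self.ln_pre_ffn = nn.LayerNorm(d_model)    # standard Pre-LN
        self.fc1 = nn.Linear(d_model, d_ff)
        self.ln_mid_ffn = nn.LayerNorm(d_ff)       # extra op #3: LN after first FC layer
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ln_post_attn(self.attn(self.ln_pre_attn(x)))
        h = self.ln_mid_ffn(F.gelu(self.fc1(self.ln_pre_ffn(x))))
        return x + self.fc2(h)
```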
2、[LG] Amortized Tree Generation for Bottom-up Synthesis Planning and Synthesizable Molecular Design
W Gao, R Mercado, C W. Coley
[MIT]
Amortized tree generation for bottom-up synthesis planning and synthesizable molecular design. Molecular design and synthesis planning are two critical steps in molecular discovery; this work proposes to formulate them as a single shared task of conditional synthetic-pathway generation. The amortized approach generates synthetic pathways as a Markov decision process conditioned on a target molecular embedding, which enables bottom-up synthesis planning and the design of synthesizable molecules by decoding from optimized conditional codes, demonstrating the potential to solve the design and synthesis problems simultaneously. Neural networks probabilistically model the synthetic tree one reaction step at a time, following reactivity rules encoded in a discrete action space of reaction templates, and are trained on hundreds of thousands of artificial pathways generated from a pool of purchasable compounds and a list of expert-curated templates. The method is validated by (a) recovering molecules via conditional generation, (b) identifying synthesizable structural analogs, and (c) optimizing molecular structures against oracle functions relevant to drug discovery. A simplified sketch of one decoding step follows this entry.
Molecular design and synthesis planning are two critical steps in the process of molecular discovery that we propose to formulate as a single shared task of conditional synthetic pathway generation. We report an amortized approach to generate synthetic pathways as a Markov decision process conditioned on a target molecular embedding. This approach allows us to conduct synthesis planning in a bottom-up manner and design synthesizable molecules by decoding from optimized conditional codes, demonstrating the potential to solve both problems of design and synthesis simultaneously. The approach leverages neural networks to probabilistically model the synthetic trees, one reaction step at a time, according to reactivity rules encoded in a discrete action space of reaction templates. We train these networks on hundreds of thousands of artificial pathways generated from a pool of purchasable compounds and a list of expert-curated templates. We validate our method with (a) the recovery of molecules using conditional generation, (b) the identification of synthesizable structural analogs, and (c) the optimization of molecular structures given oracle functions relevant to drug discovery.
https://weibo.com/1402400261/KDQrUfYrF
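The conditional decoding can be illustrated with a heavily simplified, hypothetical sketch of a single step: an MLP policy that, given the target molecule's embedding and the state of the partial synthetic tree, predicts the next action and reaction template. All names (`TreeStepPolicy`, `ACTIONS`, the default `n_templates`) are illustrative assumptions, not the authors' code; a real model would operate on molecular fingerprints with trained weights.

```python
# Hypothetical one-step policy for conditional synthetic-tree generation (sketch).
import torch
import torch.nn as nn

ACTIONS = ["add", "expand", "merge", "end"]  # illustrative discrete action space


class TreeStepPolicy(nn.Module):
    def __init__(self, emb_dim: int = 256, n_templates: int = 100, hidden: int = 512):
        super().__init__()
        self.action_head = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, len(ACTIONS))
        )
        self.template_head = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_templates)
        )

    def forward(self, target_emb: torch.Tensor, tree_state: torch.Tensor):
        """Predict the next action and reaction template, conditioned on the target."""
        ctx = torch.cat([target_emb, tree_state], dim=-1)
        return self.action_head(ctx), self.template_head(ctx)


# One greedy decoding step (training would instead use teacher forcing on
# the artificial pathways mentioned in the abstract):
policy = TreeStepPolicy()
target = torch.randn(1, 256)   # embedding of the molecule we want to make
state = torch.zeros(1, 256)    # pooled embedding of the (empty) partial tree
action_logits, template_logits = policy(target, state)
print(ACTIONS[action_logits.argmax(-1).item()])
```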
3、[CL] Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining
A Madsen, N Meade, V Adlakha, S Reddy
[Mila]
Evaluating the faithfulness of importance measures in NLP by recursively masking allegedly important tokens and retraining. To explain NLP models, many methods indicate which input tokens are important for a prediction, but whether these methods accurately reflect the model's logic, a property often called faithfulness, remains an open question. This work adapts and improves ROAR (RemOve And Retrain), a faithfulness benchmark from computer vision recently proposed by Hooker et al. (2019): ROAR is improved by recursively removing dataset redundancies that otherwise interfere with it, and is applied to popular NLP importance measures, namely attention, gradient, and integrated gradients, with mutual information as an additional baseline. Evaluation is carried out on a suite of classification tasks commonly used in the attention-faithfulness literature, and a scalar faithfulness metric is proposed so that results can easily be compared across papers. The findings: importance measures considered unfaithful for computer vision tasks perform favorably on NLP tasks, the faithfulness of an importance measure is task-dependent, and the computational overhead of integrated gradients is rarely justified. A sketch of the recursive mask-and-retrain loop follows this entry.
To explain NLP models, many methods inform which inputs tokens are important for a prediction. However, an open question is if these methods accurately reflect the model’s logic, a property often called faithfulness. In this work, we adapt and improve a recently proposed faithfulness benchmark from computer vision called ROAR (RemOve And Retrain), by Hooker et al. (2019). We improve ROAR by recursively removing dataset redundancies, which otherwise interfere with ROAR. We adapt and apply ROAR, to popular NLP importance measures, namely attention, gradient, and integrated gradients. Additionally, we use mutual information as an additional baseline. Evaluation is done on a suite of classification tasks often used in the faithfulness of attention literature. Finally, we propose a scalar faithfulness metric, which makes it easy to compare results across papers. We find that, importance measures considered to be unfaithful for computer vision tasks perform favorably for NLP tasks, the faithfulness of an importance measure is task-dependent, and the computational overhead of integrated gradient is rarely justified.
https://weibo.com/1402400261/KDQuEoN37
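A rough sketch of the recursive mask-and-retrain loop, assuming hypothetical helpers (`train_model`, `importance_scores`, `evaluate`) that stand in for an arbitrary training pipeline and importance measure; labels and the paper's redundancy handling are omitted for brevity, so this only outlines the evaluation idea.

```python
# Recursive ROAR-style loop (sketch under the stated assumptions).
from typing import Callable, List

Dataset = List[List[str]]  # toy stand-in: token lists; labels omitted for brevity
MASK = "[MASK]"


def mask_top_tokens(sentence: List[str], scores: List[float], k: int) -> List[str]:
    """Replace the k allegedly most important, not-yet-masked tokens with [MASK]."""
    order = sorted(range(len(sentence)), key=lambda i: scores[i], reverse=True)
    to_mask = [i for i in order if sentence[i] != MASK][:k]
    return [MASK if i in to_mask else tok for i, tok in enumerate(sentence)]


def recursive_roar(
    dataset: Dataset,
    train_model: Callable[[Dataset], object],                 # retrains from scratch
    importance_scores: Callable[[object, List[str]], List[float]],
    evaluate: Callable[[object, Dataset], float],
    steps: int = 10,
    k_per_step: int = 1,
) -> List[float]:
    """Recursively mask allegedly important tokens, retrain, and track performance."""
    performance = []
    data = dataset
    for _ in range(steps):
        model = train_model(data)                 # retrain on the partially masked data
        performance.append(evaluate(model, data))
        # Recompute importance with the retrained model, then mask k more tokens.
        data = [mask_top_tokens(x, importance_scores(model, x), k_per_step) for x in data]
    return performance  # a faithful measure should make performance drop quickly
```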
4、[CL] Leveraging Automated Unit Tests for Unsupervised Code Translation
B Roziere, J M. Zhang, F Charton, M Harman, G Synnaeve, G Lample
[Facebook AI Research & University College London]
Leveraging automated unit tests for unsupervised code translation. With little to no parallel data available for programming languages, unsupervised methods are well suited to source-code translation. However, most unsupervised machine translation approaches rely on back-translation, a method developed for natural language that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes: a single token can cause a compilation failure or an erroneous program, whereas small inaccuracies in natural language may not change a sentence's meaning. To address this, the work proposes leveraging an automated unit-testing system to filter out invalid translations and thereby build a fully tested parallel corpus. Fine-tuning an unsupervised model on this filtered dataset substantially reduces the noise in the resulting translations and comfortably outperforms the state of the art on all language pairs studied; in particular, for Java→Python and Python→C++ it beats the previous best methods by more than 16% and 24% respectively, reducing the error rate by more than 35%. A sketch of the test-filtering step follows this entry.
With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java→ Python and Python→ C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
https://weibo.com/1402400261/KDQy64uBd
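The filtering step can be sketched as follows. `translate_candidates` and `run_unit_tests` are hypothetical stand-ins for the unsupervised translation model and the automated test harness, so this is an outline of the idea rather than the paper's pipeline.

```python
# Build a unit-test-verified parallel corpus from an unsupervised translator (sketch).
from typing import Callable, List, Tuple


def build_tested_parallel_corpus(
    source_functions: List[str],
    unit_tests: List[str],
    translate_candidates: Callable[[str, int], List[str]],  # sample N candidate translations
    run_unit_tests: Callable[[str, str], bool],             # True if the candidate passes
    n_candidates: int = 20,
) -> List[Tuple[str, str]]:
    """Return (source, translation) pairs whose translation passes the unit tests."""
    corpus = []
    for src, tests in zip(source_functions, unit_tests):
        for candidate in translate_candidates(src, n_candidates):
            if run_unit_tests(candidate, tests):  # filter out invalid translations
                corpus.append((src, candidate))
                break                             # keep one verified translation (a simplification)
    return corpus
```

The resulting corpus would then be used to fine-tune the unsupervised model, replacing noisy back-translated pairs with test-verified ones.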
5、[LG] Compositional Attention: Disentangling Search and Retrieval
S Mittal, S C Raparthy, I Rish, Y Bengio, G Lajoie
[Mila]
Compositional Attention: disentangling search and retrieval. Multi-head key-value attention is the backbone of the widely successful Transformer model and its variants. The mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search, selecting a relevant entity from a set via query-key interactions, and (2) retrieval, extracting relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. This static pairing can (a) lead to learning redundant parameters on certain tasks and (b) hinder generalization. To alleviate this, the paper proposes Compositional Attention, a new mechanism that replaces the standard head structure: it disentangles search from retrieval and composes them in a dynamic, flexible, context-dependent manner through an additional soft competition stage between query-key combinations and value pairings. A series of numerical experiments shows that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings, and qualitative analysis shows that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. The proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily replace standard attention heads in any network architecture. A simplified sketch follows this entry.
Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.
https://weibo.com/1402400261/KDQAnmoHW
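A simplified PyTorch sketch of the mechanism: S search heads compute attention matrices, R retrieval (value) projections extract features, and a soft competition decides, per search and per position, which retrieval to use. The second-stage retrieval query/key parameterization below is an assumption made for brevity, not the reference implementation.

```python
# Simplified Compositional Attention (sketch, not the reference code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositionalAttention(nn.Module):
    def __init__(self, d_model: int, n_search: int, n_retrieval: int):
        super().__init__()
        self.S, self.R = n_search, n_retrieval
        self.d_head = d_model // n_search
        self.q = nn.Linear(d_model, n_search * self.d_head)
        self.k = nn.Linear(d_model, n_search * self.d_head)
        self.v = nn.Linear(d_model, n_retrieval * self.d_head)
        # Second-stage "retrieval query/key" used for the soft competition.
        self.rq = nn.Linear(d_model, n_search * self.d_head)
        self.rk = nn.Linear(self.d_head, self.d_head)
        self.out = nn.Linear(n_search * self.d_head, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q(x).view(b, t, self.S, self.d_head).transpose(1, 2)   # (b, S, t, d)
        k = self.k(x).view(b, t, self.S, self.d_head).transpose(1, 2)
        v = self.v(x).view(b, t, self.R, self.d_head).transpose(1, 2)   # (b, R, t, d)

        # (1) Search: one attention matrix per search head.
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)  # (b, S, t, t)

        # (2) Retrieval: apply every search to every retrieval's values.
        o = attn.unsqueeze(2) @ v.unsqueeze(1)                          # (b, S, R, t, d)

        # (3) Soft competition: for each search, softly choose among retrievals.
        rq = self.rq(x).view(b, t, self.S, self.d_head).permute(0, 2, 1, 3)  # (b, S, t, d)
        rk = self.rk(o)                                                      # (b, S, R, t, d)
        score = (rq.unsqueeze(2) * rk).sum(-1) / math.sqrt(self.d_head)      # (b, S, R, t)
        w = F.softmax(score, dim=2).unsqueeze(-1)                            # softmax over retrievals
        mixed = (w * o).sum(dim=2)                                           # (b, S, t, d)

        return self.out(mixed.transpose(1, 2).reshape(b, t, self.S * self.d_head))
```

Because search heads (S) and retrieval projections (R) are separate hyperparameters, they can be scaled independently, unlike standard multi-head attention where each head owns exactly one value matrix.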
Other papers worth noting:
[CV] No RL, No Simulation: Learning to Navigate without Navigating
No RL, no simulation: self-supervised learning to navigate without navigation (interaction)
M Hahn, D Chaplot, S Tulsiani, M Mukadam, J M. Rehg, A Gupta
[Georgia Institute of Technology & Facebook AI Research]
https://weibo.com/1402400261/KDQEKxAb1
[LG] LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time
LCS: learning compressible subspaces for adaptive network compression at inference time
E Nunez, M Horton, A Prabhu, A Ranjan, A Farhadi, M Rastegari
[University of California Los Angeles & Apple Inc]
https://weibo.com/1402400261/KDQGYk4Ff
[CL] UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning
UniPELT: a unified framework for parameter-efficient language model tuning
Y Mao, L Mathias, R Hou, A Almahairi, H Ma, J Han, W Yih, M Khabsa
[University of Illinois Urbana-Champaign & Facebook AI]
https://weibo.com/1402400261/KDQIEAg5p
[RO] Offline Meta-Reinforcement Learning for Industrial Insertion
Offline meta-reinforcement learning for industrial (connector) insertion
T Z. Zhao, J Luo, O Sushkov, R Pevceviciute, N Heess, J Scholz, S Schaal, S Levine
[The Moonshot Factory & Intrinsic Innovation LLC & DeepMind & Google Brain]
https://weibo.com/1402400261/KDQKhFVMK