LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
1. [CL] Finetuned Language Models Are Zero-Shot Learners
J Wei, M Bosma, V Y Zhao, K Guu, A W Yu, B Lester, N Du, A M Dai, Q V Le
[Google Research]
Improving the zero-shot learning ability of language models with instruction tuning. This paper explores a simple method for improving the zero-shot learning abilities of language models: instruction tuning, i.e. finetuning a language model on a collection of tasks described via instructions, which substantially improves zero-shot performance on unseen tasks. The authors take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized through natural-language instruction templates, then evaluate the resulting model, dubbed FLAN, on unseen task types. FLAN substantially improves on its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 19 of the 25 tasks evaluated; on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze, it even outperforms few-shot GPT-3 by a large margin. Ablation studies show that the number of tasks and model scale are key to the success of instruction tuning. (A toy sketch of the template format follows this entry.)
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (finetuning language models on a collection of tasks described via instructions) substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 19 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of tasks and model scale are key components to the success of instruction tuning.
https://weibo.com/1402400261/Kx0fDcNt2
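To make "verbalizing a task via natural language instruction templates" concrete, here is a toy Python sketch that turns one labeled NLI example into instruction-tuning pairs. The template wordings and field names are illustrative assumptions, not FLAN's actual templates.

# Hypothetical instruction templates in the spirit of FLAN (illustrative only).
TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: yes, no",
    "{premise}\nBased on the paragraph above, can we conclude that "
    '"{hypothesis}"? OPTIONS: yes, no',
]

def verbalize(example):
    """Turn one labeled example into (instruction, target) finetuning pairs."""
    return [(t.format(**example), example["label"]) for t in TEMPLATES]

pairs = verbalize({
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A person is making music.",
    "label": "yes",
})
# Pooling such pairs over 60+ datasets yields the instruction-tuning mixture;
# evaluation is then done on task types held out from that mixture.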
2. [CL] Multimodal Conditionality for Natural Language Generation
M Sollami, A Jain
[Salesforce Einstein]
Multimodal conditionality for natural language generation. Large-scale pretrained language models deliver state-of-the-art performance on language-understanding tasks, and their application has recently expanded to multimodal learning, yielding improved representations that combine vision and language. Progress in finetuning language models for conditional natural language generation (NLG), however, has been limited to a single modality, usually text. This paper proposes MAnTiS (Multimodal Adaptation for Text Synthesis), a general approach to multimodal conditionality in transformer-based NLG models: inputs from each modality are passed through modality-specific encoders, projected into the textual token space, and concatenated into a conditionality prefix; the pretrained language model and the encoders are then finetuned with this prefix guiding generation. Applied to product-description generation, a network conditioned on product images and titles with MAnTiS outperforms strong baselines on standard NLG metrics, and qualitative assessment shows it generates human-quality descriptions consistent with the given multimodal inputs. (A minimal sketch of the prefix construction follows this entry.)
Large scale pretrained language models have demonstrated state-of-the-art performance in language understanding tasks. Their application has recently expanded into multimodality learning, leading to improved representations combining vision and language. However, progress in adapting language models towards conditional Natural Language Generation (NLG) has been limited to a single modality, generally text. We propose MAnTiS, Multimodal Adaptation for Text Synthesis, a general approach for multimodal conditionality in transformer-based NLG models. In this method, we pass inputs from each modality through modality-specific encoders, project to textual token space, and finally join to form a conditionality prefix. We fine-tune the pretrained language model and encoders with the conditionality prefix guiding the generation. We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text. We demonstrate that MAnTiS outperforms strong baseline approaches on standard NLG scoring metrics. Furthermore, qualitative assessments demonstrate that MAnTiS can generate human quality descriptions consistent with given multimodal inputs.
https://weibo.com/1402400261/Kx0kfh9VZ
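The encode-project-concatenate recipe is easy to sketch. The PyTorch fragment below is a guess at the shape of the approach under assumed dimensions and names; it is not Salesforce's implementation, and the vision encoder is abstracted away.

import torch
import torch.nn as nn

class ConditionalityPrefix(nn.Module):
    """Hypothetical MAnTiS-style prefix builder: encode each modality,
    project into the LM's token-embedding space, then concatenate."""
    def __init__(self, d_img=2048, d_model=768, vocab_size=50257):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_model)         # image features -> token space
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # shared with the LM in practice

    def forward(self, img_feats, title_ids):
        # img_feats: (B, n_img, d_img) from some vision encoder (assumption)
        # title_ids: (B, n_title) tokenized product title
        return torch.cat([self.img_proj(img_feats), self.tok_emb(title_ids)], dim=1)

prefix = ConditionalityPrefix()(torch.randn(2, 4, 2048),
                                torch.randint(0, 50257, (2, 8)))
# prefix: (2, 12, 768). Prepended to the description's token embeddings, it
# conditions the transformer decoder, which is finetuned to generate the text.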
3. [CL] Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT
E Voita, R Sennrich, I Titov
[University of Edinburgh & University of Zurich]
Language modeling, lexical translation, reordering: the training process of NMT through the lens of classical SMT. Unlike traditional statistical machine translation (SMT), which decomposes translation into distinct, separately learned components, neural machine translation (NMT) models the whole translation process with a single neural network. Although NMT is the de facto standard, it remains unclear how NMT models acquire their different competences over the course of training, and how this mirrors the component models of traditional SMT. This paper studies the competences corresponding to three core SMT components and finds that during training, NMT first focuses on learning target-side language modeling, then improves translation quality up to roughly word-by-word translation, and finally learns more complex reordering patterns; this behavior holds across several models and language pairs. The paper also shows how this understanding of the training process is useful in practice, for example to improve vanilla non-autoregressive NMT by guiding teacher-model selection. (An illustrative probe sketch follows this entry.)
Differently from the traditional statistical MT that decomposes the translation task into distinct separately learned components, neural machine translation uses a single neural network to model the entire translation process. Despite neural machine translation being the de facto standard, it is still not clear how NMT models acquire different competences over the course of training, and how this mirrors the different models in traditional SMT. In this work, we look at the competences related to three core SMT components and find that during training, NMT first focuses on learning target-side language modeling, then improves translation quality approaching word-by-word translation, and finally learns more complicated reordering patterns. We show that this behavior holds for several models and language pairs. Additionally, we explain how such an understanding of the training process can be useful in practice and, as an example, show how it can be used to improve vanilla non-autoregressive neural machine translation by guiding teacher model selection.
https://weibo.com/1402400261/Kx0n301Uf
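One way to make "NMT starts out as a target-side language model" concrete is to compare checkpoint prediction distributions against a standalone LM across training. The sketch below is an illustrative probe under assumed inputs, not the paper's exact measurement protocol.

import math

def mean_kl(p_dists, q_dists, eps=1e-9):
    """Average KL(p || q) over aligned next-token distributions,
    each given as a dict mapping token -> probability."""
    total = 0.0
    for p, q in zip(p_dists, q_dists):
        total += sum(pi * math.log(pi / max(q.get(tok, 0.0), eps))
                     for tok, pi in p.items() if pi > 0)
    return total / len(p_dists)

# Toy usage: if early NMT checkpoints behave like the target-side LM,
# KL should start small and grow as translation competence takes over.
lm_preds   = [{"the": 0.6, "a": 0.4}]
ckpt_preds = [{"the": 0.55, "a": 0.45}]
print(mean_kl(ckpt_preds, lm_preds))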
4. [CL] Learning Neural Models for Natural Language Processing in the Face of Distributional Shift
P Michel
[CMU]
Learning neural NLP models in the face of distributional shift. The dominant NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has achieved state-of-the-art performance across many applications (e.g. sentiment classification, span-prediction question answering, machine translation). However, it rests on the assumption that the data distribution is stationary, i.e. that data is sampled from one fixed distribution at both training and test time. This is at odds with how humans learn from and operate in a constantly changing stream of information, and ill-suited to real-world use, where the data distribution shifts over a model's lifetime. The thesis first characterizes the forms this shift can take in NLP and proposes benchmarks and evaluation metrics to measure its effect on current deep learning architectures. It then develops methods for mitigating distributional shift based on parametric reformulations of the distributionally robust optimization (DRO) framework, and shows empirically that they yield more robust models on a selection of realistic problems. The third part explores efficient adaptation of existing models to new domains or tasks, drawing on information geometry to derive a new gradient-update rule that alleviates catastrophic forgetting during adaptation. (A generic DRO sketch follows this entry.)
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (e.g. sentiment classification, span-prediction based question answering or machine translation). However, it builds upon the assumption that the data distribution is stationary, i.e. that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we demonstrate that these approaches yield more robust models as demonstrated on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule which alleviates catastrophic forgetting issues during adaptation.
https://weibo.com/1402400261/Kx0uE358v
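For intuition about the distributionally robust optimization framework the thesis builds on, the simplest DRO-flavored objective softly reweights per-example losses toward the worst case. The sketch below shows that generic idea with a made-up temperature parameter; the thesis's parametric reformulations are more involved.

import torch

def soft_worst_case_loss(per_example_losses, tau=1.0):
    """Upweight high-loss examples via a softmax over losses: tau -> 0
    approaches the max loss, large tau approaches uniform weighting (ERM)."""
    weights = torch.softmax(per_example_losses.detach() / tau, dim=0)
    return (weights * per_example_losses).sum()

losses = torch.tensor([0.2, 0.3, 2.5], requires_grad=True)
soft_worst_case_loss(losses).backward()
print(losses.grad)  # gradient mass concentrates on the hard third example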
5. [CL] Challenges in Generalization in Open Domain Question Answering
L Liu, P Lewis, S Riedel, P Stenetorp
[University College London]
Challenges in generalization in open-domain question answering. Recent work on open-domain question answering has shown a large gap in model performance between novel test questions and those that largely overlap with training questions, but it has been unclear which aspects of novel questions make them challenging. Drawing on studies of systematic generalization, this paper introduces and annotates questions according to three categories that measure different levels and kinds of generalization: training-set overlap, compositional generalization (comp-gen), and novel-entity generalization (novel-entity). Evaluating six popular parametric and non-parametric models on the established Natural Questions and TriviaQA datasets, even the strongest models score 13.1/5.4% and 9.6/1.5% lower on comp-gen/novel-entity questions than on the full test sets, showing the challenge these question types pose. While non-parametric models can handle questions containing novel entities, they struggle with those requiring compositional generalization. Thorough analysis identifies the key difficulty factors as cascading errors from the retrieval component, question-pattern frequency, and entity frequency. (A crude overlap-bucketing sketch follows this entry.)
Recent work on Open Domain Question Answering has shown that there is a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it is as yet unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization (comp-gen), and novel entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models, we find that for the established Natural Questions and TriviaQA datasets, even the strongest model performance for comp-gen/novel-entity is 13.1/5.4% and 9.6/1.5% lower compared to that for the full test set, indicating the challenge posed by these types of questions. Furthermore, we show that whilst non-parametric models can handle questions containing novel entities, they struggle with those requiring compositional generalization. Through thorough analysis we find that key question difficulty factors are: cascading errors from the retrieval component, frequency of question pattern, and frequency of the entity.
https://weibo.com/1402400261/Kx0ytzal7
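Of the three categories, training-set overlap is the only one that can be approximated mechanically; comp-gen and novel-entity require annotation. A crude sketch of such an overlap bucket, with an assumed normalization scheme:

import string

def normalize(q):
    """Lowercase and strip punctuation/whitespace for fuzzy question matching."""
    return q.lower().strip().translate(str.maketrans("", "", string.punctuation))

def overlap_bucket(test_question, train_questions):
    """Coarse 'training set overlap' label; the comp-gen and novel-entity
    categories are annotated by hand and not captured here."""
    return "overlap" if normalize(test_question) in train_questions else "novel"

train = {normalize(q) for q in ["Who wrote Hamlet?",
                                "What is the capital of France?"]}
print(overlap_bucket("Who wrote Hamlet", train))  # -> "overlap"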
A few more papers worth noting:
[CL] Effective Sequence-to-Sequence Dialogue State Tracking
J Zhao, M Mahdieh, Y Zhang, Y Cao, Y Wu
[Google Research]
https://weibo.com/1402400261/Kx0ClwFid
[CL] Survey of Low-Resource Machine Translation
B Haddow, R Bawden, A V M Barone, J Helcl, A Birch
[University of Edinburgh]
https://weibo.com/1402400261/Kx0Ev8NOC
[RO] Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation
S Nair, E Mitchell, K Chen, B Ichter, S Savarese, C Finn
[Stanford University & Robotics at Google]
https://weibo.com/1402400261/Kx0GDbFcs
[CL] Discretized Integrated Gradients for Explaining Language Models
S Sanyal, X Ren
[University of Southern California]
https://weibo.com/1402400261/Kx0IDgURE