LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: a unified language learning paradigm; few-shot parameter-efficient fine-tuning that is better and cheaper than in-context learning; learning agent-based models from data; data distributional properties that drive few-shot learning in Transformers; teaching detectors to track without fully annotated videos; unsupervised homography estimation with a coplanarity-aware GAN; conditioning input noise for controlled image generation with diffusion models; efficient few-shot fine-tuning for opinion summarization; reliable Monte Carlo localization for mobile robots
1. [CL] Unifying Language Learning Paradigms
Y Tay, M Dehghani, V Q. Tran, X Garcia, D Bahri, T Schuster, H S Zheng, N Houlsby, D Metzler
[Google Research]
Unifying language learning paradigms. Existing pre-trained models are generally built for one particular class of problems, and so far there seems to be no consensus on what the right architecture and pre-training setup should be. This paper proposes a unified pre-training framework that is effective across datasets and setups. It separates architectural archetypes from pre-training objectives, two concepts that are often conflated, presents a general, unified view of self-supervision in NLP, and shows how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective. The paper proposes Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms, and introduces the notion of mode switching, in which downstream fine-tuning is associated with a specific pre-training scheme. Extensive ablation experiments comparing multiple pre-training objectives show that the proposed method pushes the Pareto frontier, outperforming T5 and/or GPT-like models across many different setups. Scaled to 20B parameters, the model achieves SOTA performance on 50 well-established supervised NLP tasks spanning language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long-text reasoning, structured knowledge grounding, and information retrieval. The model also performs well at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. We release Flax-based T5X model checkpoints for the 20B model at this https URL.
https://arxiv.org/abs/2205.05131
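To make the MoD idea above slightly more concrete, here is a minimal, self-contained sketch of a mixture-of-denoisers-style corruption routine: each training example is corrupted by one randomly chosen denoiser (short-span, long-span/aggressive, or prefix-style) and tagged with a mode token. The denoiser settings, mode-token strings, and sampling scheme are illustrative assumptions, not the paper's configuration.

```python
# Illustrative mixture-of-denoisers corruption (assumed settings, not UL2's).
import random

MODE_TOKENS = {"R": "[R]", "S": "[S]", "X": "[X]"}  # hypothetical mode-switch tokens

def span_corrupt(tokens, span_len, corrupt_rate, sentinel="<extra_id_{}>"):
    """Mask random spans; return (inputs, targets) in a T5-style span-corruption format."""
    n = len(tokens)
    n_spans = max(1, int(n * corrupt_rate / span_len))
    population = max(1, n - span_len)
    starts = sorted(random.sample(range(population), min(n_spans, population)))
    inputs, targets, cursor = [], [], 0
    for i, s in enumerate(starts):
        if s < cursor:          # skip overlapping spans
            continue
        inputs += tokens[cursor:s] + [sentinel.format(i)]
        targets += [sentinel.format(i)] + tokens[s:s + span_len]
        cursor = s + span_len
    inputs += tokens[cursor:]
    return inputs, targets

def sample_denoiser(tokens):
    """Pick one denoiser at random and prepend its mode token to the inputs."""
    mode = random.choice(list(MODE_TOKENS))
    if mode == "R":      # regular: short spans, low corruption
        inp, tgt = span_corrupt(tokens, span_len=3, corrupt_rate=0.15)
    elif mode == "X":    # extreme: long spans, high corruption
        inp, tgt = span_corrupt(tokens, span_len=12, corrupt_rate=0.5)
    else:                # sequential: predict the suffix from the prefix
        cut = len(tokens) // 2
        inp, tgt = tokens[:cut], tokens[cut:]
    return [MODE_TOKENS[mode]] + inp, tgt

if __name__ == "__main__":
    toks = [f"t{i}" for i in range(40)]
    x, y = sample_denoiser(toks)
    print(x, "->", y)
```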
2. [LG] Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
H Liu, D Tam, M Muqeeth, J Mohta, T Huang, M Bansal, C Raffel
[University of North Carolina at Chapel Hill]
Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Few-shot in-context learning (ICL) lets a pre-trained language model perform a previously unseen task without any gradient-based training, simply by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it processes all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (e.g., adapter modules, prompt tuning, sparse update methods) offers an alternative paradigm in which a small set of parameters is trained to enable the model to perform the new task. This paper rigorously compares few-shot ICL with parameter-efficient fine-tuning and shows that the latter offers better accuracy at dramatically lower computational cost. Along the way, it proposes a new parameter-efficient fine-tuning method, (IA)³, which scales activations by learned vectors and attains stronger performance while introducing relatively few new parameters. The paper also proposes T-Few, a simple recipe based on the T0 model that can be applied to new tasks without task-specific tuning or modifications. Applying T-Few to the RAFT benchmark validates its effectiveness on completely unseen tasks: it attains super-human performance for the first time and outperforms the previous state of the art by 6% absolute.
Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new parameter-efficient fine-tuning method called (IA)3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available.
https://arxiv.org/abs/2205.05638
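The core (IA)³ mechanism, scaling activations by learned vectors, is simple enough to sketch. The following is a minimal illustration, not the authors' code: learned vectors rescale keys, values, and the feed-forward hidden activations, and only those vectors are left trainable. The single-head attention and module names are simplifications.

```python
# Minimal (IA)^3-style sketch: learned rescaling vectors are the only trainable parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IA3SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        # (IA)^3 vectors, initialized to ones so the frozen model is unchanged at start
        self.l_k = nn.Parameter(torch.ones(d_model))
        self.l_v = nn.Parameter(torch.ones(d_model))

    def forward(self, x):                        # x: (batch, seq, d_model)
        q = self.q(x)
        k = self.k(x) * self.l_k                 # rescale keys
        v = self.v(x) * self.l_v                 # rescale values
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

class IA3FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)) * self.l_ff)   # rescale hidden activations

def ia3_parameters(model):
    """Freeze everything except the (IA)^3 vectors for parameter-efficient tuning."""
    for name, p in model.named_parameters():
        p.requires_grad = name.split(".")[-1] in {"l_k", "l_v", "l_ff"}
    return [p for p in model.parameters() if p.requires_grad]
```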
3. [LG] On learning agent-based models from data
C Monti, M Pangallo, G D F Morales, F Bonchi
[Centai & Sant’Anna School of Advanced Studies]
On learning agent-based models from data. Agent-based models (ABMs) are used in several fields to study the evolution of complex systems from micro-level assumptions. However, ABMs typically cannot estimate agent-specific ("micro") variables: this is a major limitation that prevents ABMs from exploiting micro-level data and greatly limits their predictive power. This paper proposes a protocol for learning the latent micro-variables of an ABM from data. The first step reduces the ABM to a probabilistic model characterized by a computationally tractable likelihood. This reduction follows two general design principles: balancing stochasticity and data availability, and replacing unobservable discrete choices with differentiable approximations. The protocol then maximizes the likelihood of the latent variables with a gradient-based expectation-maximization algorithm. It is demonstrated on an ABM of the housing market in which agents with different incomes bid higher prices to live in high-income neighborhoods. The resulting model allows accurate estimates of the latent variables while preserving the general behavior of the ABM, and the estimates can be used for out-of-sample forecasting. The protocol can be seen as an alternative to black-box data assimilation methods that forces the modeler to lay bare the model's assumptions, to think about the inferential process, and to spot potential identification problems.
Agent-Based Models (ABMs) are used in several fields to study the evolution of complex systems from micro-level assumptions. However, ABMs typically can not estimate agent-specific (or "micro") variables: this is a major limitation which prevents ABMs from harnessing micro-level data availability and which greatly limits their predictive power. In this paper, we propose a protocol to learn the latent micro-variables of an ABM from data. The first step of our protocol is to reduce an ABM to a probabilistic model, characterized by a computationally tractable likelihood. This reduction follows two general design principles: balance of stochasticity and data availability, and replacement of unobservable discrete choices with differentiable approximations. Then, our protocol proceeds by maximizing the likelihood of the latent variables via a gradient-based expectation maximization algorithm. We demonstrate our protocol by applying it to an ABM of the housing market, in which agents with different incomes bid higher prices to live in high-income neighborhoods. We demonstrate that the obtained model allows accurate estimates of the latent variables, while preserving the general behavior of the ABM. We also show that our estimates can be used for out-of-sample forecasting. Our protocol can be seen as an alternative to black-box data assimilation methods, that forces the modeler to lay bare the assumptions of the model, to think about the inferential process, and to spot potential identification problems.
https://arxiv.org/abs/2205.05052
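As a toy illustration of the protocol (not the authors' housing-market ABM), the sketch below reduces a tiny bidding model to a tractable likelihood, replaces a discrete neighborhood choice with a softmax relaxation, and recovers latent per-agent incomes by gradient-based likelihood maximization. All quantities and functional forms are invented for illustration.

```python
# Toy stand-in for "reduce the ABM to a tractable likelihood, then fit latent micro-variables".
import torch

torch.manual_seed(0)
n_agents, n_neighborhoods = 100, 5

# Ground truth (unknown to the "modeler"): latent log-incomes per agent.
true_log_income = torch.randn(n_agents) * 0.5 + 1.0
quality = torch.linspace(0.0, 1.0, n_neighborhoods)          # neighborhood quality

# Observed data: noisy bids; higher-income agents lean towards better neighborhoods.
with torch.no_grad():
    probs = torch.softmax(true_log_income[:, None] * quality[None, :], dim=1)  # relaxed choice
    observed_bids = (probs @ quality
                     + torch.exp(true_log_income) * 0.1
                     + 0.05 * torch.randn(n_agents))

# Inference: maximize the Gaussian log-likelihood of the bids w.r.t. the latent incomes.
log_income = torch.zeros(n_agents, requires_grad=True)
opt = torch.optim.Adam([log_income], lr=0.05)
for step in range(500):
    probs = torch.softmax(log_income[:, None] * quality[None, :], dim=1)
    predicted = probs @ quality + torch.exp(log_income) * 0.1
    nll = ((observed_bids - predicted) ** 2).sum()            # proportional to the negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()

print("correlation with true latent incomes:",
      torch.corrcoef(torch.stack([log_income.detach(), true_log_income]))[0, 1].item())
```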
4. [CL] Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers
S C.Y. Chan, A Santoro, A K. Lampinen, J X. Wang, A Singh, P H. Richemond...
[DeepMind & University College London]
Data distributional properties drive emergent few-shot learning in Transformers. Large transformer-based language models can perform few-shot learning (also known as in-context learning) without having been explicitly trained for it. This paper hypothesizes that specific distributional properties of natural language might drive this emergent phenomenon, since these characteristics might lead to a kind of training intermediate between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). It further hypothesizes that these distributional properties could lead to emergent few-shot learning in domains other than language. Motivated by this idea, a series of experiments on a standard image-based few-shot dataset shows that several data properties do indeed promote the emergence of few-shot learning in transformer models. All of these properties are present in natural language: burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data influence whether a model is biased towards few-shot learning or towards memorizing information in its weights; models can generally do well at only one or the other. An additional distributional property, a skewed Zipfian distribution over classes, which also occurs in language, allows the two capabilities to coexist in the same model. Notably, training data that elicit few-shot learning in transformers fail to elicit it in recurrent models. The paper finds that few-shot learning emerges only when the right architecture is applied to the right data distribution; neither component is sufficient on its own.
Large transformer-based language models are able to perform few-shot learning (also known as in-context learning), without having been explicitly trained for it. We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon, as these characteristics might lead to a kind of interpolation between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). We also hypothesized that these distributional properties could lead to emergent few-shot learning in domains outside of language. Inspired by this idea, we ran a series of experiments on a standard image-based few-shot dataset. We discovered that a number of data properties did indeed promote the emergence of few-shot learning in transformer models. All of these properties are present in natural language -- burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data influenced whether models were biased towards either few-shot learning vs. memorizing information in their weights; models could generally perform well at only one or the other. However, we discovered that an additional distributional property could allow the two capabilities to co-exist in the same model -- a skewed, Zipfian distribution over classes -- which occurs in language as well. Notably, training data that could elicit few-shot learning in transformers were unable to elicit few-shot learning in recurrent models. In sum, we find that few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own.
https://arxiv.org/abs/2205.05055
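Here is a minimal sketch of how one might construct training sequences with the distributional properties highlighted above: a Zipfian class marginal plus "bursty" contexts in which the query class recurs. The class count, sequence length, and burst sizes are arbitrary choices, not the paper's.

```python
# Illustrative generator for Zipfian, bursty few-shot training sequences (assumed parameters).
import numpy as np

rng = np.random.default_rng(0)
n_classes, seq_len, zipf_exponent = 1000, 8, 1.0

# Zipfian marginal over classes: p(c) proportional to 1 / rank(c)^exponent
ranks = np.arange(1, n_classes + 1)
class_probs = ranks ** (-zipf_exponent)
class_probs /= class_probs.sum()

def bursty_sequence():
    """Return (context_classes, query_class) with the query class repeated in the context."""
    query = rng.choice(n_classes, p=class_probs)
    distractor = rng.choice(n_classes, p=class_probs)
    # Burstiness: the query class and one distractor class each occupy several context slots.
    context = np.concatenate([np.full(3, query),
                              np.full(3, distractor),
                              rng.choice(n_classes, size=seq_len - 6, p=class_probs)])
    rng.shuffle(context)
    return context, query

ctx, q = bursty_sequence()
print("context classes:", ctx, "query class:", q)
```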
5. [CV] TDT: Teaching Detectors to Track without Fully Annotated Videos
S Yu, G Wu, C Gu, M E. Fathy
[Duke University & Google LLC]
TDT: teaching detectors to track without fully annotated videos. One-stage trackers that use a joint model to predict both detections and appearance embeddings in a single forward pass have recently received much attention and achieved state-of-the-art results on Multi-Object Tracking (MOT) benchmarks. However, their success depends on videos fully annotated with tracking data, which are expensive and hard to obtain, and this can limit model generalization. In comparison, two-stage approaches, which perform detection and embedding separately, are slower but easier to train because their data are easier to annotate. This paper proposes to combine the best of both worlds through data distillation: a teacher embedder trained on Re-ID datasets generates pseudo appearance-embedding labels for a detection dataset, and the augmented dataset is used to train a detector that also regresses these pseudo-embeddings in a fully convolutional fashion. The proposed one-stage solution matches its two-stage counterpart in quality while being 3 times faster. Even though the teacher embedder never sees any tracking data during training, the proposed tracker achieves performance competitive with popular trackers (e.g., JDE) trained on fully labeled tracking data.
Recently, one-stage trackers that use a joint model to predict both detections and appearance embeddings in one forward pass received much attention and achieved state-of-the-art results on the Multi-Object Tracking (MOT) benchmarks. However, their success depends on the availability of videos that are fully annotated with tracking data, which is expensive and hard to obtain. This can limit the model generalization. In comparison, the two-stage approach, which performs detection and embedding separately, is slower but easier to train as their data are easier to annotate. We propose to combine the best of the two worlds through a data distillation approach. Specifically, we use a teacher embedder, trained on Re-ID datasets, to generate pseudo appearance embedding labels for the detection datasets. Then, we use the augmented dataset to train a detector that is also capable of regressing these pseudo-embeddings in a fully-convolutional fashion. Our proposed one-stage solution matches the two-stage counterpart in quality but is 3 times faster. Even though the teacher embedder has not seen any tracking data during training, our proposed tracker achieves competitive performance with some popular trackers (e.g. JDE) trained with fully labeled tracking data.
https://arxiv.org/abs/2205.05583
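A schematic sketch of the distillation step described above (not the authors' implementation): a frozen teacher Re-ID embedder produces pseudo appearance-embedding labels for boxes in a detection dataset, and the detector's embedding head is trained to regress them with a cosine loss. The crop function, teacher model, and loss weighting are placeholders.

```python
# Schematic embedding-distillation step; teacher, crop_fn and loss weighting are placeholders.
import torch
import torch.nn.functional as F

def pseudo_embedding_labels(teacher, image, boxes, crop_fn):
    """Run the frozen teacher on box crops to get pseudo appearance embeddings."""
    with torch.no_grad():
        crops = torch.stack([crop_fn(image, b) for b in boxes])   # (N, C, H, W)
        return F.normalize(teacher(crops), dim=-1)                # (N, D)

def embedding_distillation_loss(student_embeddings, pseudo_labels):
    """Cosine distance between the detector's predicted embeddings and the pseudo labels."""
    student = F.normalize(student_embeddings, dim=-1)
    return (1.0 - (student * pseudo_labels).sum(dim=-1)).mean()

# Hypothetical usage inside a detector's training step:
#   pseudo = pseudo_embedding_labels(teacher, image, gt_boxes, crop_fn)
#   loss = detection_loss + lambda_embed * embedding_distillation_loss(pred_embed, pseudo)
```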
A few other papers worth noting:
[CV] Unsupervised Homography Estimation with Coplanarity-Aware GAN
M Hong, Y Lu, N Ye, C Lin, Q Zhao, S Liu
[Megvii Technology & Beijing Jiaotong University & University of Electronic Science and Technology of China]
https://arxiv.org/abs/2205.03821
[CV] On Conditioning the Input Noise for Controlled Image Generation with Diffusion Models
V Singh, S Jandial, A Chopra, S Ramesh, B Krishnamurthy, V N. Balasubramanian
[IIT Hyderabad & Adobe MDSR & MIT]
https://arxiv.org/abs/2205.03859
[CL] Efficient Few-Shot Fine-Tuning for Opinion Summarization
A Bražinskas, R Nallapati, M Bansal, M Dreyer
[ILCC, University of Edinburgh & Amazon]
https://arxiv.org/abs/2205.02170
[RO] Reliable Monte Carlo Localization for Mobile Robots
N Akai
[Nagoya University]
https://arxiv.org/abs/2205.04769