LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics
Reposted from 爱可可爱生活
Summary: large language models are zero-shot reasoners; one-shot object pose estimation without CAD models; tracing knowledge in language models back to the training data; on the paradox of learning to reason from data; scaling laws and interpretability of learning from repeated data; physically-based editing of indoor scene lighting from a single image; the epistemic and ethical implications of unification in machine learning; deep learning methods for proximal inference via maximum moment restriction; controlling translation formality using pre-trained multilingual language models
1. [CL] Large Language Models are Zero-Shot Reasoners
T Kojima, S S Gu, M Reid, Y Matsuo, Y Iwasawa
[The University of Tokyo & Google Research]
Pretrained large language models (LLMs) are widely used across many sub-fields of NLP and are generally regarded as excellent few-shot learners when given task-specific exemplars. Notably, chain-of-thought (CoT) prompting, a recent technique that elicits complex multi-step reasoning through step-by-step answer examples, has achieved state-of-the-art results on arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws of LLMs. While these successes are usually attributed to the few-shot learning ability of LLMs, this paper shows that LLMs are decent zero-shot reasoners if "Let's think step by step" is simply prepended to each answer. Experiments show that the proposed Zero-shot-CoT, using the same single prompt template, significantly outperforms standard zero-shot LLM prompting on diverse reasoning benchmarks, including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples; for example, with an off-the-shelf 175B-parameter model, accuracy rises from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K. The versatility of this single prompt across very different reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting that high-level, multi-task, broad cognitive abilities can be extracted through simple prompting. The authors hope the work not only serves as a minimal but strong zero-shot baseline for challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the vast zero-shot knowledge hidden inside LLMs before crafting fine-tuning datasets or few-shot exemplars.
Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance in arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs’ ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding “Let’s think step by step” before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performance on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g., increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with an off-the-shelf 175B parameter model. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted through simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
https://arxiv.org/abs/2205.11916
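The Zero-shot-CoT recipe above is a two-stage prompting procedure, so a minimal sketch is easy to give. The snippet below is illustrative only, not the authors' released code: `complete(prompt)` is a hypothetical stand-in for whatever LLM completion API is available, and the answer-extraction phrasing follows the paper's arithmetic setting.

```python
# Minimal sketch of two-stage Zero-shot-CoT prompting (illustrative only).

def complete(prompt: str) -> str:
    """Hypothetical wrapper: call an LLM completion API and return its text."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Stage 1: reasoning extraction -- append the trigger phrase and let the
    # model generate a step-by-step rationale.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    rationale = complete(reasoning_prompt)

    # Stage 2: answer extraction -- feed the rationale back and ask for the
    # final answer in an easily parsed format.
    answer_prompt = (
        f"{reasoning_prompt} {rationale}\n"
        "Therefore, the answer (arabic numerals) is"
    )
    return complete(answer_prompt).strip()

# Usage (once `complete` is wired to a real model):
# zero_shot_cot("A juggler can juggle 16 balls. Half of the balls are golf "
#               "balls, and half of the golf balls are blue. How many blue "
#               "golf balls are there?")
```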
2. [CV] OnePose: One-Shot Object Pose Estimation without CAD Models
J Sun, Z Wang, S Zhang, X He, H Zhao, G Zhang, X Zhou
[Zhejiang University & SenseTime Research & TUM]
The paper proposes OnePose, a new method for object pose estimation. Unlike existing instance-level or category-level methods, OnePose does not rely on CAD models and can handle objects of arbitrary categories without instance- or category-specific network training. Borrowing the idea of visual localization, it only requires a simple RGB video scan of the object to build a sparse SfM model, which is then registered to new query images with a generic feature-matching network. To mitigate the slow runtime of existing visual localization methods, the authors propose a new graph attention network that directly matches 2D interest points in the query image to 3D points in the SfM model, yielding efficient and robust pose estimation. Combined with a feature-based pose tracker, OnePose can stably detect and track the 6D poses of everyday household objects in real time. The authors also collected a large-scale dataset containing 450 sequences of 150 objects.
We propose a new method named OnePose for object pose estimation. Unlike existing instance-level or category-level methods, OnePose does not rely on CAD models and can handle objects in arbitrary categories without instance- or category-specific network training. OnePose draws the idea from visual localization and only requires a simple RGB video scan of the object to build a sparse SfM model of the object. Then, this model is registered to new query images with a generic feature matching network. To mitigate the slow runtime of existing visual localization methods, we propose a new graph attention network that directly matches 2D interest points in the query image with the 3D points in the SfM model, resulting in efficient and robust pose estimation. Combined with a feature-based pose tracker, OnePose is able to stably detect and track 6D poses of everyday household objects in real-time. We also collected a large-scale dataset that consists of 450 sequences of 150 objects. Code and data are available at the project page: https://zju3dv.github.io/onepose/.
https://arxiv.org/abs/2205.12257
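Once the graph attention network has produced 2D-3D correspondences, the final pose is recovered by a standard PnP solve. The sketch below shows only that last step, assuming the matched keypoints, model points, and camera intrinsics are already available (the matcher itself and OnePose's tracker are not shown); it uses OpenCV's RANSAC-based PnP rather than the authors' implementation.

```python
import numpy as np
import cv2

def estimate_pose(points_3d: np.ndarray,   # (N, 3) matched SfM model points
                  points_2d: np.ndarray,   # (N, 2) matched query keypoints
                  K: np.ndarray):          # (3, 3) camera intrinsics
    """Recover a 6D object pose from 2D-3D matches via PnP + RANSAC."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,              # assume undistorted images
        reprojectionError=3.0,        # inlier threshold in pixels
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> rotation matrix
    return R, tvec, inliers
```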
3. [CL] Tracing Knowledge in Language Models Back to the Training Data
E Akyürek, T Bolukbasi, F Liu, B Xiong, I Tenney, J Andreas, K Guu
[MIT CSAIL & Google Research]
Neural language models (LMs) have been shown to memorize a great deal of factual knowledge, but when an LM generates an assertion it is often difficult to determine where it learned that information and whether it is true. The paper introduces a new benchmark for fact tracing: tracing an LM's assertions back to the training examples that provided evidence for those predictions. Prior work suggested that dataset-level influence methods might offer an effective framework for tracing predictions back to training data, but such methods have not been evaluated for fact tracing; researchers have mainly studied them through qualitative analysis or as a data-cleaning technique for classification/regression tasks. The paper presents the first experiments that evaluate influence methods for fact tracing using well-established information retrieval (IR) metrics. Comparing two popular families of influence methods, gradient-based and embedding-based, the authors show that neither can fact-trace reliably; in fact, both fail to outperform an IR baseline (BM25) that does not even access the LM. The paper explores why this happens (e.g., gradient saturation) and shows that existing influence methods must be improved significantly before they can reliably attribute factual predictions in LMs.
Neural language models (LMs) have been shown to memorize a great deal of factual knowledge. But when an LM generates an assertion, it is often difficult to determine where it learned this information and whether it is true. In this paper, we introduce a new benchmark for fact tracing: tracing language models’ assertions back to the training examples that provided evidence for those predictions. Prior work has suggested that dataset-level influence methods might offer an effective framework for tracing predictions back to training data. However, such methods have not been evaluated for fact tracing, and researchers primarily have studied them through qualitative analysis or as a data cleaning technique for classification/regression tasks. We present the first experiments that evaluate influence methods for fact tracing, using well-understood information retrieval (IR) metrics. We compare two popular families of influence methods – gradient-based and embedding-based – and show that neither can fact-trace reliably; indeed, both methods fail to outperform an IR baseline (BM25) that does not even access the LM. We explore why this occurs (e.g., gradient saturation) and demonstrate that existing influence methods must be improved significantly before they can reliably attribute factual predictions in LMs.
https://arxiv.org/abs/2205.11482
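For context, the BM25 baseline mentioned above is easy to reproduce in miniature. The sketch below is not the paper's evaluation code: it uses the `rank_bm25` package and a three-document placeholder corpus to show how training examples can be ranked per query and scored with mean reciprocal rank (MRR), one of the standard IR metrics.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Placeholder corpus of "training examples" and one factual query.
corpus = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "The Eiffel Tower is located in Paris.",
    "Honolulu is the capital of Hawaii.",
]
queries = ["Where was Barack Obama born"]
relevant = [{0}]  # indices of the evidence examples for each query

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mean_reciprocal_rank(queries, relevant):
    reciprocal_ranks = []
    for query, rel_ids in zip(queries, relevant):
        scores = bm25.get_scores(query.lower().split())
        ranking = np.argsort(scores)[::-1]  # best-scoring docs first
        rank = next(i + 1 for i, doc_id in enumerate(ranking) if doc_id in rel_ids)
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

print(f"MRR = {mean_reciprocal_rank(queries, relevant):.3f}")
```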
4. [CL] On the Paradox of Learning to Reason from Data
H Zhang, L H Li, T Meng, K Chang, G V d Broeck
[University of California, Los Angeles]
Logical reasoning is needed in a wide range of NLP tasks. Can a BERT model be trained end-to-end to solve logical reasoning problems posed in natural language? The paper tries to answer this question in a confined problem space in which there exists a set of parameters that perfectly simulates logical reasoning. The observations appear to contradict each other: BERT attains near-perfect accuracy on in-distribution test examples, yet fails to generalize to other data distributions over the exact same problem space. The study explains this paradox: instead of learning to emulate the correct reasoning function, BERT has in fact learned statistical features that are inherent to logical reasoning problems. The paper also shows that jointly removing such statistical features from the data is infeasible, illustrating how hard learning to reason is in general. The results extend naturally to other neural models and reveal the fundamental difference between learning to reason and learning to achieve high performance on NLP benchmarks by exploiting statistical features.
Logical reasoning is needed in a wide range of NLP tasks. Can a BERT model be trained end-to-end to solve logical reasoning problems presented in natural language? We attempt to answer this question in a confined problem space where there exists a set of parameters that perfectly simulates logical reasoning. We make observations that seem to contradict each other: BERT attains near-perfect accuracy on in-distribution test examples while failing to generalize to other data distributions over the exact same problem space. Our study provides an explanation for this paradox: instead of learning to emulate the correct reasoning function, BERT has in fact learned statistical features that inherently exist in logical reasoning problems. We also show that it is infeasible to jointly remove statistical features from data, illustrating the difficulty of learning to reason in general. Our result naturally extends to other neural models and unveils the fundamental difference between learning to reason and learning to achieve high performance on NLP benchmarks using statistical features.
https://arxiv.org/abs/2205.11502
5. [LG] Scaling Laws and Interpretability of Learning from Repeated Data
D Hernandez, T Brown, T Conerly, N DasSarma, D Drain, S El-Showk, N Elhage, Z Hatfield-Dodds...
[Anthropic]
Recent large language models are trained on vast datasets, but often also on repeated data, either intentionally, to upweight higher-quality data, or unintentionally, because deduplication is imperfect and the model sees repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative effects of such repetition on performance. This paper studies repeated data systematically and tries to understand its effects mechanistically. To do so, the authors train a family of models in which most of the data is unique but a small fraction is repeated many times. They find a strong double-descent phenomenon, in which repeated data can cause test loss to increase midway through training, and a predictable range of repetition frequencies leads to surprisingly severe performance degradation. For instance, an 800M-parameter model can be degraded to the performance of a model half its size (400M parameters) by repeating 0.1% of the data 100 times, even though the other 90% of the training tokens remain unique. The authors suspect there is a middle range in which the data can be memorized, and doing so consumes a large fraction of the model's capacity; this may be where the peak of degradation occurs. Finally, they connect these observations to recent mechanistic interpretability work, which attempts to reverse-engineer the detailed computations performed by the model, showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Together, these results offer a hypothesis for why repeating a relatively small fraction of the data in large language models can cause disproportionately large harm to performance.
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work — attempting to reverse engineer the detailed computations performed by the model — by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.
https://arxiv.org/abs/2205.10487
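The experimental setup described above (mostly unique data with a small, heavily repeated fraction) is simple to emulate when building a training stream. The sketch below is an illustrative construction with placeholder documents and parameters, not the authors' data pipeline.

```python
import random

def build_repeated_dataset(unique_docs, repeat_fraction=0.001, n_repeats=100, seed=0):
    """Return a training list in which `repeat_fraction` of the documents
    appear `n_repeats` times and the rest appear once."""
    rng = random.Random(seed)
    docs = list(unique_docs)
    n_repeated = max(1, int(len(docs) * repeat_fraction))
    repeated_subset = rng.sample(docs, n_repeated)

    # Each chosen document gains (n_repeats - 1) extra copies.
    dataset = docs + repeated_subset * (n_repeats - 1)
    rng.shuffle(dataset)
    return dataset

# Example: 10,000 unique documents, 0.1% of them repeated 100 times.
corpus = [f"doc-{i}" for i in range(10_000)]
train_stream = build_repeated_dataset(corpus)
print(len(corpus), "unique docs ->", len(train_stream), "training docs")
```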
A few more papers worth noting:
[CV] Physically-Based Editing of Indoor Scene Lighting from a Single Image
Z Li, J Shi, S Bi, R Zhu, K Sunkavalli, M Hašan, Z Xu, R Ramamoorthi, M Chandraker
[UC San Diego & Adobe Research]
https://arxiv.org/abs/2205.09343
[LG] Should attention be all we need? The epistemic and ethical implications of unification in machine learning
N Fishman, L Hancox-Li
[University of Oxford & Capital One]
https://arxiv.org/abs/2205.08377
[LG] Deep Learning Methods for Proximal Inference via Maximum Moment Restriction
B Kompa, D R Bellamy, T Kolokotrones, J M Robins, A L Beam
[Harvard Medical School & Harvard School of Public Health]
https://arxiv.org/abs/2205.09824
[CL] Controlling Translation Formality Using Pre-trained Multilingual Language Models
E Rippeth, S Agrawal, M Carpuat
[University of Maryland]
https://arxiv.org/abs/2205.06644