LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
Reposted from 爱可可爱生活
Summary: VectorAdam for rotation-equivariant geometry optimization; pre-training-enhanced spatial-temporal graph neural networks for multivariate time series forecasting; categorical SDEs with simplex diffusion; whether language models can handle recursively nested grammatical structures; truncation sampling as language model desmoothing; beyond minimal-edit counterfactuals for richer data augmentation; lexical generalization improves with larger models and longer training; massive multilingual speech-text joint semi-supervised learning for text-to-speech; fine-tuning T5 for text ranking with ranking losses.
1. [LG] VectorAdam for Rotation Equivariant Geometry Optimization
S Ling, N Sharp, A Jacobson
[University of Toronto]
The Adam optimization algorithm has proven remarkably effective for optimization problems across machine learning and even traditional tasks in geometry processing. At the same time, the development of equivariant methods, which preserve their output under the action of rotation or some other transformation, has proven to be important for geometry problems across these domains. In this work, we observe that Adam, when treated as a function that maps initial conditions to optimized results, is not rotation equivariant for vector-valued parameters due to per-coordinate moment updates. This leads to significant artifacts and biases in practice. We propose to resolve this deficiency with VectorAdam, a simple modification which makes Adam rotation-equivariant by accounting for the vector structure of optimization variables. We demonstrate this approach on problems in machine learning and traditional geometric optimization, showing that equivariant VectorAdam resolves the artifacts and biases of traditional Adam when applied to vector-valued data, with equivalent or even improved rates of convergence.
https://arxiv.org/abs/2205.13599
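To make the fix concrete, here is a minimal NumPy sketch of a rotation-equivariant Adam step in the spirit of the abstract: the only change relative to standard Adam is that the second moment is accumulated per vector, from the squared norm of each row's gradient, rather than per coordinate, so the adaptive rescaling is isotropic within each vector and commutes with rotations. Function names, hyperparameters, and the toy objective are illustrative rather than the paper's implementation.

```python
import numpy as np

def vector_adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One VectorAdam-style step for vector-valued parameters x of shape (N, D).

    Unlike standard Adam, the second moment v holds a single scalar per row,
    accumulated from the squared *norm* of each gradient vector, so the
    per-row rescaling is isotropic and unchanged under rotations of the rows.
    """
    m = beta1 * m + (1 - beta1) * grad                      # first moment, per coordinate: (N, D)
    v = beta2 * v + (1 - beta2) * np.sum(grad**2, axis=1)   # second moment, per vector: (N,)
    m_hat = m / (1 - beta1**t)                              # bias correction
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat)[:, None] + eps)    # one scale broadcast over D coords
    return x, m, v

# toy usage: pull 100 random 3D points toward the origin
x = np.random.randn(100, 3)
m, v = np.zeros_like(x), np.zeros(x.shape[0])
for t in range(1, 201):
    grad = 2 * x                 # gradient of sum_i ||x_i||^2
    x, m, v = vector_adam_step(x, grad, m, v, t)
```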
2. [LG] Pre-training Enhanced Spatial-temporal Graph Neural Network for Multivariate Time Series Forecasting
Z Shao, Z Zhang, F Wang, Y Xu
[Chinese Academy of Sciences]
Multivariate Time Series (MTS) forecasting plays a vital role in a wide range of applications. Recently, Spatial-Temporal Graph Neural Networks (STGNNs) have become increasingly popular MTS forecasting methods. STGNNs jointly model the spatial and temporal patterns of MTS through graph neural networks and sequential models, significantly improving prediction accuracy. But limited by model complexity, most STGNNs only consider short-term historical MTS data, such as data over the past hour. However, the patterns of time series and the dependencies between them (i.e., the temporal and spatial patterns) need to be analyzed on the basis of long-term historical MTS data. To address this issue, we propose a novel framework in which the STGNN is Enhanced by a scalable time series Pre-training model (STEP). Specifically, we design a pre-training model to efficiently learn temporal patterns from very long-term historical time series (e.g., the past two weeks) and generate segment-level representations. These representations provide contextual information for the short-term time series input to STGNNs and facilitate modeling dependencies between time series. Experiments on three public real-world datasets demonstrate that our framework is capable of significantly enhancing downstream STGNNs, and our pre-training model aptly captures temporal patterns.
https://arxiv.org/abs/2206.09113
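The abstract describes a two-stage design: a scalable pre-trained encoder compresses very long history into segment-level representations, which are then handed to a short-term forecaster as extra context. The rough PyTorch sketch below illustrates only that interface; the module names, layer sizes, and the stand-in forecaster (a plain linear head instead of an STGNN) are placeholders, not the paper's architecture or pre-training objective.

```python
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Placeholder long-history encoder: embeds each fixed-length segment and
    contextualizes the segments with a Transformer encoder."""
    def __init__(self, seg_len=168, d_model=96, n_layers=2):
        super().__init__()
        self.seg_len = seg_len
        self.embed = nn.Linear(seg_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, long_history):                 # (batch, n_series, T_long)
        b, n, t = long_history.shape
        segs = long_history.reshape(b * n, t // self.seg_len, self.seg_len)
        reps = self.encoder(self.embed(segs))        # (b*n, n_segs, d_model)
        return reps.reshape(b, n, -1, reps.shape[-1])

class ContextualForecaster(nn.Module):
    """Stand-in for the downstream STGNN: consumes the short-term window plus
    the most recent segment representation as context."""
    def __init__(self, short_len=12, d_model=96, horizon=12):
        super().__init__()
        self.head = nn.Linear(short_len + d_model, horizon)

    def forward(self, short_window, segment_reps):   # (b, n, short_len), (b, n, n_segs, d)
        context = segment_reps[:, :, -1, :]          # last segment as context
        return self.head(torch.cat([short_window, context], dim=-1))

# toy usage: two weeks of hourly data (336 steps) split into two 168-step segments
long_hist = torch.randn(4, 207, 336)     # (batch, series, long history)
short_win = torch.randn(4, 207, 12)      # (batch, series, recent window)
enc, forecaster = SegmentEncoder(), ContextualForecaster()
pred = forecaster(short_win, enc(long_hist))         # -> (4, 207, 12)
```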
3. [LG] Categorical SDEs with Simplex Diffusion
P H. Richemond, S Dieleman, A Doucet
[DeepMind]
Diffusion models typically operate in the standard framework of generative modelling by producing continuously-valued datapoints. To this end, they rely on a progressive Gaussian smoothing of the original data distribution, which admits an SDE interpretation involving increments of a standard Brownian motion. However, some applications such as text generation or reinforcement learning might naturally be better served by diffusing categorical-valued data, i.e., lifting the diffusion to a space of probability distributions. To this end, this short theoretical note proposes Simplex Diffusion, a means to directly diffuse datapoints located on an n-dimensional probability simplex. We show how this relates to the Dirichlet distribution on the simplex and how the analogous SDE is realized thanks to a multi-dimensional Cox-Ingersoll-Ross process (abbreviated as CIR), previously used in economics and mathematical finance. Finally, we make remarks as to the numerical implementation of trajectories of the CIR process, and discuss some limitations of our approach.
https://arxiv.org/abs/2210.14784
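As a numerical illustration of the last point, here is a small NumPy sketch that simulates independent CIR trajectories with an Euler-Maruyama scheme (clipping at zero so the square root stays defined) and renormalizes each state onto the probability simplex. The discretization and all parameter values are illustrative choices made for this sketch, not the scheme analyzed in the paper.

```python
import numpy as np

def simulate_cir(x0, a, b, sigma, dt=1e-3, n_steps=2000, rng=None):
    """Euler-Maruyama simulation of d independent CIR processes
        dX_t = a * (b - X_t) dt + sigma * sqrt(X_t) dW_t,
    clipping at zero ('full truncation') so sqrt stays real under discretization."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)
        xp = np.maximum(x, 0.0)
        x = x + a * (b - xp) * dt + sigma * np.sqrt(xp) * dw
        x = np.maximum(x, 0.0)
        path.append(x.copy())
    return np.stack(path)                              # (n_steps + 1, d)

# toy usage: diffuse a near-one-hot categorical distribution over 5 classes;
# normalizing the positive coordinates projects each state onto the simplex
x0 = np.array([0.92, 0.02, 0.02, 0.02, 0.02])
path = simulate_cir(x0, a=1.0, b=0.2, sigma=0.5)
simplex_path = path / np.maximum(path.sum(axis=1, keepdims=True), 1e-12)
```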
4. [CL] Can language models handle recursively nested grammatical structures? A case study on comparing models and humans
A K Lampinen
[DeepMind]
How should we compare the capabilities of language models and humans? Here, I consider a case study: processing of recursively nested grammatical structures. Prior work has suggested that language models cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training before being evaluated, while the language models were evaluated zero-shot. I therefore attempt to more closely match the evaluation paradigms by providing language models with few-shot prompts. A simple prompt, which contains substantially less content than the human training, allows large language models to consistently outperform the human results. The same prompt even allows extrapolation to more deeply nested conditions than have been tested in humans. Further, a reanalysis of the prior human experiments suggests that the humans may not perform above chance at the difficult structures initially. These results suggest that large language models can in fact process recursively nested grammatical structures comparably to humans. This case study highlights how discrepancies in the quantity of experiment-specific context can confound comparisons of language models and humans. I use this case study to reflect on the broader challenge of comparing human and model capabilities, and to suggest that there is an important difference between evaluating cognitive models of a specific phenomenon and evaluating broadly-trained models.
https://arxiv.org/abs/2210.15303
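Purely to illustrate the evaluation contrast discussed above, the snippet below assembles the same nested-structure test item either zero-shot or preceded by a short few-shot prompt. The instruction text and example sentences are generic center-embedding items invented for this sketch, not the stimuli or prompts used in the paper.

```python
# Hypothetical contrast between zero-shot and few-shot evaluation of a
# language model on recursively nested (center-embedded) structures.
FEW_SHOT_EXAMPLES = [
    "The dog [that the cat chased] barks.",
    "The key [that the men [that the boy saw] hold] opens the door.",
]

def build_prompt(test_item: str, few_shot: bool) -> str:
    instructions = "Complete each sentence so that it is grammatical.\n\n"
    examples = "".join(f"{ex}\n" for ex in FEW_SHOT_EXAMPLES) if few_shot else ""
    return instructions + examples + test_item

test_item = "The vase [that the artists [that the curator hired] made]"
print(build_prompt(test_item, few_shot=False))   # zero-shot query
print(build_prompt(test_item, few_shot=True))    # few-shot query with two examples
```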
5. [CL] Truncation Sampling as Language Model Desmoothing
J Hewitt, C D. Manning, P Liang
[Stanford University]
Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms, like top-p or top-k, address this by setting some words' probabilities to zero at each step. This work provides a framing for the aim of truncation, and an improved algorithm for that aim. We propose thinking of a neural language model as a mixture of a true distribution and a smoothing distribution that avoids infinite perplexity. In this light, truncation algorithms aim to perform desmoothing, estimating a subset of the support of the true distribution. Finding a good subset is crucial: we show that top-p unnecessarily truncates high-probability words, for example causing it to truncate all words but Trump for a document that starts with Donald. We introduce η-sampling, which truncates words below an entropy-dependent probability threshold. Compared to previous algorithms, η-sampling generates more plausible long English documents according to humans, is better at breaking out of repetition, and behaves more reasonably on a battery of test distributions.
https://arxiv.org/abs/2210.15191
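A minimal NumPy sketch of entropy-dependent truncation in the spirit of η-sampling: tokens whose probability falls below a threshold of the form η = min(ε, √ε · exp(−H(p))) are removed and the remaining mass is renormalized before sampling. The exact threshold expression and the ε values used here follow my reading of the paper and should be treated as indicative rather than a reference implementation.

```python
import numpy as np

def eta_sample(probs, epsilon=9e-4, rng=None):
    """Sample one token after entropy-dependent truncation (eta-sampling style).

    Tokens with probability below eta = min(epsilon, sqrt(epsilon) * exp(-H(p)))
    are zeroed out and the rest renormalized; the threshold form is an assumed
    reading of the paper, not its reference code.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)))
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-entropy))
    keep = probs >= eta
    if not keep.any():                        # degenerate case: keep the argmax
        keep = probs == probs.max()
    truncated = np.where(keep, probs, 0.0)
    truncated /= truncated.sum()
    return rng.choice(len(probs), p=truncated)

# toy 5-token vocabulary; epsilon is exaggerated so the effect is visible:
# the peaked distribution drops its 0.5% tail tokens, the flat one keeps all five
peaked = np.array([0.80, 0.15, 0.04, 0.005, 0.005])
flat = np.full(5, 0.2)
print(eta_sample(peaked, epsilon=0.02), eta_sample(flat, epsilon=0.02))
```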
A few more papers worth noting:
[CL] NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation
P Howard, G Singer, V Lal, Y Choi, S Swayamdipta
[Intel Labs & University of Washington & Allen Institute for AI]
https://arxiv.org/abs/2210.12365
[CL] Lexical Generalization Improves with Larger Models and Longer Training
E Bandel, Y Goldberg, Y Elazar
[Bar Ilan University & Allen Institute for Artificial Intelligence]
https://arxiv.org/abs/2210.12673
[AS] Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
T Saeki, H Zen, Z Chen, N Morioka, G Wang, Y Zhang, A Bapna, A Rosenberg, B Ramabhadran
[Google]
https://arxiv.org/abs/2210.15447
[IR] RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses
H Zhuang, Z Qin, R Jagerman, K Hui, J Ma, J Lu, J Ni, X Wang, M Bendersky
[Google Research]
https://arxiv.org/abs/2210.10634