爱可可 AI Frontier Picks (2.6)

LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics

Reposted from 爱可可爱生活

1、[LG] ETSformer: Exponential Smoothing Transformers for Time-series Forecasting

G Woo, C Liu, D Sahoo, A Kumar, S Hoi

[Salesforce Research Asia & Singapore Management University]

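ETSformer: an exponential smoothing Transformer for time-series forecasting. Transformers have been actively studied for time-series forecasting in recent years. Although they often show promising results, conventional Transformers are not designed to fully exploit the characteristics of time-series data and therefore have some fundamental limitations: they generally lack decomposition capability and interpretability, and are neither effective nor efficient for long-term forecasting. This paper proposes ETSformer, a new time-series Transformer architecture that uses the principle of exponential smoothing to improve Transformers for time-series forecasting. Inspired by classical exponential smoothing methods, it introduces exponential smoothing attention (ESA) and frequency attention (FA) to replace the self-attention mechanism of the vanilla Transformer, improving accuracy and efficiency and achieving O(L log L) complexity, where L is the length of the lookback window. On this basis, the Transformer architecture is redesigned with modular decomposition blocks so that it learns to decompose time-series data into interpretable components such as level, growth, and seasonality. The method achieves state-of-the-art performance on six real-world datasets, beating competing baselines in 35 of 40 settings for multivariate forecasting and in 17 of 23 settings for univariate forecasting, which validates its efficacy and advantages.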

Transformers have been actively studied for time-series forecasting in recent years. While often showing promising results in various scenarios, traditional Transformers are not designed to fully exploit the characteristics of time-series data and thus suffer some fundamental limitations, e.g., they generally lack decomposition capability and interpretability, and are neither effective nor efficient for long-term forecasting. In this paper, we propose ETSformer, a novel time-series Transformer architecture, which exploits the principle of exponential smoothing in improving Transformers for time-series forecasting. In particular, inspired by the classical exponential smoothing methods in time-series forecasting, we propose the novel exponential smoothing attention (ESA) and frequency attention (FA) to replace the self-attention mechanism in vanilla Transformers, thus improving both accuracy and efficiency. Based on these, we redesign the Transformer architecture with modular decomposition blocks such that it can learn to decompose the time-series data into interpretable time-series components such as level, growth and seasonality. Extensive experiments on various time-series benchmarks validate the efficacy and advantages of the proposed method. The code and models of our implementations will be released.
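For readers unfamiliar with the classical method that the abstract cites as inspiration, here is a minimal sketch of simple exponential smoothing. It is not the paper's ESA or FA mechanism; the function names and the choice to initialise the level with the first observation are assumptions made for illustration.

```python
def exponential_smoothing(series, alpha):
    """Classical simple exponential smoothing.

    Recursion: s_t = alpha * x_t + (1 - alpha) * s_{t-1}, with s_0 = x_0
    (initialising with the first observation is an assumption of this sketch).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [float(series[0])]
    for x in series[1:]:
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


def smoothing_weights(length, alpha):
    """Weights the unrolled recursion puts on x_t, x_{t-1}, ...: alpha * (1 - alpha)^j."""
    return [alpha * (1.0 - alpha) ** j for j in range(length)]
```

The second helper makes the key property explicit: the weight on an observation decays exponentially with its age, so recent points dominate the smoothed level.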

2、[AS] DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

S Liu, D Su, D Yu

[Tencent AI Lab]

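DiffGAN-TTS: high-fidelity and efficient text-to-speech with denoising diffusion GANs. Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been applied to a variety of speech synthesis problems. Because of their high sampling cost, however, DDPMs are hard to use in real-time speech processing applications. This paper proposes DiffGAN-TTS, a new DDPM-based text-to-speech (TTS) model that achieves high-fidelity and efficient speech synthesis. DiffGAN-TTS builds on denoising diffusion generative adversarial networks (GANs), which use an adversarially trained expressive model to approximate the denoising distribution. Multi-speaker TTS experiments show that DiffGAN-TTS can generate high-fidelity speech samples with only 4 denoising steps. An active shallow diffusion mechanism is proposed to further speed up inference. A two-stage training scheme is also proposed, in which a basic TTS acoustic model trained in the first stage provides valuable prior information for the DDPM trained in the second stage. Experiments show that DiffGAN-TTS can reach high synthesis performance with only one denoising step.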

Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been used to solve a variety of speech synthesis problems. However, because of their high sampling costs, DDPMs are difficult to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising diffusion generative adversarial networks (GANs), which adopt an adversarially-trained expressive model to approximate the denoising distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can generate high-fidelity speech samples within only 4 denoising steps. We present an active shallow diffusion mechanism to further speed up inference. A two-stage training scheme is proposed, with a basic TTS acoustic model trained at stage one providing valuable prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.

3、[CL] Unified Scaling Laws for Routed Language Models

A Clark, D d l Casas, A Guy, A Mensch, M Paganini, J Hoffmann, B Damoc, B Hechtman, T Cai, S Borgeaud...

[DeepMind]

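Unified scaling laws for routed language models. The performance of a language model has been shown to be well modeled as a power law in its parameter count. This paper studies the scaling behavior of routing networks: architectures that conditionally use only a subset of their parameters when processing an input. For these models, parameter count and computational requirement form two independent axes, and an increase along either one leads to better performance. The paper derives and justifies scaling laws defined on these two variables, which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained with three different techniques. It then gives two applications of these laws: deriving an Effective Parameter Count along which all models scale at the same rate, and using the scaling coefficients to compare the three routing techniques quantitatively.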

The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
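As background for the first sentence of the abstract, a single-variable power law loss = a * N^(-b) can be fitted by linear regression in log-log space. The sketch below covers only that one-variable case; it does not reproduce the paper's two-variable law, and the data are synthetic values generated from an assumed a = 10 and b = 0.1, not results from the paper.

```python
import math


def fit_power_law(param_counts, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(N); returns (a, b)."""
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free points from loss = 10 * N^(-0.1) (assumed values for illustration).
synthetic_counts = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [10.0 * n ** -0.1 for n in synthetic_counts]
a, b = fit_power_law(synthetic_counts, synthetic_losses)
```

On noise-free data the fit recovers the generating coefficients up to floating-point error; with real measurements the same regression gives the best-fit exponent.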

4、[LG] Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning

D Yarats, D Brandfonbrener, H Liu, M Laskin, P Abbeel, A Lazaric, L Pinto

[New York University & UC Berkeley & Facebook AI Research]

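Don't change the algorithm, change the data: exploratory data for offline reinforcement learning. Recent progress in deep learning has relied on access to large and diverse datasets. Such data-driven progress has been less evident in offline reinforcement learning (RL), because offline RL data is usually collected to optimize a specific target task, which limits its diversity. This paper proposes Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL. ExORL generates data through unsupervised, reward-free exploration, then relabels that data with a downstream reward before training a policy with offline RL. Exploratory data allows vanilla off-policy RL algorithms, without any offline-specific modifications, to outperform or match state-of-the-art offline RL algorithms on downstream tasks. For offline RL, data generation is as important as algorithmic advances and deserves careful consideration from the community.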

Recent progress in deep learning has relied on access to large and diverse datasets. Such data-driven progress has been less evident in offline reinforcement learning (RL), because offline RL data is usually collected to optimize specific target tasks, limiting the data’s diversity. In this work, we propose Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL. ExORL first generates data with unsupervised reward-free exploration, then relabels this data with a downstream reward before training a policy with offline RL. We find that exploratory data allows vanilla off-policy RL algorithms, without any offline-specific modifications, to outperform or match state-of-the-art offline RL algorithms on downstream tasks. Our findings suggest that data generation is as important as algorithmic advances for offline RL and hence requires careful consideration from the community.
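The relabeling step described above can be sketched as follows. This is an illustration of the idea only, not the authors' code: the `Transition` and `LabeledTransition` types, the `relabel` function, and the `reach_origin_reward` function are all hypothetical names, and the reward is an invented example of a downstream task.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple

State = Tuple[float, ...]


class Transition(NamedTuple):
    """Reward-free transition, as collected by an exploration policy."""
    state: State
    action: int
    next_state: State


class LabeledTransition(NamedTuple):
    """Transition with a downstream reward attached, ready for offline RL."""
    state: State
    action: int
    reward: float
    next_state: State


def relabel(
    dataset: Sequence[Transition],
    reward_fn: Callable[[State, int, State], float],
) -> List[LabeledTransition]:
    """Attach a task reward to every stored transition without collecting new data."""
    return [
        LabeledTransition(t.state, t.action, reward_fn(t.state, t.action, t.next_state), t.next_state)
        for t in dataset
    ]


def reach_origin_reward(state: State, action: int, next_state: State) -> float:
    """Hypothetical downstream reward: negative L1 distance of the next state from the origin."""
    return -sum(abs(coordinate) for coordinate in next_state)
```

Because the exploration data carries no reward, the same dataset can be relabeled with a different `reward_fn` for each downstream task.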

5、[LG] Quantifying Relevance in Learning and Inference

M Marsili, Y Roudi

[The Abdus Salam International Centre for Theoretical Physics & Norwegian University of Science and Technology (NTNU)]

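Quantifying relevance in learning and inference. Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain, or our societies. Yet the puzzling success of artificial intelligence and machine learning shows that our conceptual understanding of learning is still poor. These applications push statistical inference into uncharted territory, where data is high-dimensional and scarce and prior information on the "true" model is scant if not entirely absent. This paper reviews recent progress in understanding learning, based on the notion of "relevance". Relevance, as defined here, quantifies the amount of information that a dataset, or the internal representation of a learning machine, contains about the generative model of the data. This makes it possible to define maximally informative samples on the one hand and optimal learning machines on the other. These are ideal limits of samples and machines that, at a given resolution (or level of compression), contain the maximal amount of information about the unknown generative process. Both ideal limits exhibit critical features in the statistical sense: maximally informative samples are characterised by a power-law frequency distribution (statistical criticality), and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. The two regimes are separated by a special point characterised by Zipf's law statistics. This identifies samples obeying Zipf's law as the most compressed lossless representations, which are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests as an exponential degeneracy of energy levels, which leads to unusual thermodynamic properties. This distinctive feature is consistent with the invariance of the classification under coarse graining of the output, a desirable property of learning machines. The theoretical framework is corroborated by empirical analysis showing i) how the concept of relevance can help identify relevant variables in high-dimensional inference, and ii) that widely used machine learning architectures approach the ideal limit of optimal learning machines reasonably well, within the limits of the data on which they are trained.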

Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on “true” models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of “relevance”. The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains on the generative model of the data. This allows us to define maximally informative samples, on one hand, and optimal learning machines on the other. These are ideal limits of samples and of machines, that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: Maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf’s law statistics. This identifies samples obeying Zipf’s law as the most compressed lossless representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, that leads to unusual thermodynamic properties. This distinctive feature is consistent with the invariance of the classification under coarse graining of the output, which is a desirable property of learning machines. This theoretical framework is corroborated by empirical analysis showing i) how the concept of relevance can be useful to identify relevant variables in high-dimensional inference and ii) that widely used machine learning architectures approach reasonably well the ideal limit of optimal learning machines, within the limits of the data with which they are trained.
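The abstract gives no formulas, so the following is only a sketch under an assumption: that resolution and relevance of a sample are the empirical entropies used in the authors' related work on maximally informative samples. Under that assumption, with k_s the number of times state s occurs among N observations and m_k the number of states that occur exactly k times, resolution is the entropy of the state frequencies k_s / N and relevance is the entropy of the frequency classes k * m_k / N. The function name is hypothetical.

```python
import math
from collections import Counter


def resolution_and_relevance(sample):
    """Return (resolution, relevance) in nats for a list of hashable observations.

    Assumed definitions (not stated in the abstract):
      resolution = -sum_s (k_s / N) * log(k_s / N)
      relevance  = -sum_k (k * m_k / N) * log(k * m_k / N)
    """
    n = len(sample)
    k_s = Counter(sample)        # k_s: how often each state s occurs
    m_k = Counter(k_s.values())  # m_k: how many states occur exactly k times
    resolution = -sum((k / n) * math.log(k / n) for k in k_s.values())
    relevance = -sum((k * m / n) * math.log(k * m / n) for k, m in m_k.items())
    return resolution, relevance
```

For the sample `["a", "a", "b", "c"]` this gives a resolution of 1.5 * ln 2 and a relevance of ln 2. A sample in which every observation is distinct has maximal resolution but zero relevance, since all states fall into the single frequency class k = 1.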