LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: an adaptive Nesterov momentum algorithm for faster optimization of deep models; holomorphic equilibrium propagation computing exact gradients through finite-size oscillations; the geometry and calculus of losses; offline reinforcement learning vs. behavioral cloning; an importance-weighted kernel Bayes' rule; complementary CNN and Transformer encoders for segmentation; a lexicon-enlightened dense retriever for large-scale retrieval; segmenting objects in images and videos with a self-supervised Transformer and Normalized Cut; the present and future of SLAM in extreme underground environments
1. [LG] Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
X Xie, P Zhou, H Li, Z Lin, S Yan
[Sea AI Lab & Peking University & Nankai University]
Adaptive gradient algorithms borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient and thereby accelerate convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration both in theory and in many empirical cases, has been much less investigated in the adaptive gradient setting. This paper proposes the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. Adan first reformulates vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of evaluating the gradient at the extrapolation point. Adan then adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. The paper further proves that, on nonconvex stochastic problems such as deep learning, Adan finds an ϵ-approximate first-order stationary point within O(ϵ^{-3.5}) stochastic gradient complexity, matching the best-known lower bound. Extensive experiments show that Adan surpasses the corresponding SoTA optimizers on both vision Transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks such as ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can achieve higher or comparable performance on networks such as ViT and ResNet with half the training cost (epochs) of SoTA optimizers, and it also shows great tolerance to a wide range of minibatch sizes, e.g., from 1k to 32k.
Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first- and second-order moments of gradient for accelerating convergence. However, Nesterov acceleration which converges faster than heavy ball acceleration in theory and also in many empirical cases is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to speed up the training of deep neural networks effectively. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an ϵ-approximate first-order stationary point within O(ϵ^{-3.5}) stochastic gradient complexity on the nonconvex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks, e.g., ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, etc., and also shows great tolerance to a large range of minibatch size, e.g., from 1k to 32k. We hope Adan can contribute to the development of deep learning by reducing training cost and relieving engineering burden of trying different optimizers on various architectures. Code is released at this https URL.
https://arxiv.org/abs/2208.06677
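For intuition, here is a minimal, illustrative Python sketch of an Adan-style update on a single parameter tensor, following the NME idea described above (track the gradient, the gradient difference, and a second moment of the NME-corrected gradient). Bias corrections, restarts, and decoupled weight decay from the actual algorithm are simplified away, and the function name and default coefficients are assumptions for illustration, not a faithful implementation.

```python
import numpy as np

def adan_style_step(theta, grad, prev_grad, state, lr=1e-3,
                    beta1=0.02, beta2=0.08, beta3=0.01, eps=1e-8):
    """One illustrative Adan-style update on a single parameter tensor.

    m tracks the gradient (first moment), v tracks the gradient difference
    (the Nesterov-style correction), and n tracks a second moment of the
    NME-corrected gradient.
    """
    diff = grad - prev_grad
    state["m"] = (1 - beta1) * state["m"] + beta1 * grad
    state["v"] = (1 - beta2) * state["v"] + beta2 * diff
    # second moment of the Nesterov-momentum-estimated gradient
    nme_grad = grad + (1 - beta2) * diff
    state["n"] = (1 - beta3) * state["n"] + beta3 * nme_grad ** 2
    update = (state["m"] + (1 - beta2) * state["v"]) / (np.sqrt(state["n"]) + eps)
    return theta - lr * update, state

# toy usage on a quadratic: minimize 0.5 * ||theta||^2, theta shrinks toward 0
theta = np.ones(3)
state = {"m": np.zeros(3), "v": np.zeros(3), "n": np.zeros(3)}
prev_grad = np.zeros(3)
for _ in range(200):
    grad = theta                      # gradient of 0.5 * ||theta||^2
    theta, state = adan_style_step(theta, grad, prev_grad, state, lr=0.1)
    prev_grad = grad
```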
2. [LG] Holomorphic Equilibrium Propagation Computes Exact Gradients Through Finite Size Oscillations
A Laborieux, F Zenke
[Friedrich Miescher Institute for Biomedical Research] (2022)
Equilibrium propagation (EP) is an alternative to backpropagation (BP) that allows deep neural networks to be trained with local learning rules. It therefore provides a compelling framework for training neuromorphic systems and for understanding learning in neurobiology. However, EP requires infinitesimal teaching signals, which limits its applicability in noisy physical systems. Moreover, the algorithm requires separate temporal phases and has not been applied to large-scale problems. This paper addresses these issues by extending EP to holomorphic networks. It shows analytically that this extension naturally leads to exact gradients even for finite-amplitude teaching signals. Importantly, the gradient can be computed as the first Fourier coefficient of finite neuronal activity oscillations in continuous time, without requiring separate phases. Numerical simulations further demonstrate that the proposed method permits robust estimation of gradients in the presence of noise and that deeper models benefit from the finite teaching signals. Finally, the paper establishes the first benchmark for EP on the ImageNet 32x32 dataset and shows that it matches the performance of an equivalent network trained with BP.
Equilibrium propagation (EP) is an alternative to backpropagation (BP) that allows the training of deep neural networks with local learning rules. It thus provides a compelling framework for training neuromorphic systems and understanding learning in neurobiology. However, EP requires infinitesimal teaching signals, thereby limiting its applicability in noisy physical systems. Moreover, the algorithm requires separate temporal phases and has not been applied to large-scale problems. Here we address these issues by extending EP to holomorphic networks. We show analytically that this extension naturally leads to exact gradients even for finite-amplitude teaching signals. Importantly, the gradient can be computed as the first Fourier coefficient from finite neuronal activity oscillations in continuous time without requiring separate phases. Further, we demonstrate in numerical simulations that our approach permits robust estimation of gradients in the presence of noise and that deeper models benefit from the finite teaching signals. Finally, we establish the first benchmark for EP on the ImageNet 32x32 dataset and show that it matches the performance of an equivalent network trained with BP. Our work provides analytical insights that enable scaling EP to large-scale problems and establishes a formal framework for how oscillations could support learning in biological and neuromorphic systems.
https://arxiv.org/abs/2209.00530
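The computational core, reading off a gradient as the first Fourier coefficient of a finite-amplitude oscillation, can be illustrated with a generic numerical sketch. This is not the paper's network model: `activity_at` is a hypothetical stand-in for whatever equilibrium activity an oscillating (complex-valued) teaching signal produces, and the toy check only shows why holomorphy makes the finite-amplitude estimate exact for an analytic response.

```python
import numpy as np

def first_fourier_coefficient(activity_at, amplitude, num_points=64):
    """Estimate the first Fourier coefficient of a periodic response.

    activity_at(beta) returns the (complex) equilibrium activity when the
    teaching signal is set to the complex value beta. The paper's result is
    that, for holomorphic dynamics, the gradient is recovered from this
    coefficient even at finite amplitude; here we only do the Fourier average.
    """
    thetas = 2 * np.pi * np.arange(num_points) / num_points
    betas = amplitude * np.exp(1j * thetas)
    samples = np.array([activity_at(b) for b in betas])
    # first Fourier coefficient of the sampled oscillation, rescaled by amplitude
    return np.mean(samples * np.exp(-1j * thetas)) / amplitude

# toy check: for an analytic response s(beta) = s0 + g * beta, the estimate
# recovers the slope g exactly, even for a large finite amplitude.
g_true = 0.7
estimate = first_fourier_coefficient(lambda b: 1.3 + g_true * b, amplitude=0.5)
print(np.real(estimate))   # ~0.7
```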
3. [LG] The Geometry and Calculus of Losses
R C. Williamson, Z Cranko [University of Tubingen]
Statistical decision problems are the foundation of statistical machine learning. The simplest such problems are binary and multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. This paper systematically develops the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as a subgradient of the support function of the convex set; it is consequently automatically proper (calibrated for probability estimation). This perspective offers three new opportunities. First, it enables the development of a fundamental relationship between losses and (anti-)norms that appears not to have been noticed before. Second, it enables a calculus of losses, induced by the calculus of convex sets, that allows interpolation between different losses and is thus a potentially useful design tool for tailoring losses to particular problems; in doing so, the paper builds on and considerably extends existing results on M-sums of convex sets. Third, the perspective leads to a natural theory of 'polar' (or 'inverse') loss functions, which are derived from the polar dual of the convex set defining the loss and which form a natural universal substitution function for Vovk's aggregating algorithm.
Statistical decision problems are the foundation of statistical machine learning. The simplest problems are binary and multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. In this paper we systematically develop the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as the subgradient of the support function of the convex set. It is consequently automatically proper (calibrated for probability estimation). This perspective provides three novel opportunities. It enables the development of a fundamental relationship between losses and (anti)-norms that appears to have not been noticed before. Second, it enables the development of a calculus of losses induced by the calculus of convex sets which allows the interpolation between different losses, and thus is a potentially useful design tool for tailoring losses to particular problems. In doing this we build upon, and considerably extend, existing results on M-sums of convex sets. Third, the perspective leads to a natural theory of 'polar' (or 'inverse') loss functions, which are derived from the polar dual of the convex set defining the loss, and which form a natural universal substitution function for Vovk's aggregating algorithm.
https://arxiv.org/abs/2209.00238
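For readers unfamiliar with the construction, the following generic LaTeX snippet spells out the support-function notation the abstract refers to; the notation is standard convex analysis, not the paper's exact statement, and the sign/orientation convention for the set C (loss vs. gain) is left to the paper.

```latex
% Support function of a compact convex set C \subset \mathbb{R}^n, evaluated
% at a probability vector p in the simplex:
\sigma_C(p) \;=\; \sup_{x \in C} \langle p, x \rangle,
\qquad \ell(q) \in \partial \sigma_C(q) \subseteq C .

% Because \ell(q) lies in C while \ell(p) attains the supremum at p,
\langle p, \ell(q) \rangle \;\le\; \sigma_C(p) \;=\; \langle p, \ell(p) \rangle
\quad \text{for all } p, q,
% i.e. the expected score is optimised at q = p, which is the properness
% (calibration) property mentioned in the abstract.
```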
4. [LG] Should I Run Offline Reinforcement Learning or Behavioral Cloning?
A Kumar, J Hong, A Singh, S Levine
[UC Berkeley]
Offline reinforcement learning (RL) algorithms can acquire effective policies using only previously collected experience, without any online interaction. While it is widely understood that offline RL can extract good policies even from highly suboptimal data, in practice offline RL is often applied to data that resembles demonstrations. In this case, one can also use behavioral cloning (BC) algorithms, which imitate a subset of the dataset via supervised learning. It is natural to ask: when should offline RL be preferred over BC? The goal of this paper is to characterize the environments and dataset compositions for which offline RL leads to better performance than BC. In particular, it characterizes the properties of environments that allow offline RL methods to outperform BC methods even when only expert data is provided. The paper further shows that policies trained on sufficiently noisy suboptimal data can attain better performance than even BC algorithms trained on expert data, especially on long-horizon problems. The theoretical results are validated via extensive experiments on both diagnostic and high-dimensional domains, including robot manipulation, maze navigation, and Atari games, when learning from a variety of data sources. On several practical problems, modern offline RL methods trained on suboptimal, noisy data in sparse-reward domains are observed to outperform cloning the expert data.
Offline reinforcement learning (RL) algorithms can acquire effective policies by utilizing only previously collected experience, without any online interaction. While it is widely understood that offline RL is able to extract good policies even from highly suboptimal data, in practice offline RL is often used with data that resembles demonstrations. In this case, one can also use behavioral cloning (BC) algorithms, which mimic a subset of the dataset via supervised learning. It seems natural to ask: When should we prefer offline RL over BC? In this paper, our goal is to characterize environments and dataset compositions where offline RL leads to better performance than BC. In particular, we characterize the properties of environments that allow offline RL methods to perform better than BC methods even when only provided with expert data. Additionally, we show that policies trained on suboptimal data that is sufficiently noisy can attain better performance than even BC algorithms with expert data, especially on long-horizon problems. We validate our theoretical results via extensive experiments on both diagnostic and high-dimensional domains including robot manipulation, maze navigation and Atari games, when learning from a variety of data sources. We observe that modern offline RL methods trained on suboptimal, noisy data in sparse reward domains outperform cloning the expert data in several practical problems.
https://openreview.net/forum?id=AP1MKT37rJ
5. [LG] Importance Weighted Kernel Bayes’ Rule
L Xu, Y Chen, A Doucet, A Gretton
[Gatsby Unit & DeepMind]
This paper studies a nonparametric approach to Bayesian computation via feature means, in which the expectation of prior features is updated to yield expected posterior features, based on regression from kernel or neural-network features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of the kernel Bayes' rule (KBR). The proposed approach is based on importance weighting, which gives it better numerical stability than the existing approach to KBR, which requires operator inversion. Convergence of the estimator is shown via a novel consistency analysis of the importance-weighting estimator in the infinity norm. The KBR is evaluated on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high-dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR and is competitive with other competing methods.
We study a nonparametric approach to Bayesian computation via feature means, where the expectation of prior features is updated to yield expected posterior features, based on regression from kernel or neural net features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of a kernel Bayes’ rule (KBR). Our approach is based on importance weighting, which results in superior numerical stability to the existing approach to KBR, which requires operator inversion. We show the convergence of the estimator using a novel consistency analysis on the importance weighting estimator in the infinity norm. We evaluate our KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR, and competitive performance with other competing methods.
https://proceedings.mlr.press/v162/xu22a.html
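As a reading aid, the sketch below shows the generic importance-weighted feature-mean computation that the numerical-stability claim refers to, in a simplified setting where likelihood-based importance weights are available in closed form. In the paper itself these quantities are learned from observed data by regression on kernel or neural-net features, so everything here (the RBF feature map, the known log-likelihood, the function names) is an assumption for illustration only.

```python
import numpy as np

def rbf_features(x, centers, bandwidth=1.0):
    """RBF (kernel) feature map of 1-D points x against fixed centers."""
    d2 = (x[:, None] - centers[None, :]) ** 2
    return np.exp(-d2 / (2 * bandwidth ** 2))

def importance_weighted_posterior_feature_mean(prior_samples, loglik, centers):
    """Importance-weighted estimate of a posterior kernel feature mean.

    prior_samples: draws x_i from the prior
    loglik(x):     log-likelihood of the observation given x (assumed known
                   here; learned from data by regression in the paper)
    Returns sum_i w_i * phi(x_i) with self-normalized weights w_i, i.e. a
    weighted feature mean rather than an operator inversion.
    """
    logw = np.array([loglik(x) for x in prior_samples])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    Phi = rbf_features(prior_samples, centers)   # (n_samples, n_centers)
    return w @ Phi                               # posterior feature mean

# toy usage: Gaussian prior, Gaussian likelihood centered at observation y = 0.5
rng = np.random.default_rng(0)
prior_samples = rng.normal(0.0, 1.0, size=500)
centers = np.linspace(-3, 3, 20)
mu_post = importance_weighted_posterior_feature_mean(
    prior_samples, loglik=lambda x: -0.5 * (0.5 - x) ** 2 / 0.25, centers=centers)
print(mu_post.shape)   # (20,)
```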
A few more papers worth noting:
[CV] Cats: Complementary CNN and Transformer Encoders for Segmentation
H Li, D Hu, H Liu, J Wang, I Oguz
[Vanderbilt University]
https://arxiv.org/abs/2208.11572
[CL] LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval
K Zhang, C Tao, T Shen, C Xu, X Geng, B Jiao, D Jiang
[The Ohio State University & Microsoft]
https://arxiv.org/abs/2208.13661
[CV] TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut
Y Wang, X Shen, Y Yuan...
[CNRS & Tencent AI Lab & MIT CSAIL & ...]
https://arxiv.org/abs/2209.00383
[RO] Present and Future of SLAM in Extreme Underground Environments
K Ebadi, L Bernreiter, H Biggie, G Catt, Y Chang, A Chatterjee...
[California Institute of Technology]
https://arxiv.org/abs/2208.01787