LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics

Reposted from 爱可可爱生活

 

1. [LG] Compute Trends Across Three Eras of Machine Learning

J Sevilla, L Heim, A Ho, T Besiroglu, M Hobbhahn, P Villalobos

[University of Aberdeen & Centre for the Governance of AI & MIT & University of Tübingen & Complutense University of Madrid]

Compute trends across three eras of machine learning. Advances in compute, data, and algorithms are the three fundamental factors driving progress in modern machine learning (ML). This paper studies trends in the most readily quantified of these factors, compute, using a dataset of training compute for more than 100 milestone ML systems and analyzing how this quantity has grown over time. Before 2010, growth in training compute tracked Moore's law, doubling roughly every 20 months. Since the advent of deep learning in the early 2010s, training compute has scaled faster, doubling roughly every 6 months. In late 2015, a new trend emerged as firms developed large-scale ML models such as AlphaGo with 10- to 100-fold larger training compute requirements. Based on these observations, the history of compute in ML is split into three eras: the Pre-Deep Learning Era, the Deep Learning Era, and the Large-Scale Era. Framing the compute trend in terms of these three eras helps explain the discontinuities observed in the data. Overall, this work highlights the fast-growing compute requirements for training advanced ML systems.

Compute, data, and algorithmic advances are the three fundamental factors that guide the progress of modern Machine Learning (ML). In this paper we study trends in the most readily quantified factor – compute. We show that before 2010 training compute grew in line with Moore’s law, doubling roughly every 20 months. Since the advent of Deep Learning in the early 2010s, the scaling of training compute has accelerated, doubling approximately every 6 months. In late 2015, a new trend emerged as firms developed large-scale ML models with 10 to 100-fold larger requirements in training compute. Based on these observations we split the history of compute in ML into three eras: the Pre Deep Learning Era, the Deep Learning Era and the Large-Scale Era. Overall, our work highlights the fast-growing compute requirements for training advanced ML systems.
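To make the doubling-time framing concrete, here is a minimal sketch of how one could estimate a doubling time from a handful of (year, training FLOPs) points by fitting a line to log2(compute). The data values and the `years`/`flops` arrays below are hypothetical placeholders, not the paper's dataset.

```python
import numpy as np

# Hypothetical (year, training FLOPs) milestones, for illustration only;
# the paper's actual dataset covers 100+ milestone ML systems.
years = np.array([2012.5, 2014.0, 2015.5, 2017.0, 2018.5, 2020.0])
flops = np.array([1e17, 1e18, 4e18, 2e19, 1e20, 6e20])

# Fit log2(compute) = a * year + b; the slope a is doublings per year,
# so the doubling time in months is 12 / a.
a, b = np.polyfit(years, np.log2(flops), 1)
print(f"estimated doubling time: {12 / a:.1f} months")
```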

 

 

2. [CV] StandardSim: A Synthetic Dataset For Retail Environments

C Mata, N Locascio, M A Sheikh, K Kihara, D Fischetti

[Stony Brook University & Standard Cognition]

StandardSim: a synthetic dataset for retail environments. Autonomous checkout systems rely on visual and sensory inputs to carry out fine-grained scene understanding in retail environments. Compared with typical indoor scenes, retail environments pose unique challenges because of the vast number of densely packed, unique yet similar objects. The problem becomes even harder when only RGB input is available, especially for data-hungry tasks such as instance segmentation. To address the lack of datasets for retail, this paper presents StandardSim, a large-scale photorealistic synthetic dataset with annotations for semantic segmentation, instance segmentation, depth estimation, and object detection. The dataset provides multiple views per scene, enabling multi-view representation learning. It also introduces a new task central to autonomous checkout, change detection, which requires pixel-level classification of takes, puts, and shifts of objects over time. Widely used segmentation and depth estimation models are benchmarked on the dataset, showing that its test set constitutes a difficult benchmark compared with current smaller-scale datasets and that its training set provides models with information crucial for autonomous checkout tasks.

Autonomous checkout systems rely on visual and sensory inputs to carry out fine-grained scene understanding in retail environments. Retail environments present unique challenges compared to typical indoor scenes owing to the vast number of densely packed, unique yet similar objects. The problem becomes even more difficult when only RGB input is available, especially for data-hungry tasks such as instance segmentation. To address the lack of datasets for retail, we present StandardSim, a large-scale photorealistic synthetic dataset featuring annotations for semantic segmentation, instance segmentation, depth estimation, and object detection. Our dataset provides multiple views per scene, enabling multi-view representation learning. Further, we introduce a novel task central to autonomous checkout called change detection, requiring pixel-level classification of takes, puts and shifts in objects over time. We benchmark widely-used models for segmentation and depth estimation on our dataset, and show that our test set constitutes a difficult benchmark compared to current smaller-scale datasets and that our training set provides models with crucial information for autonomous checkout tasks.
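As an illustration of what pixel-level take/put/shift labels could look like, below is a minimal sketch that derives change-detection labels from two instance-ID maps of the same scene at different times. The function name, label encoding, and mask format are assumptions made for this example and do not reflect StandardSim's actual annotation format.

```python
import numpy as np

NO_CHANGE, TAKE, PUT, SHIFT = 0, 1, 2, 3

def change_labels(ids_before: np.ndarray, ids_after: np.ndarray) -> np.ndarray:
    """Per-pixel change labels from two instance-ID maps (0 = background)."""
    labels = np.full(ids_before.shape, NO_CHANGE, dtype=np.uint8)
    before = set(np.unique(ids_before)) - {0}
    after = set(np.unique(ids_after)) - {0}
    for obj in before | after:
        mask_b = ids_before == obj
        mask_a = ids_after == obj
        if obj not in after:
            labels[mask_b] = TAKE            # object was removed from the scene
        elif obj not in before:
            labels[mask_a] = PUT             # object was newly placed
        elif not np.array_equal(mask_b, mask_a):
            labels[mask_b ^ mask_a] = SHIFT  # same object, different position
    return labels
```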

 

 

3. [CV] Don't Lie to Me! Robust and Efficient Explainability with Verified Perturbation Analysis

T Fel, M Ducoffe, D Vigouroux, R Cadene, M Capelle, C Nicodeme, T Serre

[Brown University & Airbus AI Research & IRT Saint-Exupery]

Robust and efficient explainability with verified perturbation analysis. A variety of methods have been proposed to explain how deep neural networks make their decisions. Key to these approaches is the need to sample the pixel space efficiently in order to derive importance maps. However, the sampling methods used to date have been shown to introduce biases and other artifacts, leading to inaccurate estimates of the importance of individual pixels and severely limiting the reliability of current explainability methods. Unfortunately, the alternative, exhaustively sampling the image space, is computationally prohibitive. This paper introduces EVA (Explaining using Verified perturbation Analysis), the first explainability method guaranteed to explore a perturbation space exhaustively. It leverages the beneficial properties of verified perturbation analysis, namely time efficiency, tractability, and guaranteed complete coverage of a manifold, to efficiently characterize the input variables that are most likely to drive the model's decision. The approach is evaluated systematically and achieves state-of-the-art results on multiple benchmarks.

A variety of methods have been proposed to try to explain how deep neural networks make their decisions. Key to those approaches is the need to sample the pixel space efficiently in order to derive importance maps. However, it has been shown that the sampling methods used to date introduce biases and other artifacts, leading to inaccurate estimates of the importance of individual pixels and severely limiting the reliability of current explainability methods. Unfortunately, the alternative – to exhaustively sample the image space – is computationally prohibitive. In this paper, we introduce EVA (Explaining using Verified perturbation Analysis) – the first explainability method guaranteed to have an exhaustive exploration of a perturbation space. Specifically, we leverage the beneficial properties of verified perturbation analysis – time efficiency, tractability and guaranteed complete coverage of a manifold – to efficiently characterize the input variables that are most likely to drive the model decision. We evaluate the approach systematically and demonstrate state-of-the-art results on multiple benchmarks.
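For contrast with the verified approach, the sketch below shows a naive occlusion-style importance map, not EVA itself: it perturbs one patch at a time and records the drop in a scalar model score, which illustrates the kind of sampling burden that verified perturbation analysis is designed to sidestep. The `model` callable, patch size, and baseline value are assumed placeholders.

```python
import numpy as np

def occlusion_map(model, image, patch=8, baseline=0.0):
    """Naive occlusion-based importance map (illustrative baseline, not EVA).

    Assumes `model(image)` returns a scalar class score and that the image
    height and width are divisible by `patch`.
    """
    h, w = image.shape[:2]
    score = model(image)
    importance = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            perturbed = image.copy()
            perturbed[i:i + patch, j:j + patch] = baseline  # occlude one patch
            importance[i // patch, j // patch] = score - model(perturbed)
    return importance
```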

 

 

4. [LG] The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink

D Patterson, J Gonzalez, U Hölzle...

[Google & UC Berkeley]

The carbon footprint of machine learning training will plateau, then shrink. Machine learning (ML) workloads have rapidly grown in importance, but have raised concerns about their carbon footprint. This paper presents four best practices, the 4Ms (model, machine, mechanization, map), that can reduce ML training energy by up to 100x and CO2 emissions by up to 1000x, and shows that recent papers have overestimated the cost and carbon footprint of ML training by 100x to 100,000x. By following these best practices, overall ML energy use (across research, development, and production) has held steady at under 15% of Google's total energy use for the past three years. If the whole ML field adopts best practices, the paper predicts that total carbon emissions from training will decline by 2030.

Machine Learning (ML) workloads have rapidly grown in importance, but raised concerns about their carbon footprint. We show four best practices to reduce ML training energy by up to 100x and CO2 emissions up to 1000x, and that recent papers overestimated the cost and carbon footprint of ML training by 100x–100,000x. Finally, we show that by following best practices, overall ML energy use (across research, development, and production) held steady at <15% of Google’s total energy use for the past three years. If the whole ML field adopts best practices, we predict that by 2030 total carbon emissions from training will reduce.
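The headline reductions come from multiplying a few factors (a more efficient model, more efficient machines, datacenter overhead, and the grid's carbon intensity), which can be sketched with standard back-of-the-envelope accounting. All numbers below are hypothetical illustrations, not figures from the paper.

```python
# Back-of-the-envelope training emissions:
#   energy (kWh) = hours * accelerators * watts_per_accelerator * PUE / 1000
#   CO2e (kg)    = energy * grid carbon intensity (kg CO2e per kWh)
hours, accelerators, watts = 24 * 14, 64, 300   # e.g. two weeks on 64 chips (hypothetical)
pue, kg_co2e_per_kwh = 1.1, 0.08                # efficient datacenter, low-carbon grid (hypothetical)

energy_kwh = hours * accelerators * watts * pue / 1000
print(f"{energy_kwh:.0f} kWh, {energy_kwh * kg_co2e_per_kwh:.0f} kg CO2e")
```

Because the terms multiply, improving each factor by one to two orders of magnitude compounds into the 100x energy and 1000x emission reductions the paper describes.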

 

 

5. [LG] General-purpose, long-context autoregressive modeling with Perceiver AR

C Hawthorne, A Jaegle, C Cangea, S Borgeaud, C Nash, M Malinowski, S Dieleman, O Vinyals, M Botvinick, I Simon, H Sheahan, N Zeghidour, J Alayrac, J Carreira, J Engel

[Google Research & DeepMind]

General-purpose, long-context autoregressive modeling with Perceiver AR. Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. This paper presents Perceiver AR, an autoregressive, modality-agnostic architecture that uses cross-attention to map long-range inputs onto a small number of latents while maintaining end-to-end causal masking, enabling autoregressive modeling over long contexts. Perceiver AR can directly attend to over a hundred thousand tokens and scale to the depth needed for density estimation on real-world data, making practical long-context density estimation possible without hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. By decoupling the compute required to process many inputs from that required to build a deep network, Perceiver AR obtains state-of-the-art likelihoods on long-sequence benchmarks, including 64x64 ImageNet images and PG-19 books.

Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 × 64 ImageNet images and PG-19 books.
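The core compression step, cross-attending from a small set of latents to a very long input, can be sketched in a few lines. This is a simplified single-head version for illustration only; it omits the causal masking, multi-head attention, and latent self-attention stack of the actual architecture, and all shapes and values are assumed.

```python
import numpy as np

def cross_attend(latents, inputs):
    """Single-head cross-attention: latents (L, d) query a long input (T, d), T >> L."""
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)                 # (L, T) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over the T inputs
    return weights @ inputs                                  # (L, d) compressed summary

rng = np.random.default_rng(0)
summary = cross_attend(rng.normal(size=(16, 64)), rng.normal(size=(100_000, 64)))
print(summary.shape)  # (16, 64): 100k input positions compressed into 16 latents
```

The per-layer cost of the subsequent latent stack then depends on the number of latents rather than the full input length, which is why a deep network over long contexts becomes affordable.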

 

 

A few more papers worth noting

 

[RO] Autonomous Vehicles on the Edge: A Survey on Autonomous Vehicle Racing

A survey of autonomous vehicle racing

J Betz, H Zheng, A Liniger, U Rosolia, P Karle, M Behl, V Krovi, R Mangharam

[University of Pennsylvania & ETH Zurich & California Institute of Technology & Technical University of Munich & University of Virginia & Clemson University]

 

 

[LG] Augmenting Neural Networks with Priors on Function Values

Augmenting neural networks with priors on function values

H Nisonoff, Y Wang, J Listgarten

[UC Berkeley & University of Michigan]

 

 

[LG] Memory via Temporal Delays in weightless Spiking Neural Network

Memory via temporal delays in weightless spiking neural networks

H Hazan, S Caby, C Earl, H Siegelmann, M Levin

[Tufts University & University of Massachusetts]

 

 

[LG] Importance Weighting Approach in Kernel Bayes' Rule

An importance weighting approach to the kernel Bayes' rule

L Xu, Y Chen, A Doucet, A Gretton

[Gatsby Unit & DeepMind]

 

 
