LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
Reposted from 爱可可爱生活
Summary: teaching models to express their uncertainty in words; Lie point symmetry data augmentation for neural PDE solvers; physics simulation for robotic assembly; fast and memory-efficient exact attention with IO-awareness; learning transferable representations with multimodal masked autoencoders; a high-performance linear vision Transformer without softmax; multi-game decision Transformers; few-shot learning with discriminative pre-trained models; CNNs are myopic (looking locally is enough)
1. [CL] Teaching Models to Express Their Uncertainty in Words
S Lin, J Hilton, O Evans
[University of Oxford & OpenAI]
We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language – without use of model logits. When given a question, the model generates both an answer and a level of confidence (e.g. “90% confidence” or “high confidence”). These levels map to probabilities that are well calibrated. The model also remains moderately calibrated under distribution shift, and is sensitive to uncertainty in its own answers, rather than imitating human examples. To our knowledge, this is the first time a model has been shown to express calibrated uncertainty about its own answers in natural language. For testing calibration, we introduce the CalibratedMath suite of tasks. We compare the calibration of uncertainty expressed in words (“verbalized probability”) to uncertainty extracted from model logits. Both kinds of uncertainty are capable of generalizing calibration under distribution shift. We also provide evidence that GPT-3’s ability to generalize calibration depends on pre-trained latent representations that correlate with epistemic uncertainty over its answers.
https://arxiv.org/abs/2205.14334
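To make the evaluation concrete, below is a minimal sketch of how verbalized confidence can be scored for calibration: each confidence phrase is mapped to a probability, and the expected calibration error is computed over a set of graded answers. The phrase-to-probability mapping, the bin count, and the toy data are illustrative assumptions, not values from the paper.

```python
# Minimal calibration-scoring sketch (illustrative assumptions, not the paper's code).

# Assumed mapping from verbalized confidence levels to probabilities.
CONFIDENCE_MAP = {"lowest": 0.1, "low": 0.3, "medium": 0.5, "high": 0.7, "highest": 0.9}

def expected_calibration_error(samples, n_bins=5):
    """samples: list of (verbalized_confidence, answer_was_correct) pairs."""
    bins = [[] for _ in range(n_bins)]
    for phrase, correct in samples:
        p = CONFIDENCE_MAP[phrase]
        idx = min(int(p * n_bins), n_bins - 1)   # bucket by stated probability
        bins[idx].append((p, float(correct)))
    ece, total = 0.0, len(samples)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(c for _, c in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Toy check: a model that says "high" and is right 70% of the time is well calibrated.
samples = [("high", True)] * 7 + [("high", False)] * 3 + \
          [("low", True)] * 3 + [("low", False)] * 7
print(f"ECE = {expected_calibration_error(samples):.3f}")  # -> ECE = 0.000
```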
2. [LG] Lie Point Symmetry Data Augmentation for Neural PDE Solvers
J Brandstetter, M Welling, D E. Worrall
[University of Amsterdam & Qualcomm AI Research]
Neural networks are increasingly being used to solve partial differential equations (PDEs), replacing slower numerical solvers. However, a critical issue is that neural PDE solvers require high-quality ground truth data, which usually must come from the very solvers they are designed to replace. Thus, we are presented with a proverbial chicken-and-egg problem. In this paper, we present a method, which can partially alleviate this problem, by improving neural PDE solver sample complexity—Lie point symmetry data augmentation (LPSDA). In the context of PDEs, it turns out that we are able to quantitatively derive an exhaustive list of data transformations, based on the Lie point symmetry group of the PDEs in question, something not possible in other application areas. We present this framework and demonstrate how it can easily be deployed to improve neural PDE solver sample complexity by an order of magnitude.
https://arxiv.org/abs/2202.07643
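As a concrete illustration of LPSDA, the NumPy sketch below applies two standard Lie point symmetries of Burgers' equation (u_t + u·u_x = ν·u_xx) to a stored solution: a periodic space translation and a Galilean boost. Each maps a valid solution to another valid solution, so augmented training samples come for free. The grid conventions and the nearest-cell shift are simplifying assumptions; this is not the authors' implementation.

```python
import numpy as np

# u has shape (nt, nx): a solution of Burgers' equation on a periodic spatial
# grid with uniform spacings dx (space) and dt (time).

def space_translate(u, shift_cells):
    """Symmetry x -> x + a: shift every time slice periodically."""
    return np.roll(u, shift_cells, axis=1)

def galilean_boost(u, v, dx, dt):
    """Symmetry (x, u) -> (x + v*t, u + v): view the solution from a frame
    moving at speed v. Each time slice t is shifted by v*t (nearest cell)."""
    out = np.empty_like(u)
    for k in range(u.shape[0]):
        cells = int(round(v * k * dt / dx))   # displacement v*t in grid cells
        out[k] = np.roll(u[k], cells) + v
    return out

# Toy usage (a real use would start from a numerical PDE solution):
u = np.random.randn(50, 128)
augmented = [space_translate(u, 10), galilean_boost(u, v=0.5, dx=0.1, dt=0.01)]
```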
3. [RO] Factory: Fast Contact for Robotic Assembly
Y Narang, K Storey, I Akinola, M Macklin, P Reist, L Wawrzyniak, Y Guo, A Moravanszky, G State, M Lu, A Handa, D Fox
[NVIDIA Corporation]
Robotic assembly is one of the oldest and most challenging applications of robotics. In other areas of robotics, such as perception and grasping, simulation has rapidly accelerated research progress, particularly when combined with modern deep learning. However, accurately, efficiently, and robustly simulating the range of contact-rich interactions in assembly remains a longstanding challenge. In this work, we present Factory, a set of physics simulation methods and robot learning tools for such applications. We achieve real-time or faster simulation of a wide range of contact-rich scenes, including simultaneous simulation of 1000 nut-and-bolt interactions. We provide 60 carefully-designed part models, 3 robotic assembly environments, and 7 robot controllers for training and testing virtual robots. Finally, we train and evaluate proof-of-concept reinforcement learning policies for nut-and-bolt assembly. We aim for Factory to open the doors to using simulation for robotic assembly, as well as many other contact-rich applications in robotics. Please see our website for supplementary content, including videos.
https://arxiv.org/abs/2205.03532
4. [LG] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
T Dao, D Y. Fu, S Ermon, A Rudra, C Ré
[Stanford University & University at Buffalo]
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware—accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
https://arxiv.org/abs/2205.14135
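The core algorithmic idea, computing softmax attention one key/value tile at a time with a running-max ("online softmax") correction so that the full N×N score matrix is never materialized, can be sketched in plain NumPy. Note this shows only the math of the tiling; the paper's speedup comes from a fused GPU kernel that keeps each tile in on-chip SRAM.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact attention computed over key/value tiles, never materializing the
    full (N x N) score matrix. A NumPy sketch of the tiling math only."""
    N, d = Q.shape
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(N, -np.inf)      # running max of scores per query row
    row_sum = np.zeros(N)              # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        S = (Q @ Kb.T) * scale                   # scores for this tile only
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])         # tile's softmax numerators
        correction = np.exp(row_max - new_max)   # rescale earlier partial sums
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the naive quadratic-memory implementation:
Q, K, V = (np.random.randn(256, 32) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(tiled_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```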
5. [CV] Multimodal Masked Autoencoders Learn Transferable Representations
X Geng, H Liu, L Lee, D Schuurmans, S Levine, P Abbeel
[UC Berkeley & Google Brain]
Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%), in contrast to BERT whose standard masking ratio is 15%, due to the joint training of two data modalities. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language. Lastly, we demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data as well as unpaired data.
https://arxiv.org/abs/2205.14204
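A minimal PyTorch sketch of the training objective follows: image patches and text tokens are embedded into one sequence, a large fraction of each modality is masked, and a single shared encoder learns to reconstruct masked patches (regression) and predict masked tokens (classification). For brevity, masked positions are replaced with a learned mask token rather than dropped from the encoder as in MAE-style models, positional embeddings are omitted, and all sizes and mask ratios are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyM3AE(nn.Module):
    """Illustrative M3AE-style model: one encoder over both modalities."""
    def __init__(self, patch_dim=48, vocab=1000, d=128, img_mask=0.75, text_mask=0.75):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d)    # image patches -> tokens
        self.text_embed = nn.Embedding(vocab, d)      # text ids -> tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.patch_head = nn.Linear(d, patch_dim)     # reconstruct masked patches
        self.text_head = nn.Linear(d, vocab)          # predict masked token ids
        self.img_mask, self.text_mask = img_mask, text_mask

    def forward(self, patches, text_ids):
        img_tok = self.patch_embed(patches)                   # (B, n_img, d)
        txt_tok = self.text_embed(text_ids)                   # (B, n_txt, d)
        x = torch.cat([img_tok, txt_tok], dim=1)              # one multimodal sequence
        n_img = img_tok.shape[1]
        ratio = torch.cat([torch.full((n_img,), self.img_mask),
                           torch.full((text_ids.shape[1],), self.text_mask)])
        mask = torch.rand(x.shape[:2], device=x.device) < ratio.to(x.device)
        x = torch.where(mask[..., None], self.mask_token.expand_as(x), x)
        h = self.encoder(x)                                   # shared encoder
        img_loss = (self.patch_head(h[:, :n_img]) - patches)[mask[:, :n_img]].pow(2).mean()
        txt_loss = nn.functional.cross_entropy(
            self.text_head(h[:, n_img:])[mask[:, n_img:]], text_ids[mask[:, n_img:]])
        return img_loss + txt_loss

# Toy usage: 16 patches of dim 48 plus 8 text tokens per example.
model = TinyM3AE()
loss = model(torch.randn(2, 16, 48), torch.randint(0, 1000, (2, 8)))
loss.backward()
```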
A few other papers worth noting:
[CV] X-ViT: High Performance Linear Vision Transformer without Softmax
J Song, H Lee
[Kakao Enterprise]
https://arxiv.org/abs/2205.13805
[LG] Multi-Game Decision Transformers
K Lee, O Nachum, M Yang, L Lee, D Freeman, W Xu, S Guadarrama, I Fischer, E Jang, H Michalewski, I Mordatch
[Google Research]
https://arxiv.org/abs/2205.15241
[CL] Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models
M Xia, M Artetxe, J Du, D Chen, V Stoyanov
[Princeton University & Meta AI]
https://arxiv.org/abs/2205.15223
[CV] CNNs are Myopic
V C. Madala, S Chandrasekaran
[University of California, Santa Barbara]
https://arxiv.org/abs/2205.10760