LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics
Reposted from 爱可可爱生活
1. [LG] Bayesian Model Selection, the Marginal Likelihood, and Generalization
S Lotfi, P Izmailov, G Benton, M Goldblum, A G Wilson
[New York University]
Bayesian model selection, the marginal likelihood, and generalization. How should we compare hypotheses that are entirely consistent with the observations? The marginal likelihood (Bayesian evidence), which represents the probability of generating the observations from the prior, offers a distinctive answer to this foundational question and automatically encodes Occam's razor. Although the marginal likelihood has been observed to overfit and to be sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. While it provides a powerful mechanism for hypothesis testing and can be practical for hyperparameter tuning, it is misaligned with generalization in many respects. The paper revisits the appealing properties of the marginal likelihood for learning constraints and hypothesis testing, then highlights conceptual and practical problems with using it as a proxy for generalization: the marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can cause both underfitting and overfitting in hyperparameter learning. A partial remedy comes from the conditional marginal likelihood, which is better aligned with generalization and practically valuable for large-scale hyperparameter learning, as in deep kernel learning, where it delivers particularly compelling performance, especially on smaller datasets and transfer-learning problems.
How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam’s razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
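To make the quantity concrete, below is a minimal numpy sketch (illustrative, not the authors' code) contrasting the log marginal likelihood (LML) of a Gaussian-process regressor with the conditional marginal likelihood (CLML). The kernel, dataset, split point m, and hyperparameter grid are all assumptions for the example; the key identity is that the CLML is simply a difference of two LMLs.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(x, x') = v * exp(-|x - x'|^2 / (2 l^2))
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def log_ml(X, y, lengthscale, noise=0.1):
    # log p(y | X, theta) = -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2 pi)
    n = len(y)
    K = rbf_kernel(X, X, lengthscale) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

def conditional_log_ml(X, y, lengthscale, m, noise=0.1):
    # log p(y_{m+1:n} | y_{1:m}, theta) = log p(y_{1:n}) - log p(y_{1:m})
    return log_ml(X, y, lengthscale, noise) - log_ml(X[:m], y[:m], lengthscale, noise)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(60)
for ls in (0.1, 0.5, 1.0, 2.0):  # candidate hyperparameters being compared
    print(f"lengthscale={ls}: LML={log_ml(X, y, ls):.1f}, "
          f"CLML={conditional_log_ml(X, y, ls, m=20):.1f}")
```

Because the CLML only re-scores the data that comes after the first m points, it behaves more like held-out predictive performance, which is the alignment with generalization the paper argues for.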
2. [CV] FreeSOLO: Learning to Segment Objects without Annotations
X Wang, Z Yu, S D Mello, J Kautz, A Anandkumar, C Shen, J M. Alvarez
[The University of Adelaide & NVIDIA & Zhejiang University]
FreeSOLO: learning to segment objects without annotations. Instance segmentation is a fundamental vision task that aims to identify and segment every object in an image, but learning it requires costly annotations such as bounding boxes and segmentation masks. The paper proposes a fully unsupervised method that learns class-agnostic instance segmentation without any annotations. FreeSOLO is a self-supervised instance segmentation framework built on top of the simple instance segmentation method SOLO, together with a novel localization-aware pre-training framework that discovers objects in complex scenes in an unsupervised manner. FreeSOLO achieves 9.8% AP_50 on the challenging COCO dataset, surpassing even several segmentation-proposal methods that use manual annotations. Its box localization significantly outperforms state-of-the-art unsupervised object detection/discovery methods. FreeSOLO also proves to be a strong pre-training method, outperforming state-of-the-art self-supervised pre-training methods by +9.8% AP when fine-tuning instance segmentation with only 5% of COCO masks.
Instance segmentation is a fundamental vision task that aims to recognize and segment each object in an image. However, it requires costly annotations such as bounding boxes and segmentation masks for learning. In this work, we propose a fully unsupervised learning method that learns class-agnostic instance segmentation without any annotations. We present FreeSOLO, a self-supervised instance segmentation framework built on top of the simple instance segmentation method SOLO. Our method also presents a novel localization-aware pre-training framework, where objects can be discovered from complicated scenes in an unsupervised manner. FreeSOLO achieves 9.8% AP_{50} on the challenging COCO dataset, which even outperforms several segmentation proposal methods that use manual annotations. For the first time, we demonstrate unsupervised class-agnostic instance segmentation successfully. FreeSOLO's box localization significantly outperforms state-of-the-art unsupervised object detection/discovery methods, with about 100% relative improvements in COCO AP. FreeSOLO further demonstrates superiority as a strong pre-training method, outperforming state-of-the-art self-supervised pre-training methods by +9.8% AP when fine-tuning instance segmentation with only 5% COCO masks.
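As a rough illustration of how class-agnostic masks can emerge without labels, here is a hedged PyTorch sketch of the query-key correlation idea (a sketch under stated assumptions, not the released FreeSOLO code; the function name, stride, and threshold are invented for the example):

```python
import torch
import torch.nn.functional as F

def free_mask(features, query_stride=4, threshold=0.6):
    """features: (C, H, W) dense self-supervised features of one image."""
    C, H, W = features.shape
    feats = F.normalize(features, dim=0)                 # cosine-similarity space
    queries = feats[:, ::query_stride, ::query_stride]   # (C, h, w) coarse queries
    q = queries.reshape(C, -1).T                         # (Nq, C)
    k = feats.reshape(C, -1)                             # (C, H*W)
    scores = q @ k                                       # (Nq, H*W) similarity maps
    lo = scores.min(1, keepdim=True).values
    hi = scores.max(1, keepdim=True).values
    scores = (scores - lo) / (hi - lo + 1e-6)            # normalize each map to [0, 1]
    return (scores > threshold).reshape(-1, H, W)        # one binary mask per query

# Toy usage: in the paper the features come from a self-supervised dense backbone;
# random features stand in here just to show the shapes.
masks = free_mask(torch.randn(64, 56, 56))
print(masks.shape)  # torch.Size([196, 56, 56])
```

Roughly, such coarse masks then serve as free pseudo-labels for training the SOLO-style segmenter, which is where the weak localization signal gets refined.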
3. [CV] Near Perfect GAN Inversion
Q Feng, V Shah, R Gadde, P Perona, A Martinez
[Amazon]
Near-perfect GAN inversion. To edit a real photo with a generative adversarial network (GAN), a GAN inversion algorithm is needed to identify the latent vector that reproduces the photo exactly. Unfortunately, while existing inversion algorithms can synthesize images similar to real photos, they cannot generate the identical clones required by most applications. The paper derives an algorithm that achieves near-perfect reconstructions: rather than relying on encoder- or optimization-based methods to find an inverse mapping on a fixed generator G(·), it locally adjusts G(·) to better represent the photo to be synthesized. This is done by locally tweaking the learned mapping G(·) so that ‖x − G(z)‖ < ε, where x is the photo to be reproduced, z is the latent vector, ‖·‖ is an appropriate metric, and ε > 0 is a small scalar. The approach not only produces synthetic images indistinguishable from the real photos being replicated; these images also remain readily editable. The algorithm is shown to be effective on a variety of datasets including human faces, animals, and cars, with quantitative evaluations an order of magnitude (or more) better than the current state of the art, and its importance for diversity and inclusion is discussed.
To edit a real photo using Generative Adversarial Networks (GANs), we need a GAN inversion algorithm to identify the latent vector that perfectly reproduces it. Unfortunately, whereas existing inversion algorithms can synthesize images similar to real photos, they cannot generate the identical clones needed in most applications. Here, we derive an algorithm that achieves near perfect reconstructions of photos. Rather than relying on encoder- or optimization-based methods to find an inverse mapping on a fixed generator G(·), we derive an approach to locally adjust G(·) to more optimally represent the photos we wish to synthesize. This is done by locally tweaking the learned mapping G(·) s.t. ‖x − G(z)‖ < ε, with x the photo we wish to reproduce, z the latent vector, ‖ · ‖ an appropriate metric, and ε > 0 a small scalar. We show that this approach can not only produce synthetic images that are indistinguishable from the real photos we wish to replicate, but that these images are readily editable. We demonstrate the effectiveness of the derived algorithm on a variety of datasets including human faces, animals, and cars, and discuss its importance for diversity and inclusion.
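The two-phase scheme in the abstract can be sketched as follows (an illustrative PyTorch rendering, not the paper's implementation; the optimizers, step counts, and the plain L2 metric are assumptions): first find z by latent optimization on a frozen G, then locally fine-tune G's weights until ‖x − G(z)‖ < ε.

```python
import torch

def near_perfect_invert(G, x, z_dim, eps=1e-3, z_steps=500, g_steps=300, lr=1e-2):
    # Phase 1: standard latent optimization against a frozen generator G.
    z = torch.randn(1, z_dim, requires_grad=True)
    opt_z = torch.optim.Adam([z], lr=lr)
    for _ in range(z_steps):
        loss = torch.norm(x - G(z))
        opt_z.zero_grad(); loss.backward(); opt_z.step()
    z = z.detach()
    # Phase 2: locally adjust G's weights around this z until ||x - G(z)|| < eps.
    opt_g = torch.optim.Adam(G.parameters(), lr=lr * 0.1)
    for _ in range(g_steps):
        loss = torch.norm(x - G(z))
        if loss.item() < eps:
            break
        opt_g.zero_grad(); loss.backward(); opt_g.step()
    return z, G

# Toy usage with a stand-in generator; in practice G would be e.g. a StyleGAN.
G = torch.nn.Sequential(torch.nn.Linear(512, 64), torch.nn.Tanh())
x = torch.rand(1, 64) * 1.8 - 0.9          # target inside Tanh's output range
z, G = near_perfect_invert(G, x, z_dim=512)
print(torch.norm(x - G(z)).item())
```

Per the abstract, the adjustment to G is local to the neighborhood of z, which is why the reconstructed image remains editable rather than memorized.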
4. [LG] All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL
K Arulkumaran, D R. Ashley, J Schmidhuber, R K. Srivastava
[ARAYA, Inc & The Swiss AI Lab IDSIA & NNAISENSE]
Upside-down RL with supervised learning: from imitation learning to meta-RL. Upside-down reinforcement learning (UDRL) inverts the conventional use of the return as a learning objective in RL by taking returns as input and predicting actions. UDRL is based purely on supervised learning and bypasses some prominent issues in RL: bootstrapping, off-policy corrections, and discount factors. While previous UDRL work operated in the traditional online RL setting, the paper shows that this single algorithm also works in the imitation learning and offline RL settings, extends to the goal-conditioned RL setting, and even to the meta-RL setting. With a general agent architecture, a single UDRL agent can learn across all of these paradigms.
Upside down reinforcement learning (UDRL) flips the conventional use of the return in the objective function in RL upside down, by taking returns as input and predicting actions. UDRL is based purely on supervised learning, and bypasses some prominent issues in RL: bootstrapping, off-policy corrections, and discount factors. While previous work with UDRL demonstrated it in a traditional online RL setting, here we show that this single algorithm can also work in the imitation learning and offline RL settings, be extended to the goal-conditioned RL setting, and even the meta-RL setting. With a general agent architecture, a single UDRL agent can learn across all paradigms.
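To ground the idea, here is a hedged sketch of the UDRL training signal (the architecture and names are invented for illustration; this is not the paper's codebase): a behavior function maps the state plus a command, here (desired return, desired horizon), to an action, and is trained with an ordinary supervised loss on relabeled trajectory steps.

```python
import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        # command = (desired return, desired horizon), concatenated to the state
        self.net = nn.Sequential(nn.Linear(state_dim + 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1))

def train_step(model, opt, episode):
    # Hindsight relabeling: from step t, the observed return-to-go and the
    # remaining horizon become the command; the action taken is the target.
    states, actions, rewards = episode
    T = len(rewards)
    returns_to_go = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    horizons = torch.arange(T, 0, -1, dtype=torch.float32)
    commands = torch.stack([returns_to_go, horizons], dim=-1)
    logits = model(states, commands)
    loss = nn.functional.cross_entropy(logits, actions)  # plain supervised loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random data standing in for one logged episode.
model = BehaviorFunction(state_dim=4, n_actions=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
episode = (torch.randn(10, 4), torch.randint(0, 2, (10,)), torch.rand(10))
print(train_step(model, opt, episode))
```

The same supervised recipe covers the other settings by changing only what fills the dataset and the command: demonstrations for imitation learning, a fixed buffer for offline RL, goals for goal-conditioned RL.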
5. [CV] When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
X Chen, C Hsieh, B Gong
[Google Research & UCLA]
Vision Transformers with sharpness-aware minimization. Vision Transformers (ViTs) and MLPs mark a further step toward replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing work empowers these models with massive data, through large-scale pre-training and/or repeated strong data augmentations, yet still reports optimization-related problems (e.g., sensitivity to initialization and learning rates). The paper therefore studies ViTs and MLP-Mixers through the lens of loss geometry, aiming to improve the models' data efficiency during training and generalization at inference. Visualizations and Hessian analysis reveal extremely sharp local minima in converged models. Promoting smoothness with a recently proposed sharpness-aware optimizer substantially improves the accuracy and robustness of ViTs and MLP-Mixers across tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with simple Inception-style preprocessing). The improved smoothness is attributed to sparser active neurons in the first few layers. The resulting ViTs, trained from scratch on ImageNet without large-scale pre-training or strong data augmentations, outperform ResNets of similar size and throughput, and also possess sharper attention maps.
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations. They also possess more perceptive attention maps.
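For reference, a single sharpness-aware minimization (SAM) step can be sketched in a few lines (illustrative, not the paper's code; rho and the toy model are assumptions): perturb the weights along the normalized gradient to a worst-case point, compute the gradient there, and apply it back at the original weights.

```python
import torch

def sam_step(model, loss_fn, base_opt, rho=0.05):
    # 1) gradient at the current weights w
    loss_fn(model).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    # 2) climb to the worst-case point w + rho * grad / ||grad||
    with torch.no_grad():
        eps = [rho * g / (grad_norm + 1e-12) for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    model.zero_grad()
    # 3) gradient at the perturbed weights, then undo the perturbation and update w
    loss_fn(model).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    base_opt.step(); base_opt.zero_grad()

# Toy usage: one SAM step on a linear regression problem.
model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 3), torch.randn(8, 1)
sam_step(model, lambda m: torch.nn.functional.mse_loss(m(x), y), opt)
```

The perturbation radius rho is the only extra hyperparameter; the reported ViT and MLP-Mixer gains come from this kind of first-order smoothing of the loss surface.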
A few more papers worth noting:
[CV] A Survey of Vision-Language Pre-Trained Models
Y Du, Z Liu, J Li, W X Zhao
[Renmin University of China]
[CV] Learning to Merge Tokens in Vision Transformers
C Renggli, A S Pinto, N Houlsby, B Mustafa, J Puigcerver, C Riquelme
[ETH Zurich & Google Research]
[CL] Welcome to the Modern World of Pronouns: Identity-Inclusive Natural Language Processing beyond Gender
Identity-inclusive natural language processing for pronouns beyond gender
A Lauscher, A Crowley, D Hovy
[Università Luigi Bocconi & University of South Carolina]
[LG] Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast
Molecular contrastive learning with graph neural networks, incorporating cheminformatics and multi-level graph structure
Y Wang, R Magar, C Liang, A B Farimani
[CMU]