LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics
Reposted from 爱可可爱生活
1、[LG] Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
A Kumar, A Raghunathan, R Jones, T Ma, P Liang
[Stanford University]
Fine-tuning can distort pretrained features and underperform out-of-distribution. When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer, the “head”). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, this paper finds that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution-shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR → STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. The authors show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. They prove that the OOD error of fine-tuning is high when initializing with a fixed or random head: while fine-tuning learns the head, the lower layers of the network change simultaneously and distort the pretrained features. The analysis suggests that the simple two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
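A minimal sketch of the two-stage LP-FT recipe in PyTorch, assuming a generic feature backbone (e.g., a torchvision model with its classifier removed) and a standard classification DataLoader; the epoch counts and learning rates below are illustrative, not the paper's settings:

```python
# A minimal LP-FT sketch (hedged: hyperparameters and loop structure are
# illustrative, not the paper's exact protocol).
import torch
import torch.nn as nn

def lp_ft(backbone: nn.Module, head: nn.Linear, loader,
          device="cuda", lp_epochs=10, ft_epochs=10,
          lp_lr=1e-2, ft_lr=1e-4):
    """Stage 1: linear probe (frozen backbone). Stage 2: full fine-tune."""
    criterion = nn.CrossEntropyLoss()
    backbone, head = backbone.to(device), head.to(device)

    # Stage 1: freeze the backbone so the pretrained features cannot be
    # distorted while the randomly initialized head is still bad.
    backbone.eval()  # also freezes e.g. BatchNorm running statistics
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(head.parameters(), lr=lp_lr, momentum=0.9)
    for _ in range(lp_epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(head(backbone(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: unfreeze everything and fine-tune from the probed head,
    # typically at a smaller learning rate so feature drift stays small.
    backbone.train()
    for p in backbone.parameters():
        p.requires_grad_(True)
    params = list(backbone.parameters()) + list(head.parameters())
    opt = torch.optim.SGD(params, lr=ft_lr, momentum=0.9)
    for _ in range(ft_epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(head(backbone(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone, head
```

The point of the probing stage is that the backbone stays frozen while the head is still poor, so early gradients cannot distort the pretrained features; full fine-tuning then starts from a sensible head.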
2、[CV] Efficient Video Instance Segmentation via Tracklet Query and Proposal
J Wu, S Yarram, H Liang, T Lan, J Yuan, J Eledath, G Medioni
[State University of New York at Buffalo & Amazon]
Efficient video instance segmentation via tracklet query and proposal. Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos. Recent clip-level VIS takes a short video clip as input each time and shows stronger performance than frame-level VIS (tracking-by-segmentation), since more temporal context from multiple frames is used. Yet most clip-level methods are neither end-to-end learnable nor real-time. The recent VIS transformer (VisTR) [25] addresses these limitations by performing VIS end-to-end within a clip. However, VisTR suffers from long training time due to its frame-wise dense attention, and it is not fully end-to-end learnable across multiple video clips, as it requires hand-crafted data association to link instance tracklets between successive clips. This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference. At its core are tracklet queries and tracklet proposals that associate and segment regions of interest (RoIs) across space and time through iterative query-video interaction. A correspondence learning scheme further makes tracklet linking between clips end-to-end learnable. Compared to VisTR, EfficientVIS requires 15× fewer training epochs while achieving state-of-the-art accuracy on the YouTubeVIS benchmark, and it segments instances across a whole video in a single end-to-end pass without any data association.
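As a rough illustration of the tracklet-query idea, the toy module below refines a fixed set of per-instance queries by iterated cross-attention over a clip's space-time features and lets the refined queries of one clip seed the next; every size and module choice here is illustrative, not the actual EfficientVIS architecture:

```python
# Toy tracklet-query decoder: learned per-instance queries are refined by
# iterated cross-attention over a clip's space-time features, and the
# refined queries of clip k can seed clip k+1. Sizes and module choices
# are illustrative, not the actual EfficientVIS architecture.
import torch
import torch.nn as nn

class TrackletQueryDecoder(nn.Module):
    def __init__(self, num_tracklets=10, dim=256, iters=3):
        super().__init__()
        self.queries = nn.Embedding(num_tracklets, dim)  # one per tracklet
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.iters = iters

    def forward(self, clip_feats, prev_queries=None):
        # clip_feats: (B, T*H*W, dim) flattened space-time features of a clip.
        B = clip_feats.shape[0]
        q = (prev_queries if prev_queries is not None
             else self.queries.weight.unsqueeze(0).expand(B, -1, -1))
        for _ in range(self.iters):      # iterative query-video interaction
            attn_out, _ = self.attn(q, clip_feats, clip_feats)
            q = q + attn_out             # queries attend to the video
            q = q + self.ffn(q)
        return q  # refined queries; feed as prev_queries for the next clip
```

In a full system each refined query would additionally decode a class label and a space-time mask for its instance; carrying queries across clips is what would replace hand-crafted data association.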
3、[RO] NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields
L Yen-Chen, P Florence, J T. Barron, T Lin, A Rodriguez, P Isola
[MIT & Google & Nvidia]
NeRF-Supervision: learning dense object descriptors from neural radiance fields. Thin, reflective objects such as forks and whisks are common in daily life, but they are particularly challenging for robot perception because they are hard to reconstruct with commodity RGB-D cameras or multi-view stereo techniques. While traditional pipelines struggle with such objects, Neural Radiance Fields (NeRFs) have recently proven remarkably effective for view synthesis on objects with thin structures or reflective materials. This paper explores NeRF as a new source of supervision for robust robot vision systems, demonstrating that a NeRF representation of a scene can be used to train dense object descriptors: an optimized NeRF is used to extract dense correspondences between multiple views of an object, and these correspondences then serve as training data for learning a view-invariant representation of the object. NeRF's density field allows the correspondence problem to be reformulated with a novel distribution-of-depths formulation, as opposed to the conventional approach of using a depth map. Dense correspondence models supervised with this method outperform off-the-shelf learned descriptors by 106% (PCK@3px metric, more than doubling performance) and outperform a baseline supervised with multi-view stereo by 29%. The learned dense descriptors enable robots to perform accurate 6-degree-of-freedom (6-DoF) pick-and-place of thin and reflective objects.
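A hedged sketch of how a trained NeRF's volume-rendering weights could be turned into cross-view pixel correspondences; `depths` and `weights` are assumed to come from the NeRF's samples along one ray, and the hypothetical `w2c2`/`K2` are the second view's 4x4 world-to-camera extrinsics and 3x3 intrinsics:

```python
# Hedged sketch: sampling cross-view correspondences from a trained NeRF.
# `depths`/`weights` are the NeRF's samples and volume-rendering weights
# along one ray of view 1; `w2c2` (4x4) and `K2` (3x3) are hypothetical
# extrinsics/intrinsics of view 2. Illustrative, not the paper's code.
import numpy as np

def sample_correspondence(ray_o, ray_d, depths, weights, K2, w2c2, rng):
    """Draw a depth from the ray's weight distribution and reproject the
    resulting 3D point into a second camera, giving a pixel match."""
    p = weights / weights.sum()           # distribution over depth samples
    t = rng.choice(depths, p=p)           # distribution-of-depths draw
    x_world = ray_o + t * ray_d           # 3D point on the view-1 ray
    x_cam = (w2c2 @ np.append(x_world, 1.0))[:3]  # world -> view-2 camera
    uv = K2 @ x_cam
    return uv[:2] / uv[2]                 # pixel coordinates in view 2

# Example with dummy geometry:
rng = np.random.default_rng(0)
depths = np.linspace(0.5, 3.0, 64)
weights = np.exp(-0.5 * ((depths - 1.2) / 0.05) ** 2)  # density peak at 1.2
uv = sample_correspondence(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                           depths, weights, np.eye(3), np.eye(4), rng)
```

Drawing the depth from the per-ray weight distribution, rather than committing to a single depth-map estimate, captures the spirit of the paper's distribution-of-depths formulation.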
4、[LG] Understanding Failure Modes of Self-Supervised Learning
N M Kalibhat, K Narang, L Tan, H Firooz, M Sanjabi, S Feizi
[University of Maryland & Meta AI]
Understanding failure modes of self-supervised learning. Self-supervised learning methods have shown impressive results in downstream classification tasks, yet there is limited work on understanding their failure modes and interpreting the learned representations of these models. This paper tackles these issues and studies the representation space of self-supervised models by examining the underlying reasons for misclassifications in a downstream task. Across several state-of-the-art self-supervised models, including SimCLR, SwAV, MoCo v2, and BYOL, representations of correctly classified samples have a few discriminative features with highly deviated values compared to other features, in clear contrast with the representations of misclassified samples. Noisy features in the representation space often correspond to spurious attributes in images, making the models less interpretable. Building on these observations, the paper proposes a sample-wise Self-Supervised Representation Quality Score (Q-Score) that, without access to any label information, predicts whether a given sample is likely to be misclassified in the downstream task, achieving an AUPRC of up to 0.90. Q-Score can also be used as a regularizer to remedy low-quality representations, yielding a 3.26% relative accuracy improvement for SimCLR on ImageNet-100. Q-Score regularization also increases representation sparsity, reducing noise and improving interpretability through gradient heatmaps.
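The abstract does not give the exact Q-Score formula, so the snippet below is only a plausible stand-in in the spirit of the stated observation: score a representation highly when a few features deviate strongly from the batch statistics rather than many features deviating diffusely:

```python
# A plausible stand-in for a sample-wise representation quality score, in
# the spirit of the observation above: reward a few strongly deviated
# features over diffuse noise. NOT the paper's exact Q-Score definition.
import torch

def quality_score(z: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """z: (N, D) batch of representations -> (N,) scores; higher means a
    sharper, sparser pattern of feature deviations."""
    dev = (z - z.mean(dim=0)) / (z.std(dim=0) + eps)  # per-feature z-scores
    a = dev.abs()
    # Peak-to-average ratio: large when few features dominate the deviation.
    return a.max(dim=1).values / (a.mean(dim=1) + eps)
```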
5、[LG] Evolving Curricula with Regret-Based Environment Design
J Parker-Holder, M Jiang, M Dennis, M Samvelyan, J Foerster, E Grefenstette, T Rocktäschel
[University of Oxford & UCL & UC Berkeley & Meta AI]
Evolving curricula with regret-based environment design. Training generally capable agents with reinforcement learning (RL) remains a significant challenge. A promising avenue for improving the robustness of RL agents is the use of curricula. One such class of methods frames environment design as a game between a student and a teacher, using regret-based objectives to produce environment instances (or levels) at the frontier of the student agent's capabilities. These methods benefit from their generality, with theoretical guarantees at equilibrium, yet they often struggle to find effective levels in challenging design spaces. By contrast, evolutionary approaches incrementally alter environment complexity, enabling potentially open-ended learning, but often rely on domain-specific heuristics and vast computational resources. This paper proposes to harness the power of evolution in a principled, regret-based curriculum. The approach, Adversarially Compounding Complexity by Editing Levels (ACCEL), seeks to constantly produce levels at the frontier of the agent's capabilities, resulting in curricula that start simple but become increasingly complex. ACCEL maintains the theoretical benefits of prior regret-based methods while providing significant empirical gains across a diverse set of environments. An interactive version of the paper is available at accelagent.github.io.
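A self-contained toy of an ACCEL-style loop, with a scalar "skill" standing in for the student agent and a difficulty number standing in for a level; the regret estimate, learning update, and edit operator are all placeholders for the paper's domain-specific components:

```python
import random

# Fully self-contained toy of an ACCEL-style loop. A "level" is just a
# difficulty number, the "agent" a scalar skill, and "regret" the gap
# between level difficulty and current skill -- crude stand-ins for the
# paper's regret estimates in a real environment. Illustrative only.

def estimate_regret(skill, level):
    # High when the level sits at the frontier: hard but not impossible.
    gap = level - skill
    return gap if 0 <= gap <= 1.0 else 0.0

def accel_toy(steps=500, buffer_size=50, p_replay=0.8, seed=0):
    rng = random.Random(seed)
    skill = 0.0
    buffer = []  # (level_difficulty, regret) pairs
    for _ in range(steps):
        if buffer and rng.random() < p_replay:
            # Replay the highest-regret level and "train" on it.
            buffer.sort(key=lambda lr: lr[1], reverse=True)
            level, _ = buffer[0]
            skill += 0.01 * max(0.0, level - skill)  # toy learning update
            # Edit the replayed level and re-score the child.
            child = max(0.0, level + rng.uniform(-0.1, 0.3))
            buffer.append((child, estimate_regret(skill, child)))
        else:
            # Otherwise sample a fresh, simple level.
            level = rng.uniform(0.0, 0.5)
            buffer.append((level, estimate_regret(skill, level)))
        # Re-score against the current agent and keep the best levels.
        buffer = sorted(((l, estimate_regret(skill, l)) for l, _ in buffer),
                        key=lambda lr: lr[1], reverse=True)[:buffer_size]
    return skill

print(f"final toy skill: {accel_toy():.2f}")
```

Editing replayed levels makes difficulty compound over time, so the curriculum starts simple and grows with the agent, which is the core mechanism the abstract describes.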
A few more papers worth noting:
[CL] HyperPrompt: Prompt-based Task-Conditioning of Transformers
Y He, H S Zheng, Y Tay, J Gupta, Y Du, V Aribandi, Z Zhao, Y Li, Z Chen, D Metzler, H Cheng, E H. Chi
[Google Research & Waymo LLC]
[LG] On the Benefits of Large Learning Rates for Kernel Methods
G Beugnot, J Mairal, A Rudi
[PSL Research University & Univ. Grenoble Alpes]
[CL] Saving Dense Retriever from Shortcut Dependency in Conversational Search
S Kim, G Kim
[NAVER AI Lab & Korea University]
[LG] Wide Mean-Field Bayesian Neural Networks Ignore the Data
B Coker, W P. Bruinsma, D R. Burt, W Pan, F Doshi-Velez
[Harvard University & University of Cambridge]