爱可可 AI Frontier Picks (5.16)

LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics

Reposted from 爱可可爱生活

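Summary: 3D human model fitting with learned vertex descent; the mechanism of the prediction head in non-contrastive self-supervised learning; a sparsely activated approach for text, sound, image, video and code; on the verge of solving Rocket League with deep reinforcement learning and sim-to-sim transfer; multi-class 3D object detection with single-class supervision; UniMorph 4.0 universal morphology; locally smoothed embedding mixtures for multi-interest candidate retrieval; restoring and colorizing old photos; open-vocabulary extreme classification with generative models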

1、[CV] Learned Vertex Descent: A New Direction for 3D Human Model Fitting

E Corona, G Pons-Moll, G Alenyà, F Moreno-Noguer

[CSIC-UPC & University of Tübingen]

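Learned Vertex Descent: a new direction for 3D human model fitting. The paper proposes a new optimization-based paradigm for fitting 3D human models to images and scans. Rather than regressing the parameters of a low-dimensional statistical body model (such as SMPL) directly from the input image, as existing methods do, it trains an ensemble of per-vertex neural field networks. Working in a distributed manner, the network predicts each vertex's descent direction toward the ground truth from neural features extracted at the vertex's current projection. At inference, this network, called LVD, is run inside a gradient-descent optimization loop until convergence, which typically takes a fraction of a second even when all vertices are initialized at a single point. Extensive evaluation shows that the method recovers the underlying body of clothed people across very different body shapes and improves significantly on the state of the art. LVD also applies to fitting 3D models of humans and hands, where it improves significantly on SOTA with a much simpler and faster method.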

We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g. SMPL) from input images, we train an ensemble of per-vertex neural fields network. The network predicts, in a distributed manner, the vertex descent direction towards the ground truth, based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until its convergence, which typically occurs in a fraction of a second even when initializing all vertices into a single point. An exhaustive evaluation demonstrates that our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement compared to state-of-the-art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement to the SOTA with a much simpler and faster method.

https://arxiv.org/abs/2205.06254
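The inference loop described in the abstract (start all vertices at one point, repeatedly ask a predictor for a per-vertex step, stop at convergence) can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_descent_predictor` and `fit_vertices` are hypothetical names, and the toy predictor simply looks at the target position, whereas the real LVD network predicts the step from image features and never sees the ground truth.

```python
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def toy_descent_predictor(targets: List[Vec3], step_scale: float = 0.5) -> Callable[[int, Vec3], Vec3]:
    """Hypothetical stand-in for the learned network.

    It peeks at the target and returns a scaled step toward it. The real LVD
    network has no such access; it predicts the step from image features.
    """
    def predict(index: int, position: Vec3) -> Vec3:
        target = targets[index]
        return tuple(step_scale * (target[k] - position[k]) for k in range(3))
    return predict


def fit_vertices(
    predict: Callable[[int, Vec3], Vec3],
    num_vertices: int,
    init: Vec3 = (0.0, 0.0, 0.0),
    max_iters: int = 100,
    tol: float = 1e-6,
) -> Tuple[List[Vec3], int]:
    """Move every vertex along its predicted step until the steps become tiny."""
    vertices: List[Vec3] = [init] * num_vertices  # all vertices start at one point
    iterations_run = 0
    for iterations_run in range(1, max_iters + 1):
        largest_step = 0.0
        updated: List[Vec3] = []
        for i, v in enumerate(vertices):
            d = predict(i, v)
            updated.append(tuple(v[k] + d[k] for k in range(3)))
            largest_step = max(largest_step, max(abs(c) for c in d))
        vertices = updated
        if largest_step < tol:  # converged: no vertex moved noticeably
            break
    return vertices, iterations_run
```

Calling `fit_vertices(toy_descent_predictor(targets), len(targets))` returns the fitted vertices together with the number of iterations used. Swapping the toy predictor for a trained model would leave the loop unchanged.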

2、[LG] The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

Z Wen, Y Li

[CMU]

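The mechanism of the prediction head in non-contrastive self-supervised learning. The surprising finding of Grill et al. in the Bootstrap Your Own Latent (BYOL) method showed that the negative term in the contrastive loss can be removed by adding a so-called prediction head to the network architecture, which breaks the symmetry between positive pairs. This opened up research on non-contrastive self-supervised learning. What remains mysterious is why neural networks trained by (stochastic) gradient descent still learn competitive representations and avoid collapsed solutions even though trivial collapsed global optima exist. The phenomenon is one of the most typical examples of implicit bias in deep learning optimization, yet its underlying mechanism is still poorly understood. The paper presents empirical and theoretical findings on the mechanism of the prediction head in non-contrastive self-supervised methods. Empirically, when the prediction head is initialized as an identity matrix and only its off-diagonal entries are trained, the network learns competitive representations even though the trivial optima remain in the training objective. The off-diagonal entries also follow a consistent rise-and-fall trajectory during training. This evidence suggests that understanding the identity-initialized prediction head is a good starting point for understanding the trainable prediction head. Theoretically, the paper gives a framework for the behavior of a trainable but identity-initialized prediction head and, in a simple setting, characterizes its substitution effect and acceleration effect during training. The substitution effect occurs when learning the stronger features in some neurons can, through updates to the prediction head, substitute for learning those features in other neurons. The acceleration effect occurs when the substituted features speed up the learning of other, weaker features so that they are not ignored. Together the two effects let the network learn all the features instead of focusing only on the stronger ones, a focus that is likely the cause of dimensional collapse.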

Recently the surprising discovery of Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network architecture, which breaks the symmetry between the positive pairs. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when trivial collapsed global optimal solutions exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations and avoid collapsed solutions. This phenomenon is one of the most typical examples of implicit bias in deep learning optimization, and its underlying mechanism remains little understood to this day. In this work, we present our empirical and theoretical discoveries about the mechanism of prediction head in non-contrastive self-supervised learning methods. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trained, the network can learn competitive representations even though the trivial optima still exist in the training objective. Moreover, we observe a consistent rise and fall trajectory of off-diagonal entries during training. Our evidence suggests that understanding the identity-initialized prediction head is a good starting point for understanding the mechanism of the trainable prediction head. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head during the training process. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects together enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.

https://arxiv.org/abs/2205.06226
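The empirical setup in the abstract (a prediction head initialized as the identity, with only its off-diagonal entries trained) can be illustrated with a single masked gradient step. This is a rough sketch under stated assumptions, not the paper's training procedure: the head is a plain linear map, the loss is a squared error 0.5 * ||W z_online - z_target||^2 with the target treated as a constant (a simplification swapped in for the normalized losses used by BYOL-style methods), and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    """Identity initialization for an n x n linear prediction head."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matvec(W: Matrix, z: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def off_diagonal_step(W: Matrix, z_online: List[float], z_target: List[float], lr: float = 0.1) -> Matrix:
    """One gradient step on 0.5 * ||W z_online - z_target||^2.

    The target is treated as a constant (stop-gradient), and diagonal
    entries are frozen so that only off-diagonal entries are trained.
    """
    residual = [p - t for p, t in zip(matvec(W, z_online), z_target)]
    n = len(W)
    return [
        [
            W[i][j] if i == j else W[i][j] - lr * residual[i] * z_online[j]
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Starting from `identity(n)` and applying `off_diagonal_step` repeatedly keeps the diagonal at 1 while the off-diagonal entries absorb all of the learning signal, which is the quantity whose rise-and-fall trajectory the abstract refers to.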

3、[CL] One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code

Y Dai, D Tang, L Liu, M Tan, C Zhou, J Wang, Z Feng, F Zhang, X Hu, S Shi

[Tencent AI Lab]

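One model, multiple modalities: a sparsely activated approach for text, sound, image, video and code. People perceive the world through multiple senses (for example, hearing sounds, reading words and seeing objects), yet most existing AI systems handle only a single modality. The paper proposes handling multiple modalities with a single model. In the proposed "SkillNet" model, different parts of the parameters are specialized for different modalities. Unlike conventional dense models, which always activate all parameters, SkillNet sparsely activates only the parts of the parameters whose skills are relevant to the task, a design that lets it learn skills in a more interpretable way. The model is developed for five modalities: text, image, sound, video and code. Results show that SkillNet performs comparably to five modality-specific fine-tuned models. The model also supports self-supervised pretraining in the same sparsely activated manner, which yields better initial parameters for the different modalities. Pretraining substantially improves SkillNet's performance on all five modalities, matching or exceeding baselines with modality-specific pretraining. On Chinese text-to-image retrieval, the final system achieves higher accuracy than existing leading systems, including WukongViT-B and Wenlan 2.0, while using fewer activated parameters.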

People perceive the world with multiple senses (e.g., through hearing sounds, reading words and seeing objects). However, most existing AI systems only process an individual modality. This paper presents an approach that excels at handling multiple modalities of information with a single model. In our “SkillNet” model, different parts of the parameters are specialized for processing different modalities. Unlike traditional dense models that always activate all the model parameters, our model sparsely activates parts of the parameters whose skills are relevant to the task. Such model design enables SkillNet to learn skills in a more interpretable way. We develop our model for five modalities including text, image, sound, video and code. Results show that SkillNet performs comparably to five modality-specific fine-tuned models. Moreover, our model supports self-supervised pretraining in the same sparsely activated way, resulting in better initialized parameters for different modalities. We find that pretraining significantly improves the performance of SkillNet on five modalities, on par with or even better than baselines with modality-specific pretraining. On the task of Chinese text-to-image retrieval, our final system achieves higher accuracy than existing leading systems including WukongViT-B and Wenlan 2.0 while using fewer activated parameters.

https://arxiv.org/abs/2205.06126
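The contrast between dense and sparse activation can be made concrete with a toy example. The sketch below is only an illustration of the idea stated in the abstract; the parameter groups, the task-to-skill table, and the function name `sparse_forward` are all assumptions, and the actual SkillNet architecture is not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter groups, one per modality "skill".
SKILL_PARAMS: Dict[str, List[float]] = {
    "text": [0.5, -0.5],
    "image": [1.0, 0.25],
    "sound": [0.1, 0.2],
    "video": [0.3, 0.4],
    "code": [0.7, -0.1],
}

# Hypothetical mapping from a task to the skills it needs.
TASK_SKILLS: Dict[str, List[str]] = {
    "text_to_image_retrieval": ["text", "image"],
    "code_search": ["text", "code"],
}


def sparse_forward(task: str, features: List[float]) -> Tuple[float, int]:
    """Use only the parameter groups relevant to the task.

    Returns the output and the number of parameters that were activated.
    """
    active = TASK_SKILLS[task]
    output = 0.0
    for skill in active:
        weights = SKILL_PARAMS[skill]  # inactive groups are never touched
        output += sum(w * x for w, x in zip(weights, features))
    activated = sum(len(SKILL_PARAMS[s]) for s in active)
    return output, activated
```

In this toy setup a text-to-image retrieval call touches 4 of the 10 parameters, whereas a dense model would touch all 10 on every call.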

4、[LG] On the Verge of Solving Rocket League using Deep Reinforcement Learning and Sim-to-sim Transfer

M Pleines, K Ramthun...

[TU Dortmund University & Rhine-Waal University of Applied Sciences & LIACS Universiteit Leiden]

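On the verge of solving Rocket League with deep reinforcement learning and sim-to-sim transfer. Autonomously trained agents that are expected to play video games reasonably well rely either on fast simulation speeds or on heavy parallelization across thousands of machines running concurrently. The paper explores a third route that is established in robotics, sim-to-real transfer, or, if the game itself is regarded as a simulation, sim-to-sim transfer. For Rocket League, it shows that individual goalkeeper and striker behaviors can be learned successfully with deep reinforcement learning in a simulation environment and transferred back to the original game. Although the training simulation is inaccurate to some extent, the goalkeeping agent saves nearly 100% of the shots it faces after transfer, and the striking agent scores in about 75% of cases. The trained agents are therefore robust enough to generalize to the target domain of Rocket League.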

Autonomously trained agents that are supposed to play video games reasonably well rely either on fast simulation speeds or heavy parallelization across thousands of machines running concurrently. This work explores a third way that is established in robotics, namely sim-to-real transfer, or if the game is considered a simulation itself, sim-to-sim transfer. In the case of Rocket League, we demonstrate that single behaviors of goalies and strikers can be successfully learned using Deep Reinforcement Learning in the simulation environment and transferred back to the original game. Although the implemented training simulation is to some extent inaccurate, the goalkeeping agent saves nearly 100% of its faced shots once transferred, while the striking agent scores in about 75% of cases. Therefore, the trained agent is robust enough and able to generalize to the target domain of Rocket League.

https://arxiv.org/abs/2205.05061

5、[CV] Multi-Class 3D Object Detection with Single-Class Supervision

M Ye, C Liu, M Yao, W Wang, Z Leng, C R. Qi, D Anguelov

[The University of Texas at Austin & Waymo]

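Multi-class 3D object detection with single-class supervision. Many robotics applications need multi-class 3D detectors, but training them on fully labeled datasets is expensive in labeling cost. An alternative is to collect targeted single-class labels on disjoint data samples. The paper studies how to train a multi-class 3D object detection model from such single-class labeled data. It first details where the "Single-Class Supervision" (SCS) setting stands relative to related concepts such as partial supervision and semi-supervision. Then, using the training of a multi-class version of Range Sparse Net (RSN) as a case study, it adapts a spectrum of algorithms, from supervised learning to pseudo-labeling, to fully exploit the properties of the SCS setting, and runs extensive ablation studies to identify the most effective algorithms and practices. Experiments on the Waymo Open Dataset show that proper training under SCS can approach or match fully supervised training while saving labeling cost.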

While multi-class 3D detectors are needed in many robotics applications, training them with fully labeled datasets can be expensive in labeling cost. An alternative approach is to have targeted single-class labels on disjoint data samples. In this paper, we are interested in training a multi-class 3D object detection model, while using these single-class labeled data. We begin by detailing the unique stance of our “Single-Class Supervision” (SCS) setting with respect to related concepts such as partial supervision and semi-supervision. Then, based on the case study of training the multi-class version of Range Sparse Net (RSN), we adapt a spectrum of algorithms, from supervised learning to pseudo-labeling, to fully exploit the properties of our SCS setting, and perform extensive ablation studies to identify the most effective algorithm and practice. Empirical experiments on the Waymo Open Dataset show that proper training under SCS can approach or match full supervision training while saving labeling costs.

https://arxiv.org/abs/2205.05703
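One end of the algorithm spectrum mentioned in the abstract is pseudo-labeling: a sample carries human labels for only one class, and confident model predictions fill in the other classes. The sketch below shows a generic version of that idea, assuming a simplified box type, a fixed confidence threshold of 0.8, and a hypothetical function name `merge_labels`; it is not the specific variant evaluated in the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float]        # simplified box: center x, y, z (assumption)
Prediction = Tuple[str, float, Box]     # (class name, confidence score, box)
Target = Tuple[str, Box]                # (class name, box)


def merge_labels(
    labeled_class: str,
    gt_boxes: List[Box],
    predictions: List[Prediction],
    threshold: float = 0.8,
) -> List[Target]:
    """Build training targets for one single-class-labeled sample.

    Human labels are trusted for the labeled class. For every other class,
    model predictions at or above the threshold are used as pseudo-labels.
    """
    targets: List[Target] = [(labeled_class, box) for box in gt_boxes]
    for cls, score, box in predictions:
        # Predictions for the labeled class are dropped: ground truth exists.
        if cls != labeled_class and score >= threshold:
            targets.append((cls, box))
    return targets
```

Low-confidence predictions are discarded rather than treated as background, which is one of several possible design choices in this setting.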

Other papers worth noting:

[CL] UniMorph 4.0: Universal Morphology

K Batsuren, O Goldman, S Khalifa, N Habash...

[National University of Mongolia & Bar-Ilan University & Johns Hopkins University & University of Trento...]

https://arxiv.org/abs/2205.03608

[IR] kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval

A El-Kishky, T Markovich, K Leung, F Portman, A Haghighi

[Twitter Cortex]

https://arxiv.org/abs/2205.06205

[CV] Pik-Fix: Restoring and Colorizing Old Photos

R Xu, Z Tu, Y Du, X Dong, J Li, Z Meng...

[UCLA & UT-Austin & George Mason University & Northwestern University & Cleveland State University & Innopeak Technology Inc] (2022)

https://arxiv.org/abs/2205.01902