LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: discovering and explaining the representation bottleneck of DNNs; expressiveness and approximation properties of graph neural networks; single-view reconstruction by cross-instance consistency; representation collapse of sparse mixture-of-experts; obtaining skeletal shape from the outside; detecting deepfakes with self-blended images; end-to-end speech translation for code-switched speech; spatio-temporal video grounding with Transformers; modeling non-deterministic dyadic facial motion
1. [LG] Discovering and Explaining the Representation Bottleneck of DNNs
H Deng, Q Ren, H Zhang, Q Zhang
[Shanghai Jiao Tong University]
This paper studies a feature-representation bottleneck of deep neural networks (DNNs) through the complexity of the interactions between input variables that a network encodes, measured as multi-order interactions whose order reflects an interaction's complexity. Across diverse DNNs and tasks, a network tends to encode both overly simple and overly complex interactions while usually failing to learn interactions of intermediate complexity; this cognition gap between DNNs and humans is termed the representation bottleneck. The authors theoretically prove its underlying cause, propose a loss that encourages or penalizes learning interactions of specific complexities, and analyze the representation capacities of interactions at different complexities.
This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple interactions and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and human beings, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck. Furthermore, we propose a loss to encourage/penalize the learning of interactions of specific complexities, and analyze the representation capacities of interactions of different complexities.
https://arxiv.org/abs/2111.06236
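The multi-order interaction the paper builds on assigns the pair (i, j) an order-m score by averaging the marginal effect f(S ∪ {i,j}) − f(S ∪ {i}) − f(S ∪ {j}) + f(S) over contexts S of size m. A minimal Monte-Carlo sketch of that estimator, assuming a black-box scoring function f defined on subsets of variable indices (the subset interface and sampling scheme are illustrative, not the paper's code):

```python
import random

def multi_order_interaction(f, n, i, j, m, n_samples=100, seed=0):
    """Monte-Carlo estimate of the order-m interaction I^(m)(i, j):
    the average of f(S+{i,j}) - f(S+{i}) - f(S+{j}) + f(S) over
    contexts S of size m drawn from the remaining n-2 variables."""
    rng = random.Random(seed)
    rest = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for _ in range(n_samples):
        s = frozenset(rng.sample(rest, m))
        total += f(s | {i, j}) - f(s | {i}) - f(s | {j}) + f(s)
    return total / n_samples
```

Sweeping m from 0 to n-2 yields the interaction-strength curve over orders; the paper's finding is that DNNs concentrate strength at both extremes of m and leave the middle orders weak.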
2. [LG] Expressiveness and Approximation Properties of Graph Neural Networks
F Geerts, J L. Reutter
[University of Antwerp & Pontificia Universidad Catolica de Chile]
Characterizing the separation power of graph neural networks (GNNs) reveals their limitations on graph learning tasks, but existing results are usually tied to specific architectures, and tools for analyzing arbitrary GNNs are lacking. This paper offers an elegant way to bound the separation power of GNNs in terms of the Weisfeiler-Leman (WL) tests, the standard yardstick: view each GNN as an expression in a procedural tensor language describing the computations in its layers; a simple analysis of that expression, counting the number of indexes and the nesting depth of summations, then readily yields WL-based bounds. The tensor language is also used to define higher-order message-passing neural networks (k-MPNNs), a natural extension of MPNNs, and naturally yields universality results for classes of GNNs. The approach gives architecture designers a toolbox for analyzing the separation power of their GNNs without needing to know the intricacies of the WL tests.
Characterizing the separation power of graph neural networks (GNNs) provides an understanding of their limitations for graph learning tasks. Results regarding separation power are, however, usually geared at specific GNN architectures, and tools for understanding arbitrary GNN architectures are generally lacking. We provide an elegant way to easily obtain bounds on the separation power of GNNs in terms of the Weisfeiler-Leman (WL) tests, which have become the yardstick to measure the separation power of GNNs. The crux is to view GNNs as expressions in a procedural tensor language describing the computations in the layers of the GNNs. Then, by a simple analysis of the obtained expressions, in terms of the number of indexes and the nesting depth of summations, bounds on the separation power in terms of the WL-tests readily follow. We use tensor language to define Higher-Order Message-Passing Neural Networks (or k-MPNNs), a natural extension of MPNNs. Furthermore, the tensor language point of view allows for the derivation of universality results for classes of GNNs in a natural way. Our approach provides a toolbox with which GNN architecture designers can analyze the separation power of their GNNs, without needing to know the intricacies of the WL-tests. We also provide insights into what is needed to boost the separation power of GNNs.
https://arxiv.org/abs/2204.04661
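The paper's viewpoint is that a GNN layer is an expression in a tensor language, and that the number of indexes and the nesting depth of summations in that expression bound its separation power relative to WL tests. As an illustration, here is a generic message-passing layer written so its single summation over the neighbor index is explicit (a standard GCN-style layer chosen for illustration, not an architecture from the paper):

```python
import numpy as np

def mpnn_layer(A, H, W_msg, W_self):
    """One message-passing layer as a tensor-language expression:
        H'[v, j] = relu( sum_w A[v, w] * (H @ W_msg)[w, j] + (H @ W_self)[v, j] )
    The expression has two free indexes (v, j) and one nested summation
    (over the neighbor index w, inside the matrix products); it is this
    kind of count that the paper's analysis turns into a WL bound."""
    return np.maximum(A @ (H @ W_msg) + H @ W_self, 0.0)
```

Higher-order layers (k-MPNNs) would carry tensors indexed by k-tuples of vertices and deeper summation nesting, which is exactly what pushes the corresponding WL bound up.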
3. [CV] Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency
T Monnier, M Fisher, A A. Efros, M Aubry
[Univ Gustave Eiffel & Adobe Research & UC Berkeley]
Single-view reconstruction methods typically rely on viewpoint annotations, silhouettes, background removal, multiple views of the same instance, template shapes, or symmetry. This work avoids all of these supervisions and assumptions by explicitly exploiting consistency between images of different instances of the same object category, so it can learn from large collections of unlabeled images. Its two key mechanisms are (i) progressive conditioning, a curriculum-style training strategy that gradually specializes the model from category to instance, and (ii) swap reconstruction, a loss enforcing consistency between instances with similar shape or texture. Also critical to its success are a structured autoencoding architecture that decomposes an image into explicit shape, texture, pose, and background; an adapted differentiable-rendering formulation; and a new optimization scheme alternating between 3D and pose learning. The resulting method, UNICORN, is compared on diverse synthetic ShapeNet datasets and standard real-image benchmarks (Pascal3D+ Car, CUB-200), and its applicability is demonstrated on more challenging real-world collections (CompCars, LSUN), producing high-quality results across varied shapes and challenging real-world image sets.
Approaches to single-view reconstruction typically rely on viewpoint annotations, silhouettes, the absence of background, multiple views of the same instance, a template shape, or symmetry. We avoid all of these supervisions and hypotheses by leveraging explicitly the consistency between images of different object instances. As a result, our method can learn from large collections of unlabelled images depicting the same object category. Our main contributions are two approaches to leverage cross-instance consistency: (i) progressive conditioning, a training strategy to gradually specialize the model from category to instances in a curriculum learning fashion; (ii) swap reconstruction, a loss enforcing consistency between instances having similar shape or texture. Critical to the success of our method are also: our structured autoencoding architecture decomposing an image into explicit shape, texture, pose, and background; an adapted formulation of differential rendering; and a new optimization scheme alternating between 3D and pose learning. We compare our approach, UNICORN, both on the diverse synthetic ShapeNet dataset, the classical benchmark for methods requiring multiple views as supervision, and on standard real-image benchmarks (Pascal3D+ Car, CUB-200) for which most methods require known templates and silhouette annotations. We also showcase applicability to more challenging real-world collections (CompCars, LSUN), where silhouettes are not available and images are not cropped around the object.
https://arxiv.org/abs/2204.10310
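Swap reconstruction, at its core, exchanges one factor (e.g. texture) between two instances assumed to look alike and still requires each render to match its original image. A toy sketch, with a stand-in render function and code dictionaries that are purely illustrative (the real method uses the paper's structured autoencoder and differentiable renderer, and picks swap partners by code similarity):

```python
import numpy as np

def swap_reconstruction_loss(render, codes_a, codes_b, img_a, img_b):
    """Hypothetical sketch of a UNICORN-style swap loss. codes_* are dicts
    with 'shape', 'texture', 'pose', 'background' entries. The texture
    codes are swapped between two similar-looking instances, and each
    swapped render must still reconstruct the original image."""
    swapped_a = dict(codes_a, texture=codes_b["texture"])
    swapped_b = dict(codes_b, texture=codes_a["texture"])
    loss_a = np.mean((render(swapped_a) - img_a) ** 2)
    loss_b = np.mean((render(swapped_b) - img_b) ** 2)
    return loss_a + loss_b
```

If the swap partners truly share the swapped factor, the loss is small; otherwise the gradient pushes the factored codes toward cross-instance consistency.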
4. [CL] On the Representation Collapse of Sparse Mixture of Experts
Z Chi, L Dong, S Huang, D Dai, S Ma, B Patra, S Singhal, P Bajaj, X Song, F Wei
[Microsoft Corporation]
Sparse mixture-of-experts models provide larger capacity at a constant computational overhead by routing each input token to its best-matched expert according to the token's hidden representation. Learning such a routing mechanism, however, encourages tokens to cluster around expert centroids, implying a trend toward representation collapse. This paper proposes estimating the routing scores between tokens and experts on a low-dimensional hypersphere. Extensive experiments on cross-lingual language-model pre-training and downstream fine-tuning across seven multilingual benchmarks show consistent gains, and a comprehensive analysis of the model's representations and routing behavior shows that the method alleviates representation collapse and achieves more consistent routing than baseline mixture-of-experts methods.
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
https://arxiv.org/abs/2204.09179
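The proposed routing can be pictured as projecting token representations to a low dimension, L2-normalizing both tokens and expert embeddings (placing them on a hypersphere), and scoring by scaled cosine similarity. A minimal numpy sketch, where the projection matrix, temperature value, and function names are assumptions rather than the paper's exact parametrization:

```python
import numpy as np

def route_on_hypersphere(tokens, expert_emb, W_proj, tau=0.07):
    """Score tokens against experts on a low-dimensional hypersphere.
    tokens: (n, d_model); W_proj: (d_model, d_low); expert_emb: (E, d_low).
    Both projected tokens and expert embeddings are L2-normalized, so the
    score is cosine similarity divided by a temperature tau."""
    z = tokens @ W_proj
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    e = expert_emb / np.linalg.norm(expert_emb, axis=-1, keepdims=True)
    scores = (z @ e.T) / tau
    return scores.argmax(-1), scores
```

Because scores depend only on direction, not norm, tokens no longer shrink toward expert centroids in the full hidden space, which is the intuition behind the collapse mitigation.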
5. [CV] OSSO: Obtaining Skeletal Shape from Outside
M Keller, S Zuffi, M J. Black, S Pujades
[Max Planck Institute for Intelligent Systems & IMATI-CNR & Universite Grenoble Alpes]
This paper addresses inferring a person's anatomical skeleton, in an arbitrary pose, from the 3D surface of the body, predicting the inside (bones) from the outside (skin), with many applications in medicine and biomechanics. Existing state-of-the-art biomechanical skeletons are detailed but do not generalize easily to new subjects, while computer vision and graphics methods for predicting skeletons are typically heuristic, not learned from data, do not exploit the full 3D body surface, and are not validated against ground truth. The proposed system, OSSO (Obtaining Skeletal Shape from Outside), is the first to learn the mapping from the 3D body surface to the internal skeleton from real data, using dual-energy X-ray absorptiometry (DXA) scans of 1000 men and 1000 women. A parametric 3D body-shape model (STAR) is fit to capture the body surface and a novel part-based 3D skeleton model is fit to capture the bones, providing inside/outside training pairs. The statistical variation of full skeletons is modeled with PCA in a pose-normalized space; a regressor is trained from body-shape parameters to skeleton-shape parameters, and the skeleton is refined to satisfy physical-plausibility constraints. Given an arbitrary 3D body shape and pose, OSSO predicts a realistic skeleton inside it. In contrast to previous work, skeleton-shape accuracy is evaluated quantitatively on held-out DXA scans, outperforming the state of the art, and 3D skeleton prediction is demonstrated on varied and challenging 3D bodies.
We address the problem of inferring the anatomic skeleton of a person, in an arbitrary pose, from the 3D surface of the body; i.e. we predict the inside (bones) from the outside (skin). This has many applications in medicine and biomechanics. Existing state-of-the-art biomechanical skeletons are detailed but do not easily generalize to new subjects. Additionally, computer vision and graphics methods that predict skeletons are typically heuristic, not learned from data, do not leverage the full 3D body surface, and are not validated against ground truth. To our knowledge, our system, called OSSO (Obtaining Skeletal Shape from Outside), is the first to learn the mapping from the 3D body surface to the internal skeleton from real data. We do so using 1000 male and 1000 female dual-energy X-ray absorptiometry (DXA) scans. To these, we fit a parametric 3D body shape model (STAR) to capture the body surface and a novel part-based 3D skeleton model to capture the bones. This provides inside/outside training pairs. We model the statistical variation of full skeletons using PCA in a pose-normalized space. We then train a regressor from body shape parameters to skeleton shape parameters and refine the skeleton to satisfy constraints on physical plausibility. Given an arbitrary 3D body shape and pose, OSSO predicts a realistic skeleton inside. In contrast to previous work, we evaluate the accuracy of the skeleton shape quantitatively on held-out DXA scans, outperforming the state-of-the-art. We also show 3D skeleton prediction from varied and challenging 3D bodies. The code to infer a skeleton from a body shape is available for research at this https URL, and the dataset of paired outer surface (skin) and skeleton (bone) meshes is available as a Biobank Returned Dataset. This research has been conducted using the UK Biobank Resource.
https://arxiv.org/abs/2204.10129
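The statistical core of the pipeline, a PCA skeleton space plus a least-squares regressor from body-shape parameters to skeleton coefficients, can be sketched in a few lines of numpy on synthetic data; the real system operates on mesh vertices with STAR parameters, pose normalization, and physical-plausibility refinement, so everything below is a toy stand-in:

```python
import numpy as np

def fit_pca(X, k):
    """PCA via SVD on centered data: returns the mean and top-k components."""
    mu = X.mean(0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def fit_regressor(B, C):
    """Least-squares map (with bias) from body-shape params B to PCA coeffs C."""
    W, *_ = np.linalg.lstsq(np.c_[B, np.ones(len(B))], C, rcond=None)
    return W

def predict_skeleton(beta, W, mu, comps):
    """Body-shape params -> skeleton PCA coefficients -> skeleton vector."""
    coeffs = np.append(beta, 1.0) @ W
    return mu + coeffs @ comps
```

On training pairs where the skeleton really is a linear function of body shape, this recovers the mapping exactly; OSSO's refinement stage then handles what such a linear model misses.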
A few more papers worth noting:
[CV] Detecting Deepfakes with Self-Blended Images
K Shiohara, T Yamasaki
[The University of Tokyo]
https://arxiv.org/abs/2204.08376
[CL] End-to-End Speech Translation for Code Switched Speech
O Weller, M Sperber, T Pires, H Setiawan, C Gollan, D Telaar, M Paulik
[Johns Hopkins University & Apple]
https://arxiv.org/abs/2204.05076
[CV] TubeDETR: Spatio-Temporal Video Grounding with Transformers
A Yang, A Miech, J Sivic, I Laptev, C Schmid
[Inria Paris & DeepMind & CIIRC CTU Prague]
https://arxiv.org/abs/2203.16434
[CV] Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
E Ng, H Joo, L Hu, H Li, T Darrell, A Kanazawa, S Ginosar
[UC Berkeley & Seoul National University & Pinscreen]
https://arxiv.org/abs/2204.08451