LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: generating long videos of dynamic scenes; neural motion fields for kinematic animation; masked unsupervised self-training for zero-shot image classification; few-shot learning by dimensionality reduction in gradient space; the impossibility of collective intelligence; an efficient Transformer for automatic speech recognition; low-resource natural language research with Creole languages as a case study; a continuous-time framework for discrete denoising models; OOD link-prediction generalization of message-passing GNNs on larger test graphs
1. [CV] Generating Long Videos of Dynamic Scenes
T Brooks, J Hellsten, M Aittala, T Wang, T Aila, J Lehtinen, M Liu, A A. Efros, T Karras
[NVIDIA & UC Berkeley]
We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. At the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. To this end, we leverage a two-phase training strategy: we separately train on longer videos at a low resolution and on shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with an explicit focus on long-term temporal dynamics.
https://arxiv.org/abs/2206.03429
2. [CV] NeMF: Neural Motion Fields for Kinematic Animation
C He, J Saito, J Zachary, H Rushmeier, Y Zhou
[Yale University & Adobe Research]
We present an implicit neural representation to learn the spatio-temporal space of kinematic motions. Unlike previous work that represents motion as discrete sequential samples, we propose to express the vast motion space as a continuous function over time, hence the name Neural Motion Fields (NeMF). Specifically, we use a neural network to learn this function for miscellaneous sets of motions; it is designed as a generative model conditioned on a temporal coordinate t and a random vector z that controls the style. The model is then trained as a Variational Autoencoder (VAE) with motion encoders to sample the latent space. We train our model on diverse human and quadruped motion datasets to demonstrate its versatility, and finally deploy it as a generic motion prior to solve task-agnostic problems, showing its advantages in different motion generation and editing applications such as motion interpolation, in-betweening, and re-navigating.
https://arxiv.org/abs/2206.03287
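The core idea above — a pose expressed as a continuous function of a time coordinate t and a style code z — can be sketched with a tiny randomly initialized network. This is a conceptual stand-in, not the authors' architecture; the dimensions and weights here are made up for illustration:

```python
import math
import random

random.seed(0)

POSE_DIM, Z_DIM, HIDDEN = 4, 3, 8  # illustrative sizes only

# Fixed random weights stand in for a trained network.
W1 = [[random.gauss(0, 0.5) for _ in range(1 + Z_DIM)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(POSE_DIM)]

def nemf(t, z):
    """Continuous motion field: (time t, style code z) -> pose vector.

    Because t is a real-valued input rather than a frame index, the
    motion can be evaluated at any point in time, not just at samples.
    """
    x = [t] + z
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]
```

Because the pose is a smooth function of t, querying between any two "frames" (as in motion interpolation or in-betweening) is just another function evaluation.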
3. [CV] Masked Unsupervised Self-training for Zero-shot Image Classification
J Li, S Savarese, S C.H. Hoi
[Salesforce]
State-of-the-art computer vision models are mostly trained with supervised learning on human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However, the zero-shot performance of CLIP-like models is often insufficient for real-world adoption. In this paper, we aim to leverage abundant unlabeled data to improve the performance of a pre-trained zero-shot classifier on downstream tasks. We propose Masked Unsupervised Self-Training (MUST), a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images. MUST jointly optimizes three objectives to learn both class-level global features and pixel-level local features, and enforces a regularization between the two. We demonstrate the efficacy of MUST on 8 downstream tasks across a variety of domains, where it improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification. For instance, MUST achieves a zero-shot top-1 accuracy of 77.7% on ImageNet using ViT-B, 9.4% higher than CLIP. Our code is available at https://github.com/salesforce/MUST.
https://arxiv.org/abs/2206.02967
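One of the two supervision sources above, pseudo-labels, typically means keeping only the classifier's confident predictions on unlabeled images as training targets. A minimal stdlib sketch of such confidence-thresholded pseudo-labeling follows; the 0.7 threshold and the function names are illustrative assumptions, not values from the paper:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pseudo_labels(batch_logits, threshold=0.7):
    """Keep only confident predictions as (sample index, class) pseudo-labels.

    Low-confidence samples are dropped so that noisy predictions do not
    become training targets during self-training.
    """
    kept = []
    for i, logits in enumerate(batch_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            kept.append((i, probs.index(conf)))
    return kept
```

In a self-training loop, the retained pairs would be fed back as labels for another round of finetuning, alongside the masked-image objective on raw pixels.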
4. [LG] Few-Shot Learning by Dimensionality Reduction in Gradient Space
M Gauch, M Beck, T Adler, D Kotsur, S Fiel, H Eghbal-zadeh...
[Johannes Kepler University Linz & Anyline GmbH & Austrian Academy of Sciences]
We introduce SubGD, a novel few-shot learning method based on the recent finding that stochastic gradient descent updates tend to live in a low-dimensional parameter subspace. In experimental and theoretical analyses, we show that models confined to a suitable predefined subspace generalize well for few-shot learning. A suitable subspace fulfills three criteria across the given tasks: it (a) allows the training error to be reduced by gradient flow, (b) leads to models that generalize well, and (c) can be identified by stochastic gradient descent. SubGD identifies these subspaces from an eigendecomposition of the autocorrelation matrix of update directions across different tasks. We show that low-dimensional suitable subspaces can be identified for few-shot learning of dynamical systems whose properties vary with one or a few parameters of the analytical system description. Such systems are ubiquitous in real-world applications in science and engineering. We experimentally corroborate the advantages of SubGD on three distinct dynamical-systems problem settings, significantly outperforming popular few-shot learning methods in terms of both sample efficiency and performance.
https://arxiv.org/abs/2206.03483
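The subspace-identification step described above can be illustrated in a few lines of stdlib Python: form the autocorrelation matrix of update directions collected across tasks, extract a dominant eigenvector (here via power iteration rather than a full eigendecomposition), and restrict later updates to the spanned subspace. The data and dimensions are illustrative only, not the paper's setup:

```python
def autocorrelation(dirs):
    """C = (1/n) * sum over update directions g of the outer product g g^T."""
    d = len(dirs[0])
    C = [[0.0] * d for _ in range(d)]
    for g in dirs:
        for i in range(d):
            for j in range(d):
                C[i][j] += g[i] * g[j] / len(dirs)
    return C

def dominant_eigvec(C, iters=200):
    """Power iteration for the top eigenvector of a symmetric matrix C."""
    d = len(C)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # degenerate matrix: nothing to iterate on
            return v
        v = [x / norm for x in w]
    return v

def project(g, basis):
    """Project gradient g onto the subspace spanned by orthonormal basis vectors."""
    out = [0.0] * len(g)
    for v in basis:
        coeff = sum(gi * vi for gi, vi in zip(g, v))
        for i in range(len(g)):
            out[i] += coeff * v[i]
    return out
```

Fine-tuning on a new few-shot task would then apply `project` to every gradient before the SGD step, so learning stays inside the subspace identified from previous tasks.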
5. [LG] Impossibility of Collective Intelligence
K Muandet
[Max Planck Institute for Intelligent Systems]
Democratization of AI involves training and deploying machine learning models across heterogeneous and potentially massive environments. Diversity of data opens up a number of possibilities to advance AI systems, but also introduces pressing concerns such as privacy, security, and equity that require special attention. This work shows that it is theoretically impossible to design a rational learning algorithm that can successfully learn across heterogeneous environments, an ability we call collective intelligence (CI). By representing learning algorithms as choice correspondences over a hypothesis space, we are able to axiomatize them with essential properties. Unfortunately, the only feasible algorithm compatible with all of the axioms is standard empirical risk minimization (ERM), which learns arbitrarily from a single environment. Our impossibility result reveals informational incomparability between environments as one of the foremost obstacles for researchers designing novel algorithms that learn from multiple environments, and sheds light on prerequisites for success in critical areas of machine learning such as out-of-distribution generalization, federated learning, algorithmic fairness, and multi-modal learning.
https://arxiv.org/abs/2206.02786
Other noteworthy papers:
[AS] Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
S Kim, A Gholami, A Shaw, N Lee, K Mangalam, J Malik, M W. Mahoney, K Keutzer
[UC Berkeley]
https://arxiv.org/abs/2206.00888
[CL] What a Creole Wants, What a Creole Needs
(low-resource natural language research, with Creole languages as a case study)
H Lent, K Ogueji, M d Lhoneux, O Ahia, A Søgaard
[University of Copenhagen & University of Waterloo & University of Washington]
https://arxiv.org/abs/2206.00437
[LG] A Continuous Time Framework for Discrete Denoising Models
A Campbell, J Benton, V D Bortoli, T Rainforth, G Deligiannidis, A Doucet
[University of Oxford & CNRS ENS Ulm]
https://arxiv.org/abs/2205.14987
[LG] OOD Link Prediction Generalization Capabilities of Message-Passing GNNs in Larger Test Graphs
Y Zhou, G Kutyniok, B Ribeiro
[Purdue University & Ludwig-Maximilians-Universität München] (2022)
https://arxiv.org/abs/2205.15117
If any images included in this content raise copyright concerns, please contact us promptly so we can remove them.

