LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
Reposted from 爱可可爱生活
Summary: text-driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP; high-definition video generation with diffusion models; the Tree Mover's Distance, bridging graph metrics and the stability of graph neural networks; measuring learned equivariance with the Lie derivative; neural conservation laws from a divergence-free perspective; a taxonomy and review of state-of-the-art generalisation research in NLP; optimizing discrete text prompts with reinforcement learning; general robot manipulation with multimodal prompts; real-world robot learning with masked visual pre-training.
1、[CV] clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP
J N. M. Pinkney, C Li
[Lambda, Inc]
clip2latent: text-driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP. The paper proposes a new method to efficiently build a text-to-image model from a pre-trained CLIP and StyleGAN, enabling text-driven sampling with an existing generative model without any external data or fine-tuning. This is achieved by training a diffusion model, conditioned on CLIP embeddings, to sample the latent vectors of a pre-trained StyleGAN; the alignment between CLIP's image and text embeddings removes the need for any text-labelled data when training the conditional diffusion model. Experiments show that clip2latent generates high-resolution (1024x1024 pixel) images from text prompts with fast sampling, high image quality, and low training compute and data requirements. Because the well-studied StyleGAN architecture is used without further fine-tuning, existing methods for controlling and editing generated images apply directly, adding a further layer of control to the text-to-image pipeline.
We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN. It enables text driven sampling with an existing generative model without any external data or fine-tuning. This is achieved by training a diffusion model conditioned on CLIP embeddings to sample latent vectors of a pre-trained StyleGAN, which we call clip2latent. We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text labelled data for training the conditional diffusion model. We demonstrate that clip2latent allows us to generate high-resolution (1024x1024 pixels) images based on text prompts with fast sampling, high image quality, and low training compute and data requirements. We also show that the use of the well studied StyleGAN architecture, without further fine-tuning, allows us to directly apply existing methods to control and modify the generated images adding a further layer of control to our text-to-image pipeline.
https://arxiv.org/abs/2210.02347
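To make the mechanism concrete, here is a minimal sketch of the clip2latent idea in PyTorch: a diffusion prior over StyleGAN latents, conditioned on CLIP image embeddings at training time (so no text labels are needed) and on the aligned CLIP text embedding of a prompt at sampling time. All names (`LatentDiffusionPrior`, the stand-in dimensions) and the plain DDPM noise-prediction loss are illustrative assumptions, not the paper's exact implementation:

```python
# Illustrative sketch only: the real system uses frozen pre-trained CLIP and
# StyleGAN models; random stand-ins are used here so the snippet runs as-is.
import torch
import torch.nn as nn

D_CLIP, D_W, T = 512, 512, 1000  # CLIP dim, StyleGAN w dim, diffusion steps

class LatentDiffusionPrior(nn.Module):
    """Predicts the noise added to a StyleGAN latent w, given a CLIP embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_W + D_CLIP + 1, 1024), nn.SiLU(), nn.Linear(1024, D_W))

    def forward(self, w_noisy, t, clip_emb):
        t_feat = t.float().unsqueeze(-1) / T  # crude timestep conditioning
        return self.net(torch.cat([w_noisy, clip_emb, t_feat], dim=-1))

betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)

def training_step(model, w, clip_image_emb):
    # Standard DDPM objective on w. Conditioning on the *image* embedding of
    # each training image means no text-labelled data is required; at sampling
    # time the aligned *text* embedding of a prompt is substituted instead.
    t = torch.randint(0, T, (w.shape[0],))
    eps = torch.randn_like(w)
    a = alphas_bar[t].unsqueeze(-1)
    w_noisy = a.sqrt() * w + (1 - a).sqrt() * eps
    return ((model(w_noisy, t, clip_image_emb) - eps) ** 2).mean()

model = LatentDiffusionPrior()
loss = training_step(model, torch.randn(8, D_W), torch.randn(8, D_CLIP))
loss.backward()
```

A sampled latent would then be fed to the frozen StyleGAN synthesis network to produce the final image.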
2、[CV] Imagen Video: High Definition Video Generation with Diffusion Models
J Ho, W Chan, C Saharia, J Whang, R Gao, A Gritsenko, D P. Kingma, B Poole, M Norouzi, D J. Fleet, T Salimans
[Google Research]
Imagen Video: high-definition video generation with diffusion models. Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models: given a text prompt, it generates high-definition video using a base video generation model followed by a sequence of interleaved spatial and temporal super-resolution models. The paper describes how the system is scaled up to a high-definition text-to-video model, including design decisions such as fully-convolutional temporal and spatial super-resolution models at certain resolutions and the v-parameterization of the diffusion models. It also confirms and transfers findings from prior work on diffusion-based image generation to the video setting, and applies progressive distillation with classifier-free guidance to the video models for fast, high-quality sampling. Imagen Video not only generates high-fidelity video but also exhibits a high degree of controllability and world knowledge, including the ability to produce diverse videos and text animations in various artistic styles and with 3D object understanding.
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.
https://arxiv.org/abs/2210.02303
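One design decision the abstract calls out is the v-parameterization of the diffusion models. As a hedged reference (following the progressive-distillation formulation of Salimans and Ho, from which v-parameterization originates; Imagen Video's exact implementation may differ), and assuming the usual variance-preserving constraint alpha_t^2 + sigma_t^2 = 1:

```python
# v-parameterization sketch: the network regresses v = alpha*eps - sigma*x0
# rather than eps or x0. With x_t = alpha*x0 + sigma*eps and
# alpha^2 + sigma^2 = 1, both x0 and eps are recoverable from (x_t, v).
import torch

def v_target(x0, eps, alpha_t, sigma_t):
    return alpha_t * eps - sigma_t * x0

def recover_from_v(x_t, v, alpha_t, sigma_t):
    x0  = alpha_t * x_t - sigma_t * v   # = (alpha^2 + sigma^2) * x0
    eps = sigma_t * x_t + alpha_t * v   # = (alpha^2 + sigma^2) * eps
    return x0, eps

# Quick check of the algebra on random tensors.
x0, eps = torch.randn(2, 8), torch.randn(2, 8)
alpha, sigma = torch.tensor(0.8), torch.tensor(0.6)  # alpha^2 + sigma^2 = 1
x_t = alpha * x0 + sigma * eps
x0_hat, eps_hat = recover_from_v(x_t, v_target(x0, eps, alpha, sigma), alpha, sigma)
assert torch.allclose(x0_hat, x0, atol=1e-5)
assert torch.allclose(eps_hat, eps, atol=1e-5)
```

Among other benefits, predicting v keeps the regression target well-scaled at high noise levels, which matters when distilling samplers down to few steps.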
3、[LG] Tree Mover's Distance: Bridging Graph Metrics and Stability of Graph Neural Networks
C Chuang, S Jegelka
[MIT CSAIL]
Tree Mover's Distance: bridging graph metrics and the stability of graph neural networks. Understanding the generalization and robustness of machine learning models fundamentally relies on assuming an appropriate metric on the data space, and identifying such a metric is particularly challenging for non-Euclidean data such as graphs. The paper proposes a pseudometric for attributed graphs, the Tree Mover's Distance (TMD), and studies its relation to generalization. Via a hierarchical optimal transport problem, TMD reflects both the local distribution of node attributes and the distribution of local computation trees, which are known to be decisive for the learning behavior of graph neural networks (GNNs). First, TMD captures properties relevant to graph classification: a simple TMD-SVM performs competitively with standard GNNs. Second, TMD is related to the generalization of GNNs under distribution shifts and correlates well with the performance drop under such shifts.
Understanding generalization and robustness of machine learning models fundamentally relies on assuming an appropriate metric on the data space. Identifying such a metric is particularly challenging for non-Euclidean data such as graphs. Here, we propose a pseudometric for attributed graphs, the Tree Mover's Distance (TMD), and study its relation to generalization. Via a hierarchical optimal transport problem, TMD reflects the local distribution of node attributes as well as the distribution of local computation trees, which are known to be decisive for the learning behavior of graph neural networks (GNNs). First, we show that TMD captures properties relevant to graph classification: a simple TMD-SVM performs competitively with standard GNNs. Second, we relate TMD to generalization of GNNs under distribution shifts, and show that it correlates well with performance drop under such shifts.
https://arxiv.org/abs/2210.01906
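To give a feel for the hierarchical optimal transport behind TMD, here is a heavily simplified depth-1 sketch: the cost between two nodes combines their attribute distance with an optimal matching of their neighbors' attributes (padded with zero "blank" nodes), and the graph distance is then an optimal matching over nodes. The actual TMD recurses over deeper computation trees and uses a depth-weighting scheme that this sketch omits; all function names here are illustrative:

```python
# Simplified, depth-1 illustration of a TMD-style distance; not the paper's
# exact definition (which recurses over computation trees of a chosen depth).
import numpy as np
from scipy.optimize import linear_sum_assignment

def pad(X, n, d):
    # Pad a set of attribute vectors with zero ("blank") rows up to size n.
    out = np.zeros((n, d))
    out[: len(X)] = X
    return out

def match_cost(A, B):
    # Minimal total cost of optimally matching two equal-sized attribute sets.
    C = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    r, c = linear_sum_assignment(C)
    return C[r, c].sum()

def tmd_depth1(X1, adj1, X2, adj2, w=1.0):
    """X: (n_i, d) node attribute arrays; adj: list of neighbor index lists."""
    d = X1.shape[1]
    n = max(len(X1), len(X2))
    C = np.zeros((n, n))  # cost of transporting node i of G1 to node j of G2
    for i in range(n):
        for j in range(n):
            xi = X1[i] if i < len(X1) else np.zeros(d)
            xj = X2[j] if j < len(X2) else np.zeros(d)
            Ni = X1[adj1[i]] if i < len(X1) else np.zeros((0, d))
            Nj = X2[adj2[j]] if j < len(X2) else np.zeros((0, d))
            m = max(len(Ni), len(Nj))
            nbr = match_cost(pad(Ni, m, d), pad(Nj, m, d)) if m else 0.0
            C[i, j] = np.linalg.norm(xi - xj) + w * nbr
    r, c = linear_sum_assignment(C)
    return C[r, c].sum()

# Example: a 3-node path vs. a 3-node cycle with identical 1-d attributes;
# the distance is positive because the neighborhood structures differ.
X = np.array([[0.0], [1.0], [2.0]])
print(tmd_depth1(X, [[1], [0, 2], [1]], X, [[1, 2], [0, 2], [0, 1]]))
```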
4、[LG] The Lie Derivative for Measuring Learned Equivariance
N Gruver, M Finzi, M Goldblum, A G Wilson
[New York University]
Measuring learned equivariance with the Lie derivative. Equivariance guarantees that a model's predictions capture key symmetries in the data: when an image is translated or rotated, an equivariant model's representation of that image translates or rotates accordingly. The success of convolutional neural networks has historically been tied to the translation equivariance encoded directly in their architecture. The rising success of vision Transformers, which have no explicit architectural bias toward equivariance, challenges this narrative and suggests that augmentations and training data may also play a significant role in their performance. To better understand the role of equivariance in recent vision models, the paper introduces the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters, and uses it to study the equivariance properties of hundreds of pretrained models spanning CNN, Transformer, and Mixer architectures. The scale of this analysis makes it possible to separate the effect of architecture from other factors such as model size or training method. Surprisingly, many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers such as pointwise non-linearities, and as models grow larger and more accurate they tend to display more equivariance, regardless of architecture; for example, Transformers can be more equivariant than convolutional neural networks after training.
Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training.
https://arxiv.org/abs/2210.02984
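As a rough illustration of the measurement itself (not the paper's exact autograd formulation, which handles several transformation groups), the Lie derivative of a model along continuous image translations can be approximated with a finite difference; for a classifier, whose output should be invariant, the norm of this derivative quantifies how much the prediction moves as the input is infinitesimally translated. The `translate` helper and the probe below are illustrative assumptions:

```python
# Finite-difference sketch of a Lie derivative along horizontal translations:
# d/dt f(shift_t(x)) at t = 0; a near-zero norm means f is locally invariant.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def translate(x, t):
    # Differentiable horizontal shift of images x (N, C, H, W) by t pixels,
    # via bilinear resampling (zero padding at the boundary).
    N, _, H, W = x.shape
    theta = torch.tensor([[1.0, 0.0, 2 * t / W], [0.0, 1.0, 0.0]])
    grid = F.affine_grid(theta.expand(N, 2, 3), x.shape, align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def lie_derivative_norm(f, x, eps=1e-2):
    # Central difference approximation of the derivative at t = 0.
    lie = (f(translate(x, eps)) - f(translate(x, -eps))) / (2 * eps)
    return lie.norm(dim=-1).mean()

model = resnet18().eval()  # untrained here; any classifier can be probed
x = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    print(float(lie_derivative_norm(lambda z: model(z).softmax(-1), x)))
```

Note that the continuous, resampling-based shift is what exposes aliasing effects: a model can look perfectly invariant to whole-pixel shifts while its Lie derivative along sub-pixel translations is large.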
5、[LG] Neural Conservation Laws: A Divergence-Free Perspective
J Richter-Powell, Y Lipman, R T. Q. Chen
[Vector Institute & Meta AI]
Neural conservation laws: a divergence-free perspective. The paper investigates parameterizations of deep neural networks that by design satisfy the continuity equation, a fundamental conservation law. This is enabled by the observation that solutions of the continuity equation can be represented as divergence-free vector fields. The paper therefore builds divergence-free neural networks through the concept of differential forms and, with the aid of automatic differentiation, realizes two practical constructions. As a result, pairs of densities and vector fields that always satisfy the continuity equation can be parameterized by construction, foregoing the need for extra penalty methods or expensive numerical simulation. These models are proven to be universal, so they can represent any divergence-free vector field. Finally, the approach is validated experimentally on neural-network-based solutions to fluid equations, solving for the Hodge decomposition, and learning dynamical optimal transport maps.
We investigate the parameterization of deep neural networks that by design satisfy the continuity equation, a fundamental conservation law. This is enabled by the observation that solutions of the continuity equation can be represented as a divergence-free vector field. We hence propose building divergence-free neural networks through the concept of differential forms, and with the aid of automatic differentiation, realize two practical constructions. As a result, we can parameterize pairs of densities and vector fields that always satisfy the continuity equation by construction, foregoing the need for extra penalty methods or expensive numerical simulation. Furthermore, we prove these models are universal and so can be used to represent any divergence-free vector field. Finally, we experimentally validate our approaches on neural network-based solutions to fluid equations, solving for the Hodge decomposition, and learning dynamical optimal transport maps.
https://arxiv.org/abs/2210.01741
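The core observation is easiest to see in two dimensions, where the differential-forms construction reduces to the classical stream function: any smooth scalar psi gives a vector field v = (dpsi/dy, -dpsi/dx) whose divergence vanishes identically because mixed partial derivatives commute. Below is a minimal autograd sketch of this 2D special case, assuming an MLP potential; the paper's construction generalizes it to arbitrary dimension and to density/field pairs satisfying the continuity equation:

```python
# 2D special case of a divergence-free network: v = (dpsi/dy, -dpsi/dx),
# so div v = psi_yx - psi_xy = 0 by construction. Names are illustrative.
import torch
import torch.nn as nn

psi = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))  # smooth potential

def divergence_free_field(xy):
    # xy: (N, 2) points with requires_grad=True.
    (g,) = torch.autograd.grad(psi(xy).sum(), xy, create_graph=True)
    return torch.stack([g[:, 1], -g[:, 0]], dim=-1)

# Numerical check that the divergence vanishes up to float error.
xy = torch.randn(8, 2, requires_grad=True)
v = divergence_free_field(xy)
dvx_dx = torch.autograd.grad(v[:, 0].sum(), xy, retain_graph=True)[0][:, 0]
dvy_dy = torch.autograd.grad(v[:, 1].sum(), xy)[0][:, 1]
print(float((dvx_dx + dvy_dy).abs().max()))  # ~0
```

Because the constraint holds exactly at every point, no divergence penalty term is needed during training, which is precisely the advantage the abstract emphasizes.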
A few more papers worth noting:
[CL] State-of-the-art generalisation research in NLP: a taxonomy and review
D Hupkes, M Giulianelli, V Dankers...
[Meta AI & University of Amsterdam & Allen Institute for AI & University of Cambridge & ...]
https://arxiv.org/abs/2210.03050
[CL] RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning
M Deng, J Wang, C Hsieh, Y Wang, H Guo, T Shu, M Song, E P. Xing, Z Hu
[CMU & UC San Diego & MIT]
https://arxiv.org/abs/2205.12548
[RO] VIMA: General Robot Manipulation with Multimodal Prompts
Y Jiang, A Gupta, Z Zhang, G Wang, Y Dou, Y Chen, L Fei-Fei, A Anandkumar, Y Zhu, L Fan
[NVIDIA & Stanford & Macalester College & Tsinghua]
https://arxiv.org/abs/2210.03094
[RO] Real-World Robot Learning with Masked Visual Pre-training
I Radosavovic, T Xiao, S James, P Abbeel, J Malik, T Darrell
[UC Berkeley]
https://arxiv.org/abs/2210.03109