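爱可可 AI Frontier Picks (10.13)

LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics

Reposted from 爱可可爱生活

Summary: interpreting Stable Diffusion with cross attention; human-AI coordination via human-regularized search and learning; unifying diffusion models' latent space, with applications to CycleDiffusion and guidance; higher-order denoising diffusion solvers; grounded language model reasoning through simulation; discovered policy optimisation; in-hand object rotation via rapid motor adaptation; a multi-stage diffusion model via progressive signal transformation; deep object detection for waterbird monitoring using aerial imagery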

1. [CV] What the DAAM: Interpreting Stable Diffusion Using Cross Attention

R Tang, A Pandey, Z Jiang, G Yang, K Kumar, J Lin, F Ture
[Comcast Applied AI & University of Waterloo]
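Large-scale diffusion networks are a major milestone in text-to-image generation, with some rated close to real photographs in human evaluation. They nevertheless remain poorly understood and lack explainability and interpretability analyses, largely because they are proprietary and closed-source. This paper performs a text-image attribution analysis on Stable Diffusion, a recently open-sourced large diffusion model. To produce pixel-level attribution maps, it proposes DAAM, a method that upscales and aggregates cross-attention activations in the latent denoising subnetwork. Correctness is supported by comparing DAAM's unsupervised semantic segmentation of its own generated images against supervised segmentation models: DAAM reaches 61.0 mIoU on images generated from COCO captions and outperforms supervised models on open-vocabulary segmentation with 51.5 mIoU. The paper also finds that certain parts of speech, such as punctuation and conjunctions, influence the generated image most, in agreement with prior literature, while determiners and numerals influence it least, suggesting poor numeracy.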

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, with some performing similar to real photographs in human evaluation. However, they remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature. In this paper, to shine some much-needed light on text-to-image diffusion models, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced large diffusion model. To produce pixel-level attribution maps, we propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork. We support its correctness by evaluating its unsupervised semantic segmentation quality on its own generated imagery, compared to supervised segmentation models. We show that DAAM performs strongly on COCO caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation, for an mIoU of 51.5. We further find that certain parts of speech, like punctuation and conjunctions, influence the generated imagery most, which agrees with the prior literature, while determiners and numerals the least, suggesting poor numeracy. To our knowledge, we are the first to propose and study word-pixel attribution for large-scale text-to-image diffusion models.

https://arxiv.org/abs/2210.04885
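
One way to picture the upscale-and-aggregate step described for DAAM is the toy sketch below. It assumes square per-head attention maps for a single word, nearest-neighbour upscaling, plain summation, and a relative threshold `tau`. These choices, the function names, and the random stand-in data are illustrative assumptions and are not taken from the paper's code.

```python
import numpy as np

def upscale_nearest(attn, size):
    """Upscale a square (h, h) map to (size, size) by repeating each cell.

    Assumes size is a multiple of h (an assumption of this sketch).
    """
    factor = size // attn.shape[0]
    return np.kron(attn, np.ones((factor, factor)))

def aggregate_heatmap(attn_maps, size):
    """Sum upscaled attention maps into one (size, size) heat map.

    attn_maps: list of arrays shaped (heads, h, h), one per layer/time step,
    all holding the attention paid to the same word.
    """
    heat = np.zeros((size, size))
    for layer_map in attn_maps:
        for head_map in layer_map:
            heat += upscale_nearest(head_map, size)
    return heat

def binarize(heat, tau=0.5):
    """Turn a heat map into a mask: keep pixels at or above tau * max."""
    return heat >= tau * heat.max()

# Random stand-ins for cross-attention activations at two resolutions.
rng = np.random.default_rng(0)
maps = [rng.random((2, 8, 8)), rng.random((2, 16, 16))]
heat = aggregate_heatmap(maps, 32)
mask = binarize(heat)
print(heat.shape, int(mask.sum()))
```

The resulting mask is the kind of object that could be scored against a reference segmentation with mIoU, which is how the abstract says the attribution maps are evaluated.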

2. [LG] Human-AI Coordination via Human-Regularized Search and Learning

H Hu, D J Wu, A Lerer, J Foerster, N Brown
[Meta AI & Oxford University]
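This paper studies how to build AI agents that collaborate well with humans in partially observable, fully cooperative environments, given datasets of human behavior. Inspired by piKL, it develops a three-step algorithm that achieves strong performance when coordinating with real humans on the Hanabi benchmark. First, regularized search and behavioral cloning produce a better human model that captures diverse skill levels. Second, the policy regularization idea is integrated into reinforcement learning to train a human-like best response to that human model. Third, regularized search is applied on top of the best-response policy at test time to handle out-of-distribution situations when playing with humans. The method is evaluated in two large-scale human experiments: it outperforms experts when playing with a diverse group of human players in ad-hoc teams, and it beats a baseline that is a vanilla best response to behavioral cloning when experts play repeatedly with the two agents.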

We consider the problem of making AI agents that collaborate well with humans in partially observable fully cooperative environments given datasets of human behavior. Inspired by piKL, a human-data-regularized search method that improves upon a behavioral cloning policy without diverging far away from it, we develop a three-step algorithm that achieves strong performance in coordinating with real humans in the Hanabi benchmark. We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels. Then, we integrate the policy regularization idea into reinforcement learning to train a human-like best response to the human model. Finally, we apply regularized search on top of the best response policy at test time to handle out-of-distribution challenges when playing with humans. We evaluate our method in two large scale experiments with humans. First, we show that our method outperforms experts when playing with a group of diverse human players in ad-hoc teams. Second, we show that our method beats a vanilla best response to behavioral cloning baseline by having experts play repeatedly with the two agents.

https://arxiv.org/abs/2210.05125
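
The idea of improving on a behavioral cloning policy without moving far from it can be illustrated with a KL-regularized objective. Maximizing E_π[Q] − λ·KL(π ‖ π_BC) over action distributions has the closed-form solution π(a) ∝ π_BC(a)·exp(Q(a)/λ). The sketch below only evaluates that formula for one decision. The Q-values, the anchor probabilities, and the name `lam` are made-up assumptions, and this is not the paper's three-step algorithm.

```python
import numpy as np

def kl_regularized_policy(q_values, anchor_probs, lam):
    """Return pi(a) proportional to anchor(a) * exp(Q(a) / lam).

    Large lam keeps the policy close to the anchor (e.g. a behavioral
    cloning policy); small lam moves it toward the greedy action.
    Anchor probabilities are assumed to be strictly positive.
    """
    q = np.asarray(q_values, dtype=float)
    anchor = np.asarray(anchor_probs, dtype=float)
    logits = np.log(anchor) + q / lam
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical numbers: the anchor prefers action 0, Q prefers action 1.
q = [1.0, 2.0, 0.5]
anchor = [0.7, 0.2, 0.1]
for lam in (100.0, 1.0, 0.05):
    print(lam, np.round(kl_regularized_policy(q, anchor, lam), 3))
```

With a large `lam` the output stays near the anchor distribution, and with a small `lam` it concentrates on the highest-value action, which is the trade-off the regularization strength controls.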

3. [CV] Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance

C H Wu, F D l Torre
[CMU]
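Diffusion models have achieved unprecedented performance in generative modeling. Their latent code is usually formulated as a sequence of gradually denoised samples, whereas GANs, VAEs, and normalizing flows have a simpler (e.g., Gaussian) latent space. This paper gives an alternative, Gaussian formulation of the latent space of various diffusion models, together with an invertible DPM-Encoder that maps images into that latent space. Although the formulation follows purely from the definition of diffusion models, it has several intriguing consequences. (1) Empirically, a common latent space emerges from two diffusion models trained independently on related domains. Building on this, the paper proposes CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation; applied to text-to-image diffusion models, it shows that large-scale text-to-image diffusion models can serve as zero-shot image-to-image editors. (2) Pre-trained diffusion models and GANs can be guided by controlling their latent codes in a unified, plug-and-play formulation based on energy-based models. With the CLIP model and a face recognition model as guidance, the paper shows that diffusion models cover low-density sub-populations and individuals better than GANs.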

Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.

https://arxiv.org/abs/2210.05559

4. [LG] GENIE: Higher-Order Denoising Diffusion Solvers

T Dockhorn, A Vahdat, K Kreis
[NVIDIA]
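Denoising diffusion models (DDMs) have emerged as a powerful class of generative models. A forward diffusion process slowly perturbs the data, while a deep model learns to gradually denoise it. Synthesis amounts to solving a differential equation (DE) defined by the learned model, and high-quality generation requires slow iterative solvers. This paper proposes Higher-Order Denoising Diffusion Solvers (GENIE): based on truncated Taylor methods, it derives a new higher-order solver that significantly accelerates synthesis. The solver relies on higher-order gradients of the perturbed data distribution, i.e., higher-order score functions. In practice only Jacobian-vector products (JVPs) are needed, and the paper proposes extracting them from the first-order score network via automatic differentiation. The JVPs are then distilled into a separate neural network so that the higher-order terms can be computed efficiently during synthesis; only a small additional head on top of the first-order score network has to be trained. GENIE is validated on multiple image generation benchmarks, where it outperforms all previous solvers. Unlike recent methods that fundamentally alter the DDM generation process, GENIE solves the true generative DE and still supports applications such as encoding and guided sampling.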

Denoising diffusion models (DDMs) have emerged as a powerful class of generative models. A forward diffusion process slowly perturbs the data, while a deep model learns to gradually denoise. Synthesis amounts to solving a differential equation (DE) defined by the learnt model. Solving the DE requires slow iterative solvers for high-quality generation. In this work, we propose Higher-Order Denoising Diffusion Solvers (GENIE): Based on truncated Taylor methods, we derive a novel higher-order solver that significantly accelerates synthesis. Our solver relies on higher-order gradients of the perturbed data distribution, that is, higher-order score functions. In practice, only Jacobian-vector products (JVPs) are required and we propose to extract them from the first-order score network via automatic differentiation. We then distill the JVPs into a separate neural network that allows us to efficiently compute the necessary higher-order terms for our novel sampler during synthesis. We only need to train a small additional head on top of the first-order score network. We validate GENIE on multiple image generation benchmarks and demonstrate that GENIE outperforms all previous solvers. Unlike recent methods that fundamentally alter the generation process in DDMs, our GENIE solves the true generative DE and still enables applications such as encoding and guided sampling.

https://arxiv.org/abs/2210.05475
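
To see why a truncated Taylor method can need fewer steps than a first-order solver, and where a Jacobian-vector product enters, consider the toy comparison below. For an autonomous ODE dx/dt = f(x), the second-order Taylor step is x + h·f(x) + (h²/2)·J_f(x)·f(x), where the last term is a JVP. The sketch uses the linear test equation dx/dt = −r·x with a hand-written JVP in place of automatic differentiation. The rates, step count, and function names are assumptions, and this is not the GENIE sampler or a diffusion model.

```python
import numpy as np

RATES = np.array([1.0, 2.0])  # assumed decay rates of the toy ODE

def f(x):
    """Right-hand side of dx/dt = -RATES * x."""
    return -RATES * x

def jvp_f(x, v):
    """Jacobian-vector product of f at x in direction v.

    The Jacobian is diag(-RATES), so the product is written by hand here;
    an autodiff library would compute the same quantity.
    """
    return -RATES * v

def euler_step(x, h):
    return x + h * f(x)

def taylor2_step(x, h):
    fx = f(x)
    # The second-order term uses d/dt f(x(t)) = J_f(x) @ f(x), a JVP.
    return x + h * fx + 0.5 * h**2 * jvp_f(x, fx)

def integrate(step, x0, t_end, n_steps):
    x = np.array(x0, dtype=float)
    h = t_end / n_steps
    for _ in range(n_steps):
        x = step(x, h)
    return x

x0 = np.array([1.0, 1.0])
exact = x0 * np.exp(-RATES * 1.0)
err_euler = np.abs(integrate(euler_step, x0, 1.0, 10) - exact)
err_taylor = np.abs(integrate(taylor2_step, x0, 1.0, 10) - exact)
print("Euler error:  ", err_euler)
print("Taylor-2 error:", err_taylor)
```

With the same ten steps, the second-order update lands noticeably closer to the exact solution than Euler does on this toy problem, which is the general motivation for higher-order solvers.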

5. [CL] Mind's Eye: Grounded Language Model Reasoning through Simulation

R Liu, J Wei, S S Gu, T Wu, S Vosoughi, C Cui, D Zhou, A M. Dai
[Google Research & Dartmouth College]
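Successful and effective communication between humans and AI relies on a shared experience of the world. Current language models (LMs) are trained only on written text and therefore miss the grounded experience humans have in the real world: failing to relate language to the physical world leads to misrepresented knowledge and obvious reasoning mistakes. This paper presents Mind's Eye, a paradigm for grounding language model reasoning in the physical world. Given a physical reasoning question, a computational physics engine (DeepMind's MuJoCo) simulates the possible outcomes, and the simulation results are included as part of the input so that the language model can reason over them. Experiments on 39 tasks in a physics alignment benchmark show that Mind's Eye improves reasoning by a large margin (absolute accuracy gains of 27.9% zero-shot and 46.0% few-shot on average). Smaller language models equipped with Mind's Eye can match the performance of models 100x larger. Ablation studies confirm the robustness of Mind's Eye.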

Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.

https://arxiv.org/abs/2210.05359
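
The simulate-then-prompt pattern described above can be sketched without a physics engine. In the toy example below, a closed-form free-fall formula stands in for MuJoCo, and the "simulation results" are written into the prompt ahead of the question. The function names, the prompt wording, and the example question are all hypothetical, and no language model is called.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def simulate_fall_time(height_m):
    """Toy stand-in for a physics engine: time to fall from rest, no drag."""
    return math.sqrt(2.0 * height_m / G)

def build_grounded_prompt(question, heights):
    """Prepend simulated outcomes to the question as plain-text evidence.

    heights maps an object name to its drop height in metres.
    """
    times = {name: simulate_fall_time(h) for name, h in heights.items()}
    first = min(times, key=times.get)
    hints = [f"{name} lands after {t:.2f} s." for name, t in times.items()]
    hints.append(f"{first} lands first.")
    return (
        "Simulation results: " + " ".join(hints)
        + "\nQuestion: " + question
        + "\nAnswer:"
    )

prompt = build_grounded_prompt(
    "Ball A is dropped from 5 m and Ball B from 20 m. Which lands first?",
    {"Ball A": 5.0, "Ball B": 20.0},
)
print(prompt)
```

The resulting string is what would be passed to a language model, so the model can read off the simulated outcome instead of relying only on what it learned from text.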

Other papers worth noting:

[LG] Discovered Policy Optimisation

C Lu, J G Kuba, A Letcher, L Metz, C S d Witt, J Foerster
[University of Oxford & UC Berkeley & Google Brain]
https://arxiv.org/abs/2210.05639

[RO] In-Hand Object Rotation via Rapid Motor Adaptation

H Qi, A Kumar, R Calandra, Y Ma, J Malik
[UC Berkeley & Meta AI]
https://arxiv.org/abs/2210.04887

[CV] f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

J Gu, S Zhai, Y Zhang, M A Bautista, J Susskind
[Apple]
https://arxiv.org/abs/2210.04955

[CV] Deep object detection for waterbird monitoring using aerial imagery

K Kabra, A Xiong, W Li, M Luo, W Lu, R Garcia, D Vijay, J Yu, M Tang, T Yu, H Arnold, A Vallery, R Gibbons, A Barman
[Rice University]
https://arxiv.org/abs/2210.04868