爱可可 AI Frontier Picks (5.18)

LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics

Reposted from 爱可可爱生活

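Summary: re-evaluating Word Mover's Distance; adversarial masking for self-supervised learning; variational Bayes on discrete representations with self-annealed stochastic quantization; deep spectral methods for unsupervised semantic segmentation and localization; diffusion models for adversarial purification; a unified framework for implicit Sinkhorn differentiation; emergent bartering behaviour in multi-agent reinforcement learning; productivity assessment of neural code completion; learning visual styles from audio-visual associations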

1、[LG] Re-evaluating Word Mover's Distance

R Sato, M Yamada, H Kashima

[Kyoto University]

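Re-evaluating Word Mover's Distance. Word Mover's Distance (WMD) is a fundamental technique for measuring the similarity of two documents. At its core, WMD exploits the underlying geometry of the word space through an optimal transport formulation. The original WMD study reported that it outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by clear margins on a range of datasets. This paper argues that the evaluation in that study may be misleading. It re-evaluates WMD and the classical baselines and finds that, with appropriate preprocessing (namely L1 normalization), the classical baselines are competitive with WMD. It also draws an analogy between WMD and L1-normalized BOW, finding that in high-dimensional spaces WMD resembles BOW not only in performance but also in its distance values.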

The word mover’s distance (WMD) is a fundamental technique for measuring the similarity of two documents. As the crux of WMD, it can take advantage of the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins in various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performances of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ an appropriate preprocessing, i.e., L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance of WMD but also the distance values resemble those of BOW in high dimensional spaces.

https://arxiv.org/abs/2105.14403
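
To make the preprocessing point concrete, here is a minimal Python sketch (not the authors' code). It assumes a toy four-word vocabulary and, as a stand-in for the high-dimensional case, a hypothetical cost matrix in which every pair of distinct words is at distance 1. Under that assumption the optimal transport cost equals half the L1 distance between the L1-normalized BOW vectors, which is one way to read the analogy drawn in the abstract. The helper names `l1_normalize` and `wmd` are illustrative.

```python
import numpy as np
from scipy.optimize import linprog


def l1_normalize(counts):
    """Scale a bag-of-words count vector so its entries sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


def wmd(a, b, cost):
    """Optimal transport cost between word distributions a and b.

    Solves min <T, cost> subject to T >= 0, row sums = a, column sums = b.
    """
    n, m = len(a), len(b)
    # Row-sum constraints on the flattened plan: sum_j T[i, j] = a[i]
    row = np.kron(np.eye(n), np.ones((1, m)))
    # Column-sum constraints: sum_i T[i, j] = b[j]
    col = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([row, col]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    return res.fun


# Toy word counts over a 4-word vocabulary (illustrative data only).
doc_short = np.array([1, 1, 0, 0])
doc_long = 10 * doc_short          # same content, ten times longer
doc_other = np.array([0, 1, 1, 0])

# Raw BOW distance is dominated by document length; normalized BOW is not.
raw_l1 = np.abs(doc_short - doc_long).sum()
norm_l1 = np.abs(l1_normalize(doc_short) - l1_normalize(doc_long)).sum()

# Assumption: all distinct words are equally far apart (distance 1).
cost = 1.0 - np.eye(4)
a, b = l1_normalize(doc_short), l1_normalize(doc_other)
wmd_ab = wmd(a, b, cost)
half_l1 = 0.5 * np.abs(a - b).sum()

print(raw_l1, norm_l1, wmd_ab, half_l1)
```

With real embeddings the cost matrix is not constant, so the equality becomes an approximation at best; the sketch only shows why length normalization matters when comparing BOW against WMD.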

2、[CV] Adversarial Masking for Self-Supervised Learning

Y Shi, N. Siddharth, P H.S. Torr, A R. Kosiorek

[University of Oxford & The University of Edinburgh & DeepMind]

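Adversarial masking for self-supervised learning. The paper proposes ADIOS, a masked image modeling (MIM) framework for self-supervised learning that uses an adversarial objective to learn a masking function and an image encoder at the same time. The image encoder is trained to minimize the distance between the representations of the original image and the masked image, while the masking function tries to maximize that distance. ADIOS consistently improves on state-of-the-art self-supervised learning (SSL) methods across tasks and datasets, including classification on ImageNet100 and STL10, transfer learning on CIFAR10/100, Flowers102 and iNaturalist, and robustness as measured on the backgrounds challenge, while producing semantically meaningful masks. Unlike modern MIM models such as MAE, BEiT and iBOT, ADIOS does not depend on the image-patch tokenization of Vision Transformers and can be implemented with convolutional backbones. Experiments show that the masks learned by ADIOS improve the representation learning of SSL methods more than the masking schemes used in popular MIM models.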

We propose ADIOS, a masked image modeling (MIM) framework for self-supervised learning, which simultaneously learns a masking function and an image encoder using an adversarial objective. The image encoder is trained to minimise the distance between the representations of the original image and of a masked image. The masking function, conversely, aims at maximising this distance. ADIOS consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets, including classification on ImageNet100 and STL10, transfer learning on CIFAR10/100, Flowers102 and iNaturalist, as well as robustness evaluated on the backgrounds challenge (Xiao et al., 2021), while generating semantically meaningful masks. Unlike modern MIM models such as MAE, BEiT and iBOT, ADIOS does not rely on the image-patch tokenisation construction of Vision Transformers, and can be implemented with convolutional backbones. We further demonstrate that the masks learned by ADIOS are more effective in improving representation learning of SSL methods than masking schemes used in popular MIM models.

https://arxiv.org/abs/2201.13100
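
The adversarial objective described in the abstract can be written compactly as a min-max problem. The notation is an assumption made for illustration and may differ from the paper's: $f_\theta$ is the image encoder, $m_\phi$ the masking function, $\odot$ element-wise multiplication, and $\mathcal{D}$ a distance between representations.

```latex
\theta^{*}, \phi^{*}
  \;=\; \arg\min_{\theta}\,\max_{\phi}\;
  \mathcal{D}\bigl( f_{\theta}(x),\; f_{\theta}\bigl(x \odot m_{\phi}(x)\bigr) \bigr)
```

The encoder parameters $\theta$ pull the two representations together, while the mask parameters $\phi$ look for the occlusion that pushes them furthest apart.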

3、[LG] SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

Y Takida, T Shibuya, W Liao, C Lai, J Ohmura, T Uesaka...

[Sony Group Corporation & Sony Corporation of America]

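SQ-VAE: variational Bayes on discrete representations with self-annealed stochastic quantization. A well-known problem with the vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a small fraction of the codebook's full capacity, a phenomenon known as codebook collapse. The paper hypothesizes that the VQ-VAE training scheme, which relies on several carefully designed heuristics, is the root of this problem. It proposes a new training scheme that extends the standard VAE with novel stochastic dequantization and quantization, called the stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, quantization tends to be stochastic early in training and gradually converges toward deterministic quantization, a behavior the authors call self-annealing. Experiments show that SQ-VAE improves codebook utilization without the usual heuristics, and that it outperforms VAE and VQ-VAE on vision- and speech-related tasks.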

One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend that the quantization is stochastic at the initial stage of the training but gradually converges toward a deterministic quantization, which we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without using common heuristics. Furthermore, we empirically show that SQ-VAE is superior to VAE and VQ-VAE in vision- and speech-related tasks.

https://arxiv.org/abs/2205.07547
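
The self-annealing behavior can be illustrated with a small Python sketch. It assumes, for illustration only, that the probability of assigning an encoder output `z` to codebook entry `b_k` is proportional to `exp(-||z - b_k||^2 / (2 * variance))`; the paper's exact parameterization may differ. In SQ-VAE the variance-like parameter is learned, whereas here it is set by hand to show the two regimes. The codebook, `z`, and function names are hypothetical.

```python
import numpy as np


def assignment_probs(z, codebook, variance):
    """Categorical distribution over codebook entries for encoder output z.

    Probabilities are proportional to exp(-||z - b_k||^2 / (2 * variance)).
    """
    sq_dist = ((codebook - z) ** 2).sum(axis=1)
    logits = -sq_dist / (2.0 * variance)
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def stochastic_quantize(z, codebook, variance, rng):
    """Sample a codebook index instead of always taking the nearest entry."""
    probs = assignment_probs(z, codebook, variance)
    return rng.choice(len(codebook), p=probs)


# Toy codebook with three 2-D entries (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([0.4, 0.1])
rng = np.random.default_rng(0)

# Large variance: assignments are close to uniform (stochastic regime).
probs_early = assignment_probs(z, codebook, variance=10.0)
# Small variance: assignments concentrate on the nearest entry.
probs_late = assignment_probs(z, codebook, variance=1e-3)

nearest = int(((codebook - z) ** 2).sum(axis=1).argmin())
sample_late = stochastic_quantize(z, codebook, 1e-3, rng)

print(probs_early, probs_late, nearest, sample_late)
```

As the variance shrinks, sampling becomes indistinguishable from the nearest-neighbor lookup used in deterministic vector quantization.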

4、[CV] Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization

L Melas-Kyriazi, C Rupprecht, I Laina, A Vedaldi

[University of Oxford]

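Deep spectral methods: a strong baseline for unsupervised semantic segmentation and localization. Unsupervised localization and segmentation are long-standing challenges in computer vision that involve decomposing an image into semantically meaningful segments without any labeled data. Because dense image annotations are difficult and costly to obtain, these tasks are especially interesting in the unsupervised setting, but existing unsupervised methods struggle with complex scenes containing multiple objects. Unlike existing methods that rely purely on deep learning, the paper draws on traditional spectral segmentation and recasts image decomposition as a graph partitioning problem. It examines the eigenvectors of the Laplacian of a feature affinity matrix computed from self-supervised networks, and finds that these eigenvectors already decompose an image into meaningful segments that can readily be used to localize objects in a scene. Clustering the features associated with these segments across a dataset yields well-delineated, nameable regions, that is, semantic segmentations. Experiments on complex datasets (PASCAL VOC, MS-COCO) show that this simple spectral method outperforms the state of the art in unsupervised localization and segmentation by a wide margin. The method can also be used for a variety of complex image editing tasks, such as background removal and compositing.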

Unsupervised localization and segmentation are longstanding computer vision challenges that involve decomposing an image into semantically meaningful segments without any labeled data. These tasks are particularly interesting in an unsupervised setting due to the difficulty and cost of obtaining dense image annotations, but existing unsupervised approaches struggle with complex scenes containing multiple objects. Differently from existing methods, which are purely based on deep learning, we take inspiration from traditional spectral segmentation methods by reframing image decomposition as a graph partitioning problem. Specifically, we examine the eigenvectors of the Laplacian of a feature affinity matrix from self-supervised networks. We find that these eigenvectors already decompose an image into meaningful segments, and can be readily used to localize objects in a scene. Furthermore, by clustering the features associated with these segments across a dataset, we can obtain well-delineated, nameable regions, i.e. semantic segmentations. Experiments on complex datasets (PASCAL VOC, MS-COCO) demonstrate that our simple spectral method outperforms the state-of-the-art in unsupervised localization and segmentation by a significant margin. Furthermore, our method can be readily used for a variety of complex image editing tasks, such as background removal and compositing.

https://arxiv.org/abs/2205.07839
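
The graph-partitioning idea can be sketched in a few lines of Python. This is a simplified illustration, not the paper's pipeline: it assumes a cosine-similarity affinity with negative values clipped to zero and an unnormalized Laplacian, and it uses hand-made 2-D "patch features" in place of features from a self-supervised network. The function name `spectral_bipartition` is hypothetical.

```python
import numpy as np


def spectral_bipartition(features):
    """Split patches into two segments using the Fiedler vector.

    features: (num_patches, dim) array, one feature vector per image patch.
    Returns a boolean segment mask and the Laplacian eigenvalues.
    """
    # Normalize features so the affinity is a cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Affinity matrix; negative similarities are clipped to zero.
    w = np.clip(f @ f.T, 0.0, None)
    np.fill_diagonal(w, 0.0)  # no self-loops
    # Unnormalized graph Laplacian L = D - W.
    laplacian = np.diag(w.sum(axis=1)) - w
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # The eigenvector of the second-smallest eigenvalue (Fiedler vector)
    # gives a two-way partition by sign.
    fiedler = eigvecs[:, 1]
    return fiedler > 0, eigvals


# Six toy patches: the first three and the last three point in
# roughly orthogonal directions (illustrative data only).
features = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0],
])
mask, eigvals = spectral_bipartition(features)
print(mask, eigvals)
```

The sign of an eigenvector is arbitrary, so which group is labeled `True` can vary; only the grouping itself is meaningful.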

5、[LG] Diffusion Models for Adversarial Purification

W Nie, B Guo, Y Huang, C Xiao, A Vahdat, A Anandkumar

[NVIDIA & Caltech]

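Diffusion models for adversarial purification. Adversarial purification is a class of defense methods that use a generative model to remove adversarial perturbations. These methods make no assumptions about the form of the attack or the classification model, so they can protect existing classifiers against unseen threats. Their performance, however, currently lags behind adversarial training. The paper proposes DiffPure, which uses diffusion models for adversarial purification: given an adversarial example, it first diffuses the example with a small amount of noise following the forward diffusion process, then recovers the clean image through the reverse generative process. To evaluate the method against strong adaptive attacks efficiently and at scale, the authors propose using the adjoint method to compute full gradients of the reverse generative process. Extensive experiments on three image datasets (CIFAR10, ImageNet and CelebA-HQ) with three classifier architectures (ResNet, WideResNet and ViT) show that the method achieves state-of-the-art results, outperforming current adversarial training and adversarial purification methods, often by a large margin.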

Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative model. These methods do not make assumptions on the form of attack and the classification model, and thus can defend pre-existing classifiers against unseen threats. However, their performance currently falls behind adversarial training methods. In this work, we propose DiffPure that uses diffusion models for adversarial purification: Given an adversarial example, we first diffuse it with a small amount of noise following a forward diffusion process, and then recover the clean image through a reverse generative process. To evaluate our method against strong adaptive attacks in an efficient and scalable way, we propose to use the adjoint method to compute full gradients of the reverse generative process. Extensive experiments on three image datasets including CIFAR10, ImageNet and CelebA-HQ with three classifier architectures including ResNet, WideResNet and ViT demonstrate that our method achieves the state-of-the-art results, outperforming current adversarial training and adversarial purification methods, often by a large margin. Project page: https://diffpure.github.io.

https://arxiv.org/abs/2205.07460
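
The two-stage purification pipeline can be outlined in Python. This is a structural sketch under stated assumptions, not DiffPure itself: the forward step uses the common discrete-time form `x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise`, and the reverse process is a hypothetical placeholder that only rescales its input, because a real reverse process requires a trained diffusion model. All names here (`forward_diffuse`, `purify`, `placeholder_reverse`) are illustrative.

```python
import numpy as np


def forward_diffuse(x, alpha_bar, rng):
    """Sample x_t from N(sqrt(alpha_bar) * x, (1 - alpha_bar) * I)."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * noise


def placeholder_reverse(x_t, alpha_bar):
    """Hypothetical stand-in for a trained model's reverse process.

    It only undoes the signal scaling and does not remove any noise.
    """
    return x_t / np.sqrt(alpha_bar)


def purify(x_adv, alpha_bar, reverse_process, rng):
    """Add a small amount of noise, then run the reverse process."""
    x_t = forward_diffuse(x_adv, alpha_bar, rng)
    return reverse_process(x_t, alpha_bar)


rng = np.random.default_rng(0)
x_adv = np.full((100, 100), 0.5)  # toy "adversarial image"
alpha_bar = 0.9                   # close to 1 means little added noise

x_t = forward_diffuse(x_adv, alpha_bar, rng)
# Empirical noise level should be close to sqrt(1 - alpha_bar).
noise_std = (x_t - np.sqrt(alpha_bar) * x_adv).std()

x_purified = purify(x_adv, alpha_bar, placeholder_reverse, rng)
print(noise_std, x_purified.shape)
```

The noise level is the key design choice: it should be large enough to drown out the adversarial perturbation but small enough that the reverse process can still recover the original image content.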