LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics

Reposted from 爱可可爱生活

 

1、[CL] MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

K Pillutla, S Swayamdipta, R Zellers, J Thickstun, S Welleck, Y Choi, Z Harchaoui

[University of Washington & Allen Institute for Artificial Intelligence & Stanford University]

MAUVE: measuring the gap between neural text and human text using divergence frontiers. As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. The paper proposes MAUVE, a comparison measure for open-ended text generation that uses divergence frontiers to directly compare the learned distribution of a text generation model with the distribution of human-written text. MAUVE scales to modern text generation models by computing the divergence frontier in a quantized embedding space. An extensive empirical study on three open-ended generation tasks shows that MAUVE identifies known properties of generated text, scales naturally with model size, correlates with human judgments, and carries fewer restrictions than existing distributional evaluation metrics.

As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We introduce MAUVE, a comparison measure for open-ended text generation, which directly compares the learnt distribution from a text generation model to the distribution of human-written text using divergence frontiers. MAUVE scales up to modern text generation models by computing information divergences in a quantized embedding space. Through an extensive empirical study on three open-ended generation tasks, we find that MAUVE identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
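To make the divergence-frontier computation concrete, here is a minimal, hedged sketch (not the authors' released implementation): model and human texts are assumed to be already embedded by a pretrained language model, the embeddings are quantized with a shared k-means, and the KL divergences of both histograms against their mixtures trace out the frontier. MAUVE itself further summarizes this frontier into a single scalar; the sketch stops at the frontier points. `model_emb` and `human_emb` are placeholder arrays.

```python
# Hedged sketch of the divergence-frontier computation, assuming texts have already
# been embedded into arrays `model_emb` and `human_emb` of shape (num_samples, dim).
import numpy as np
from sklearn.cluster import KMeans

def quantized_histograms(model_emb, human_emb, k=100, seed=0, smooth=1e-6):
    """Quantize both embedding sets with a shared k-means and return two histograms."""
    km = KMeans(n_clusters=k, random_state=seed).fit(np.vstack([model_emb, human_emb]))
    p = np.bincount(km.predict(model_emb), minlength=k).astype(float) + smooth
    q = np.bincount(km.predict(human_emb), minlength=k).astype(float) + smooth
    return p / p.sum(), q / q.sum()

def divergence_frontier(p, q, num_weights=25):
    """KL divergences of P (model) and Q (human) against mixtures R = w*P + (1-w)*Q."""
    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))
    frontier = []
    for w in np.linspace(0.01, 0.99, num_weights):
        r = w * p + (1 - w) * q
        frontier.append((kl(q, r), kl(p, r)))  # one frontier point per mixture weight
    return np.array(frontier)
```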

https://weibo.com/1402400261/L47zTqpur

 

2、[LG] Long-range and hierarchical language predictions in brains and algorithms

C Caucheteux, A Gramfort, J King

[Facebook AI Research & Université Paris-Saclay]

Long-range and hierarchical language predictions in brains and algorithms. Deep learning has recently made remarkable progress in natural language processing, yet the resulting algorithms remain far from matching the language abilities of the human brain. Predictive coding theory offers a potential explanation for this gap: deep language algorithms are optimized to predict adjacent words, whereas the human brain is tuned to make long-range and hierarchical predictions. To test this hypothesis, the paper analyzes the fMRI signals of 304 subjects, each listening to 70 minutes of short stories. After confirming that the activations of deep language algorithms map linearly onto those of the brain, the authors show that augmenting these models with long-range forecast representations improves their brain mapping. The results further reveal a hierarchy of predictions in the brain, whereby the fronto-parietal cortices forecast more abstract and more distant representations than the temporal cortices. Overall, the study strengthens predictive coding theory and points to a critical role for long-range and hierarchical predictions in natural language processing.

Deep learning has recently made remarkable progress in natural language processing. Yet, the resulting algorithms remain far from competing with the language abilities of the human brain. Predictive coding theory offers a potential explanation to this discrepancy: while deep language algorithms are optimized to predict adjacent words, the human brain would be tuned to make long-range and hierarchical predictions. To test this hypothesis, we analyze the fMRI brain signals of 304 subjects each listening to 70min of short stories. After confirming that the activations of deep language algorithms linearly map onto those of the brain, we show that enhancing these models with long-range forecast representations improves their brain-mapping. The results further reveal a hierarchy of predictions in the brain, whereby the fronto-parietal cortices forecast more abstract and more distant representations than the temporal cortices. Overall, this study strengthens predictive coding theory and suggests a critical role of long-range and hierarchical predictions in natural language processing.
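The core brain-mapping step lends itself to a short sketch. The code below is not the authors' pipeline; it is a hedged illustration of the idea: fit a cross-validated linear (ridge) map from word-level network activations to fMRI voxel responses, then check whether concatenating forecast features for future words improves the fit. `current_feats`, `future_feats`, and `fmri` are placeholder arrays aligned on the same word/time axis.

```python
# Hedged sketch of a linear encoding model: network activations -> fMRI voxels.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def brain_score(features, fmri, cv=5):
    """Cross-validated R^2 of a ridge map from features (n_words, n_dims) to voxels."""
    model = RidgeCV(alphas=np.logspace(-2, 4, 7))
    return cross_val_score(model, features, fmri, cv=cv, scoring="r2").mean()

def forecast_gain(current_feats, future_feats, fmri):
    """How much adding long-range forecast features improves the brain score."""
    base = brain_score(current_feats, fmri)
    augmented = brain_score(np.hstack([current_feats, future_feats]), fmri)
    return augmented - base
```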

https://weibo.com/1402400261/L47EXd2YW

 

3、[CV] HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing

Y Alaluf, O Tov, R Mokady, R Gal, A H. Bermano

[Tel Aviv University]

HyperStyle: hypernetwork-based StyleGAN inversion for real image editing. Inverting real images into StyleGAN's latent space is a well-studied problem, yet applying existing approaches to real-world scenarios remains an open challenge because of an inherent trade-off between reconstruction and editability: latent-space regions that accurately represent real images typically suffer from degraded semantic control. Recent work proposes to mitigate this trade-off by fine-tuning the generator so that the target image is added to well-behaved, editable regions of the latent space. While promising, such fine-tuning is impractical for widespread use, since it requires a lengthy training phase for every new image. This work brings that idea into the realm of encoder-based inversion and proposes HyperStyle, a hypernetwork that learns to modulate StyleGAN's weights so as to faithfully express a given image in editable regions of the latent space. A naive modulation scheme would require training a hypernetwork with over three billion parameters; careful network design reduces this to a size in line with existing encoders. HyperStyle yields reconstructions comparable to those of optimization techniques while retaining the near-real-time inference of encoders, and it proves effective in several applications beyond inversion, including editing out-of-domain images never seen during training.

The inversion of real images into StyleGAN’s latent space is a well-studied problem. Nevertheless, applying existing approaches to real-world scenarios remains an open challenge, due to an inherent trade-off between reconstruction and editability: latent space regions which can accurately represent real images typically suffer from degraded semantic control. Recent work proposes to mitigate this trade-off by fine-tuning the generator to add the target image to well-behaved, editable regions of the latent space. While promising, this fine-tuning scheme is impractical for prevalent use as it requires a lengthy training phase for each new image. In this work, we introduce this approach into the realm of encoder-based inversion. We propose HyperStyle, a hypernetwork that learns to modulate StyleGAN’s weights to faithfully express a given image in editable regions of the latent space. A naive modulation approach would require training a hypernetwork with over three billion parameters. Through careful network design, we reduce this to be in line with existing encoders. HyperStyle yields reconstructions comparable to those of optimization techniques with the near real-time inference capabilities of encoders. Lastly, we demonstrate HyperStyle’s effectiveness on several applications beyond the inversion task, including the editing of out-of-domain images which were never seen during training. Code is available on our project page: https://yuval-alaluf.github.io/hyperstyle/.
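As a rough illustration of "a hypernetwork that modulates generator weights" (and not HyperStyle's actual architecture), the PyTorch sketch below predicts one multiplicative offset per output channel of a generator convolution from the target image and the current reconstruction; HyperStyle applies this kind of modulation across StyleGAN's layers and refines it iteratively. All shapes and module names are illustrative placeholders.

```python
# Hedged PyTorch sketch: a tiny hypernetwork that rescales a conv layer's weights.
import torch
import torch.nn as nn

class WeightModulator(nn.Module):
    def __init__(self, out_channels, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a shared feature extractor
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, out_channels)  # one offset per output channel

    def forward(self, target, recon, conv_weight):
        """Return modulated weights w * (1 + delta); conv_weight is (out, in, k, k)."""
        feats = self.backbone(torch.cat([target, recon], dim=1))  # concat RGB pairs
        delta = self.head(feats).mean(dim=0)                      # average over the batch
        return conv_weight * (1 + delta.view(-1, 1, 1, 1))
```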

https://weibo.com/1402400261/L47JanElt

4、[LG] A Universal Law of Robustness via Isoperimetry

S Bubeck, M Sellke

[Microsoft Research & Stanford University]

A universal law of robustness via isoperimetry. Classically, interpolating data with a parametrized model class is possible as long as the number of parameters exceeds the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with far more parameters than this classical theory would suggest. The paper offers a theoretical explanation: for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly, and smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. This universal law of robustness is proved for any smoothly parametrized function class with polynomial-size weights and any covariate distribution verifying isoperimetry (or a mixture thereof). In the case of two-layer neural networks and Gaussian covariates, the law was conjectured in prior work by Bubeck, Li, and Nagaraj. The result can also be read as an improved generalization bound for model classes consisting of smooth functions.

Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry (or a mixture thereof). In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
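A hedged worked form of the law, with constants and logarithmic factors suppressed: for a function class smoothly parametrized by p (polynomially bounded) parameters that fits n samples in ambient dimension d below the noise level, the paper's headline bound is roughly

```latex
\[
  \operatorname{Lip}(f) \;\gtrsim\; \sqrt{\frac{n\,d}{p}}
  \qquad \text{for every such interpolant } f \in \mathcal{F},
\]
```

so an O(1)-Lipschitz (i.e. robust) interpolant forces p ≳ nd, a factor of d more than the roughly n parameters needed for mere interpolation.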

https://weibo.com/1402400261/L47MKBg56

 

5、[CV] Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

K Preechakul, N Chatthee, S Wizadwongsa, S Suwajanakorn

[VISTEC]

Diffusion autoencoders: toward a meaningful and decodable representation. Diffusion probabilistic models (DPMs) have achieved remarkable image-generation quality that rivals GANs'. Unlike GANs, however, DPMs use a set of latent variables that lack semantic meaning and cannot serve as a useful representation for other tasks. This paper explores using DPMs for representation learning and seeks to extract a meaningful, decodable representation of an input image via autoencoding. The key idea is to use a learnable encoder to discover high-level semantics and a DPM as the decoder to model the remaining stochastic variations. The method can encode any image into a two-part latent code, where the first part is semantically meaningful and linear and the second part captures stochastic details, allowing near-exact reconstruction. This capability enables challenging applications that currently foil GAN-based methods, such as attribute manipulation on real images, and the two-level encoding also improves denoising efficiency and naturally facilitates various downstream tasks, including few-shot conditional sampling.

Diffusion probabilistic models (DPMs) have achieved remarkable quality in image generation that rivals GANs’. But unlike GANs, DPMs use a set of latent variables that lack semantic meaning and cannot serve as a useful representation for other tasks. This paper explores the possibility of using DPMs for representation learning and seeks to extract a meaningful and decodable representation of an input image via autoencoding. Our key idea is to use a learnable encoder for discovering the high-level semantics, and a DPM as the decoder for modeling the remaining stochastic variations. Our method can encode any image into a two-part latent code, where the first part is semantically meaningful and linear, and the second part captures stochastic details, allowing near-exact reconstruction. This capability enables challenging applications that currently foil GAN-based methods, such as attribute manipulation on real images. We also show that this two-level encoding improves denoising efficiency and naturally facilitates various downstream tasks including few-shot conditional sampling.
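To make the two-part latent code concrete, here is a hedged sketch of the encode/decode split under stated assumptions: `sem_encoder` and `eps_model` (a noise predictor conditioned on the semantic code) are hypothetical stand-ins, and the deterministic DDIM-style updates below only illustrate how z_sem carries semantics while x_T keeps the stochastic detail needed for near-exact reconstruction.

```python
# Hedged sketch of a diffusion-autoencoder-style two-part code (z_sem, x_T).
import torch

@torch.no_grad()
def ddim_step(x, t_cur, t_next, z_sem, eps_model, alpha_bar):
    """One deterministic DDIM update from timestep t_cur to t_next (either direction)."""
    eps = eps_model(x, t_cur, z_sem)                 # hypothetical conditional noise model
    a_cur, a_next = alpha_bar[t_cur], alpha_bar[t_next]
    x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

@torch.no_grad()
def encode(x, sem_encoder, eps_model, alpha_bar, steps=50):
    z_sem = sem_encoder(x)                           # semantic half of the latent code
    ts = torch.linspace(0, len(alpha_bar) - 1, steps).long()
    x_t = x
    for t_cur, t_next in zip(ts[:-1], ts[1:]):       # run the diffusion ODE forward
        x_t = ddim_step(x_t, t_cur, t_next, z_sem, eps_model, alpha_bar)
    return z_sem, x_t                                # x_T captures the stochastic detail

@torch.no_grad()
def decode(z_sem, x_T, eps_model, alpha_bar, steps=50):
    ts = torch.linspace(len(alpha_bar) - 1, 0, steps).long()
    x_t = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):       # run the same ODE backward
        x_t = ddim_step(x_t, t_cur, t_next, z_sem, eps_model, alpha_bar)
    return x_t
```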

 

A few more papers worth noting:

 

[CV] ATS: Adaptive Token Sampling For Efficient Vision Transformers

M Fayyaz, S A Kouhpayegani, F R Jafari, E Sommerlade, H R V Joze, H Pirsiavash, J Gall

[Microsoft & University of Maryland & Technical University of Berlin & University of California, Davis & University of Bonn]

https://weibo.com/1402400261/L47UV9c5K

[CV] DiffSDFSim: Differentiable Rigid-Body Dynamics With Implicit Shapes

M Strecke, J Stueckler

[Tübingen]

https://weibo.com/1402400261/L47XzpzZ6

[CV] AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

L Meng, H Li, B Chen, S Lan, Z Wu, Y Jiang, S Lim

[Fudan University & University of Maryland & Meta AI]

https://weibo.com/1402400261/L47ZL3AOO

 

[CV] NeuSample: Neural Sample Field for Efficient View Synthesis

J Fang, L Xie, X Wang, X Zhang, W Liu, Q Tian

[Huazhong University of Science and Technology & Huawei Inc]

https://weibo.com/1402400261/L481BlsI8
