LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: diffusion probabilistic modeling of 3D protein backbones for the motif-scaffolding problem; extreme masking for learning instance and distributed visual representations; independent mechanism analysis with VAEs; neuro-symbolic language modeling with automaton-augmented retrieval; learning video implicit neural representations for continuous space-time super-resolution; Transformer-based sensor fusion for autonomous driving imitation; a general framework for proving the equivariant strong lottery ticket hypothesis; open challenges in deep stereo; improving per-object depth estimation for monocular 3D detection and tracking

 

1. [LG] Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem

B L. Trippe, J Yim, D Tischer, T Broderick, D Baker, R Barzilay, T Jaakkola

[MIT & University of Washington]

Diffusion probabilistic modeling of 3D protein backbones for the motif-scaffolding problem. Constructing a scaffold structure that supports a desired motif, thereby conferring protein function, shows broad promise for vaccine and enzyme design, but a general solution to the motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce diverse scaffolds. This paper proposes learning a distribution over diverse, longer protein backbone structures via an E(3)-equivariant graph neural network, and introduces SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; the algorithm is the first with a theoretical guarantee of conditional sampling from a diffusion model in the large-compute limit. Designed scaffolds are evaluated by how well they align with AlphaFold2-predicted structures. Experiments show that the method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.

Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.

https://arxiv.org/abs/2206.04119
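
The core of SMCDiff is a particle filter over the reverse diffusion: the motif is forward-noised once, particles are propagated by the unconditional reverse kernel with their motif coordinates clamped to the noised motif, and they are reweighted and resampled by how well they explain the next noised motif value. Below is a minimal, self-contained numerical sketch of that idea on a toy 1-D Gaussian diffusion; the analytic reverse kernel (exact when the data are standard normal) stands in for the paper's learned E(3)-equivariant denoiser, and all constants are illustrative.

# A minimal sketch of the particle-filtering idea behind SMCDiff on a toy
# Gaussian diffusion; not the authors' code.
import numpy as np

rng = np.random.default_rng(0)
T, K = 50, 256                                # diffusion steps, particles
betas = np.linspace(1e-4, 0.1, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

# Dimension 0 plays the role of the fixed motif, dimension 1 the scaffold.
motif = 1.5
# Forward-diffuse the motif once: x_t = sqrt(abar_t) x_0 + sqrt(1-abar_t) eps.
motif_traj = [np.sqrt(ab) * motif + np.sqrt(1.0 - ab) * rng.standard_normal()
              for ab in alphas_bar]

x = rng.standard_normal((K, 2))               # particles start at x_T ~ N(0, I)
for t in range(T - 1, -1, -1):
    x[:, 0] = motif_traj[t]                   # clamp motif dim to noised motif
    mean = np.sqrt(alphas[t]) * x             # toy reverse-kernel mean
    std = np.sqrt(betas[t])
    x_next = mean + std * rng.standard_normal((K, 2))
    if t > 0:
        # Reweight particles by how well their reverse kernel explains the
        # next noised motif value, then resample (the SMC step).
        logw = -0.5 * ((motif_traj[t - 1] - mean[:, 0]) / std) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        x = x_next[rng.choice(K, size=K, p=w)]
    else:
        x = x_next

print("conditional scaffold samples:", x[:5, 1])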

 

2. [CV] Extreme Masking for Learning Instance and Distributed Visual Representations

Z Wu, Z Lai, X Sun, S Lin

[Microsoft Research Asia & CMU]

Extreme masking for learning instance and distributed visual representations. This paper presents a scalable approach for simultaneously learning distributed representations over individual tokens and a holistic instance representation. Self-attention blocks represent the distributed tokens, followed by cross-attention blocks that aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75%-90%) as the data augmentation for supervision. The model, named ExtreMA, follows the plain BYOL approach, in which the instance representation from the unmasked subset is trained to predict that from the intact input. Learning requires the model to capture informative variations in an instance rather than encouraging invariances. The paper makes three contributions: 1) random masking is a strong and computationally efficient data augmentation for learning generalizable attention representations; 2) with multiple samplings per instance, extreme masking greatly speeds up learning and is hungry for more data; 3) distributed representations can be learned from instance supervision alone, unlike the per-token supervision in masked modeling.

The paper presents a scalable approach for learning distributed representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75%-90%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input. Learning requires the model to capture informative variations in an instance, instead of encouraging invariances. The paper makes three contributions: 1) Random masking is a strong and computationally efficient data augmentation for learning generalizable attention representations. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and hungers for more data. 3) Distributed representations can be learned from the instance supervision alone, unlike per-token supervisions in masked modeling.

https://arxiv.org/abs/2206.04667
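
As described, ExtreMA is a BYOL-style setup in which the online branch sees only a small unmasked subset of patch tokens and must predict the target branch's encoding of the intact input. A minimal PyTorch sketch of that recipe follows; the module sizes, the single learnable pooling query, and the cosine loss are illustrative assumptions, not the authors' exact architecture.

# A minimal sketch of BYOL with extreme masking: self-attention over the
# visible tokens, cross-attention pooling into an instance vector, and an
# EMA target branch encoding the intact input.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenEncoder(nn.Module):
    def __init__(self, dim=128, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)  # distributed tokens
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # instance query
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        tokens = self.blocks(tokens)                       # per-token self-attention
        q = self.query.expand(tokens.size(0), -1, -1)
        instance, _ = self.pool(q, tokens, tokens)         # cross-attention pooling
        return instance.squeeze(1)

online = TokenEncoder()
target = copy.deepcopy(online)                 # momentum (EMA) branch
for p in target.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

x = torch.randn(8, 196, 128)                   # a batch of 14x14 patch tokens
keep = torch.rand(8, 196).argsort(dim=1)[:, :20]           # keep ~10% of tokens
visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, 128))

pred = predictor(online(visible))              # masked view -> prediction
with torch.no_grad():
    tgt = target(x)                            # intact input -> target
loss = 2 - 2 * F.cosine_similarity(pred, tgt).mean()       # BYOL-style loss
loss.backward()
# An EMA update of `target` from `online` (e.g. momentum 0.99) would follow.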

 

3. [LG] Embrace the Gap: VAEs Perform Independent Mechanism Analysis

P Reizinger, L Gresele, J Brady, J v Kügelgen, D Zietlow, B Schölkopf...

[University of Tübingen & Max Planck Institute for Intelligent Systems]

Embrace the gap: VAEs perform independent mechanism analysis. Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be trained efficiently via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process; yet VAEs often succeed at this task. This paper seeks to elucidate the apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. It first proves that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture, referred to as self-consistency. Leveraging self-consistency, the paper shows that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias toward decoders with column-orthogonal Jacobians, which helps recover the true latent factors. The gap between the ELBO and the log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, VAEs uncover the true latent factors when the data-generating process satisfies the IMA assumption.

Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as self-consistency. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recovering the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.

https://arxiv.org/abs/2206.02416
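
The "column-orthogonal Jacobians" bias has a standard formalization in the IMA literature, which may help make the abstract concrete. A sketch follows; the paper's exact regularized objective and its conditions are not reproduced here.

% IMA contrast of a decoder f at z, with J_f(z) the Jacobian whose i-th
% column is \partial f(z) / \partial z_i (square case shown for simplicity):
\[
  c_{\mathrm{IMA}}(f, z)
    \;=\; \sum_{i=1}^{d} \log \Bigl\| \frac{\partial f(z)}{\partial z_i} \Bigr\|
      \;-\; \log \bigl| \det J_f(z) \bigr| \;\ge\; 0,
\]
% where nonnegativity follows from Hadamard's inequality, with equality iff
% the columns of J_f(z) are orthogonal. Schematically, the paper's claim is
% that for near-deterministic decoders
\[
  \mathrm{ELBO}(x) \;\approx\; \log p_\theta(x)
    \;-\; \text{(IMA-type penalty on } J_{f_\theta}\text{)},
\]
% so maximizing the ELBO nudges the decoder toward $c_{\mathrm{IMA}} = 0$.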

 

4. [CL] Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

U Alon, F F. Xu, J He, S Sengupta, D Roth, G Neubig

[CMU & Amazon AWS & AWS AI Labs]

Neuro-symbolic language modeling with automaton-augmented retrieval. Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which may need to be performed as frequently as every time step. This paper presents RetoMaton -- a retrieval automaton -- which approximates the datastore search based on (1) saving pointers between consecutive datastore entries and (2) clustering entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The automaton is created without supervision, and a RetoMaton can be constructed from any text collection: either the original training corpus or one from another domain. Traversing this automaton at inference time, in parallel with LM inference, reduces perplexity by up to 1.85, or alternatively saves up to 83% of the nearest-neighbor searches over kNN-LM without hurting perplexity.

Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over kNN-LM (Khandelwal et al., 2020) without hurting perplexity.

https://arxiv.org/abs/2201.12431
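
The pointer-saving half of the idea is easy to picture: datastore entry i was written at corpus position i, so its successor is entry i + 1, and a hit at one step yields cheap candidates for the next step without a fresh kNN search. A minimal sketch of that retrieval path follows; the clustering into weighted-automaton states and the interpolation with the base LM are omitted, and the class names and fallback threshold here are illustrative assumptions.

# A minimal sketch of pointer-based retrieval in the spirit of RetoMaton.
import numpy as np

class Datastore:
    def __init__(self, keys, values):
        self.keys = keys                       # context embeddings, (N, d)
        self.values = values                   # next-token ids, (N,)
        self.next = np.roll(np.arange(len(values)), -1)   # pointer to entry i+1

    def knn(self, query, k=4):
        dist = np.linalg.norm(self.keys - query, axis=1)
        return np.argsort(dist)[:k]

def retrieve(ds, query, prev_hits, dist_threshold=1.0):
    if prev_hits is not None:
        cand = ds.next[prev_hits]              # follow pointers: cheap path
        dist = np.linalg.norm(ds.keys[cand] - query, axis=1)
        if dist.min() < dist_threshold:        # pointers still fit the context
            return cand[np.argsort(dist)]
    return ds.knn(query)                       # otherwise: full kNN search

# Toy usage over two decoding steps.
rng = np.random.default_rng(0)
ds = Datastore(rng.standard_normal((1000, 16)), rng.integers(0, 50, 1000))
hits = retrieve(ds, rng.standard_normal(16), None)
hits = retrieve(ds, rng.standard_normal(16), hits)
print("next-token candidates:", ds.values[hits])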

 

5. [CV] VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution

Z Chen, Y Chen, J Liu, X Xu, V Goel...

[USTC & UC San Diego & UIUC & U of Oregon & Picsart AI Research (PAIR)]

VideoINR: learning video implicit neural representations for continuous space-time super-resolution. Videos typically record continuous visual data as discrete consecutive frames. Since storing high-fidelity video is expensive, most videos are kept at a relatively low resolution and frame rate. Recent work on space-time video super-resolution (STVSR) incorporates temporal interpolation and spatial super-resolution into a unified framework, but most such methods support only a fixed up-sampling scale, which limits their flexibility and applications. Instead of following discrete representations, this paper proposes Video Implicit Neural Representation (VideoINR) and demonstrates its application to STVSR. The learned implicit neural representation can be decoded to video of arbitrary spatial resolution and frame rate. VideoINR achieves performance competitive with state-of-the-art STVSR methods on common up-sampling scales and significantly outperforms prior work on continuous and out-of-training-distribution scales.

Videos typically record the streaming and continuous visual data as discrete consecutive frames. Since the storage cost is expensive for videos of high fidelity, most of them are stored in a relatively low resolution and frame rate. Recent works of Space-Time Video Super-Resolution (STVSR) are developed to incorporate temporal interpolation and spatial super-resolution in a unified framework. However, most of them only support a fixed up-sampling scale, which limits their flexibility and applications. In this work, instead of following the discrete representations, we propose Video Implicit Neural Representation (VideoINR), and we show its applications for STVSR. The learned implicit neural representation can be decoded to videos of arbitrary spatial resolution and frame rate. We show that VideoINR achieves competitive performances with state-of-the-art STVSR methods on common up-sampling scales and significantly outperforms prior works on continuous and out-of-training-distribution scales.

https://arxiv.org/abs/2206.04647
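
The decoding side of an implicit video representation can be sketched compactly: a coordinate MLP maps a continuous (x, y, t) query, together with a feature sampled from an encoded grid, to an RGB value, so one encoding can be rendered at any resolution and frame rate. The sketch below uses nearest-neighbor feature lookup and a single MLP; VideoINR's separate spatial/temporal representations and motion-flow warping are omitted, and all sizes are illustrative.

# A minimal sketch of implicit decoding at arbitrary space-time resolution.
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),              # RGB
        )

    def forward(self, feats, coords):          # coords in [-1, 1]^3
        return self.net(torch.cat([feats, coords], dim=-1))

decoder = CoordMLP()
feat_grid = torch.randn(64, 4, 32, 32)         # (C, T, H, W) from some encoder

def decode(h, w, n_frames):
    """Render the encoded clip at an arbitrary resolution and frame rate."""
    t, y, x = torch.meshgrid(
        torch.linspace(-1, 1, n_frames),
        torch.linspace(-1, 1, h),
        torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([x, y, t], dim=-1).reshape(-1, 3)
    # Map each continuous coordinate to its nearest feature-grid cell
    # (real models interpolate features instead).
    it = ((t.reshape(-1) + 1) / 2 * 3).round().long()
    iy = ((y.reshape(-1) + 1) / 2 * 31).round().long()
    ix = ((x.reshape(-1) + 1) / 2 * 31).round().long()
    feats = feat_grid[:, it, iy, ix].t()       # (N, C)
    return decoder(feats, coords).reshape(n_frames, h, w, 3)

frames = decode(h=128, w=128, n_frames=16)     # 4x space, 4x time upsampling
print(frames.shape)                            # torch.Size([16, 128, 128, 3])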

 

A few more papers worth noting:

 

[CV] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving

K Chitta, A Prakash, B Jaeger, Z Yu, K Renz, A Geiger

[University of Tübingen & University of Illinois Urbana-Champaign]

https://arxiv.org/abs/2205.15997

 

[LG] A General Framework For Proving The Equivariant Strong Lottery Ticket Hypothesis

D Ferbach, C Tsirigotis, G Gidel, A (Joey) Bose

[École Normale Supérieure, PSL & Université de Montréal & McGill University]

https://arxiv.org/abs/2206.04270

 

[CV] Open Challenges in Deep Stereo: the Booster Dataset

P Z Ramirez, F Tosi, M Poggi, S Salti, S Mattoccia, L D Stefano

[University of Bologna]

https://arxiv.org/abs/2206.04671

 

[CV] Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking

L Jing, R Yu, H Kretzschmar, K Li...

[Waymo LLC & Johns Hopkins University & Cornell University]

https://arxiv.org/abs/2206.03666

 

 

If any images included in this content involve copyright issues, please contact us promptly so they can be removed.