爱可可 AI Frontier Picks (10.6)

LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics

Reposted from 爱可可爱生活

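Summary: diffusion models for molecular docking; integrating speech with pre-trained vision and language models; self-supervised learning of local visual features; the implicit bias of gradient descent at the edge of stability; graph neural networks for link prediction with subgraph sketching; a robot learning library with physics-based optimization; implicit warping for animation with image sets; efficiency evaluations of visual pre-training methods; probabilistic reasoning over sets using large language models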

1. [LG] DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

G Corso, H Stärk, B Jing, R Barzilay, T Jaakkola
[MIT]
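Molecular docking, the prediction of how a small-molecule ligand binds to a protein, is central to drug design. Deep learning methods that treat docking as a regression problem run faster than traditional search-based methods but have not substantially improved accuracy. DiffDock instead treats docking as a generative modeling problem: it defines a diffusion generative model over the non-Euclidean manifold of ligand poses, maps that manifold to the product space of the translational, rotational, and torsional degrees of freedom involved in docking, and develops an efficient diffusion process on that space. On PDBBind it reaches a 38% top-1 success rate (RMSD < 2 Å), well above the previous state of the art for traditional docking (23%) and deep learning (20%) methods. It also has fast inference and provides confidence estimates with high selective accuracy.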

Predicting the binding structure of a small molecule ligand to a protein -- a task known as molecular docking -- is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD < 2 Å) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.

https://arxiv.org/abs/2210.01776
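
The headline DiffDock number is a top-1 success rate at an RMSD threshold of 2 Å. As a rough illustration of how such a metric is computed, here is a minimal Python sketch. The function names and toy coordinates are hypothetical, and the sketch assumes poses are already in the same reference frame with matching atom order; it performs no superposition or symmetry correction, which a real evaluation may need.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation between two equally sized lists of 3D atom coordinates."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pose lists must be non-empty and of equal length")
    # Sum of squared coordinate differences over all atoms and all three axes.
    sq = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(sq / len(pred))

def top1_success_rate(rmsds, threshold=2.0):
    """Fraction of complexes whose top-ranked pose has RMSD below the threshold (in Å)."""
    return sum(r < threshold for r in rmsds) / len(rmsds)

# Hypothetical toy pose: every atom is shifted by 1 Å along z.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(pred, ref))

# Hypothetical top-1 RMSDs for four complexes; two fall below 2 Å.
print(top1_success_rate([0.8, 1.9, 2.5, 4.0]))
```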

2. [CL] SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Y Shih, H Wang, H Chang, L Berry, H Lee, D Harwath
[National Taiwan University & The University of Texas at Austin]
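Data-driven speech processing models usually perform well with large amounts of text supervision, but transcribed speech is costly to collect. SpeechCLIP is a framework that bridges speech and text through images, so that speech models can be enhanced without transcriptions. It takes pre-trained HuBERT and CLIP models and aligns them using paired images and spoken captions, with minimal fine-tuning. SpeechCLIP outperforms the prior state of the art on image-speech retrieval, performs zero-shot speech-text retrieval without direct supervision from transcriptions, and can retrieve semantically related keywords directly from speech.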

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.

https://arxiv.org/abs/2210.00705
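
Retrieval between modalities that share an embedding space usually reduces to ranking candidates by similarity to the query. The following Python sketch shows that step with cosine similarity. The three-dimensional vectors and file names are made up for illustration and are not outputs of HuBERT, CLIP, or SpeechCLIP; the use of cosine similarity here is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, candidates):
    """Return candidate keys sorted by descending cosine similarity to the query."""
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)

# Hypothetical embeddings: one spoken caption and three images in a shared space.
speech_emb = [0.9, 0.1, 0.0]
image_embs = {
    "dog.jpg": [1.0, 0.0, 0.0],
    "car.jpg": [0.0, 1.0, 0.0],
    "tree.jpg": [0.0, 0.0, 1.0],
}

ranking = retrieve(speech_emb, image_embs)
print(ranking)
```

The same ranking function would serve speech-to-text retrieval if the candidates were text embeddings in the same space.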

3. [CV] VICRegL: Self-Supervised Learning of Local Visual Features

A Bardes, J Ponce, Y LeCun
[Meta & PSL Research University]
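Most recent self-supervised methods for learning image representations produce either a single global feature with invariance properties, which works best for classification, or a set of local features, which works best for detection and segmentation. This paper studies the trade-off between the two and proposes VICRegL, which learns good global and local features at the same time, giving excellent detection and segmentation performance while keeping good classification performance. Two identical branches of a standard convolutional network receive two differently distorted versions of the same image. The VICReg criterion is applied to pairs of global feature vectors and, simultaneously, to pairs of local feature vectors taken before the last pooling layer. Two local feature vectors are attracted to each other if their l2 distance is below a threshold or if their relative locations are consistent with the known geometric transformation between the two input images. VICRegL shows strong performance on linear classification and segmentation transfer tasks.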

Most recent self-supervised methods for learning image representations focus on either producing a global feature with invariance properties, or producing a set of local features. The former works best for classification tasks while the latter is best for detection and segmentation tasks. This paper explores the fundamental trade-off between learning local and global features. A new method called VICRegL is proposed that learns good global and local features simultaneously, yielding excellent performance on detection and segmentation tasks while maintaining good performance on classification tasks. Concretely, two identical branches of a standard convolutional net architecture are fed two differently distorted versions of the same image. The VICReg criterion is applied to pairs of global feature vectors. Simultaneously, the VICReg criterion is applied to pairs of local feature vectors occurring before the last pooling layer. Two local feature vectors are attracted to each other if their l2-distance is below a threshold or if their relative locations are consistent with a known geometric transformation between the two input images. We demonstrate strong performance on linear classification and segmentation transfer tasks. Code and pretrained models are publicly available.

https://arxiv.org/abs/2210.01571
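
The abstract applies the VICReg criterion to both global and local feature pairs without spelling the criterion out. The pure-Python sketch below follows the general form of that criterion as commonly described: an invariance term (squared distance between paired embeddings), a variance term (a hinge that keeps each dimension's standard deviation above a target), and a covariance term (a penalty on off-diagonal covariance). The normalization choices, the target `gamma`, and the weights 25/25/1 are assumptions for illustration, not values taken from this digest.

```python
import math

def covariance(z):
    """Sample covariance matrix (d x d) of a batch z given as n rows of d floats."""
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in z]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def vicreg_terms(za, zb, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) terms for two batches of paired embeddings."""
    n, d = len(za), len(za[0])
    # Invariance: mean squared Euclidean distance between paired embeddings.
    invariance = sum((a - b) ** 2 for ra, rb in zip(za, zb)
                     for a, b in zip(ra, rb)) / n
    variance = 0.0
    cov_term = 0.0
    for z in (za, zb):
        c = covariance(z)
        # Variance: hinge pushing each dimension's standard deviation up to gamma.
        variance += sum(max(0.0, gamma - math.sqrt(c[j][j] + eps)) for j in range(d)) / d
        # Covariance: squared off-diagonal entries, to decorrelate dimensions.
        cov_term += sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return invariance, variance, cov_term

def vicreg_loss(za, zb, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the three terms; the weights are assumed defaults."""
    invariance, variance, cov_term = vicreg_terms(za, zb)
    return lam * invariance + mu * variance + nu * cov_term

# Hypothetical batches: one with spread-out embeddings, one fully collapsed.
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[1.0, 1.0] for _ in range(4)]

print(vicreg_terms(spread, spread))
print(vicreg_terms(collapsed, collapsed))
```

Identical inputs give zero invariance loss in both cases, but the collapsed batch is penalized much more heavily by the variance term, which is how this kind of criterion discourages trivial solutions. For local features, the same terms would be evaluated on the matched pairs of local feature vectors.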

4. [LG] Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

A Damian, E Nichani, J D. Lee
[Princeton University]
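Traditional analyses of gradient descent show that training is "stable", with a monotonically decreasing training loss, when the largest eigenvalue of the Hessian, also called the sharpness S(θ), is bounded by 2/η. Recent work has observed that this assumption does not hold when modern neural networks are trained with full-batch or large-batch gradient descent, and two phenomena stand out. In progressive sharpening, the sharpness rises steadily during training until it reaches the instability cutoff 2/η. At the edge of stability, the sharpness then hovers at 2/η for the rest of training while the loss keeps decreasing, though non-monotonically. This paper shows that the dynamics at the edge of stability are far from chaotic and can be captured by a cubic Taylor expansion: as the iterates diverge along the top eigenvector of the Hessian because of instability, the cubic term in the local Taylor expansion of the loss lowers the curvature until stability is restored. This property, called self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint S(θ) ≤ 2/η. The analysis gives precise predictions for the loss, the sharpness, and the deviation from the PGD trajectory throughout training, which are verified empirically in several standard settings and theoretically under mild conditions. It thereby uncovers the mechanism behind gradient descent's implicit bias towards stability.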

Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness S(θ), is bounded by 2/η, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff 2/η. The second, dubbed edge of stability, is that the sharpness hovers at 2/η for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in the direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint S(θ)≤2/η. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.

https://arxiv.org/abs/2209.15594
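
The 2/η threshold in the abstract comes from the classical quadratic picture. For a one-dimensional loss L(θ) = ½·S·θ², a gradient step multiplies θ by (1 − η·S), so the iterates shrink exactly when S < 2/η. The Python sketch below checks only this textbook threshold on a toy quadratic; it does not model the cubic term or the self-stabilization mechanism analyzed in the paper, and the function name and parameter values are illustrative.

```python
def gd_on_quadratic(sharpness, lr, theta0=1.0, steps=50):
    """Run gradient descent on L(theta) = 0.5 * sharpness * theta**2 and return the final theta."""
    theta = theta0
    for _ in range(steps):
        grad = sharpness * theta  # dL/dtheta for the quadratic loss
        theta -= lr * grad        # equivalent to theta *= (1 - lr * sharpness)
    return theta

lr = 0.1              # learning rate eta
cutoff = 2.0 / lr     # instability cutoff 2/eta

below = abs(gd_on_quadratic(0.95 * cutoff, lr))  # sharpness just under the cutoff
above = abs(gd_on_quadratic(1.05 * cutoff, lr))  # sharpness just over the cutoff

print(f"cutoff 2/eta = {cutoff}")
print(f"|theta| after 50 steps, below cutoff: {below:.4g}")
print(f"|theta| after 50 steps, above cutoff: {above:.4g}")
```

On a pure quadratic the curvature is fixed, so the run above the cutoff simply diverges. The paper's point is that on real losses the higher-order terms change the curvature as the iterates move, which this toy cannot show.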

5. [LG] Graph Neural Networks for Link Prediction with Subgraph Sketching

B P Chamberlain, S Shirobokov, E Rossi, F Frasca, T Markovich, N Hammerla, M M. Bronstein, M Hansmire
[Twitter Inc]
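Many graph neural networks (GNNs) perform poorly compared to simple heuristics on link prediction (LP) tasks. The cause is limited expressive power: they cannot count triangles (the backbone of most LP heuristics) and cannot distinguish automorphic nodes (nodes with identical structural roles). Both problems can be alleviated by learning link rather than node representations and by incorporating structural features such as triangle counts. Because explicit link representations are often prohibitively expensive, recent work has turned to subgraph-based methods, which achieve state-of-the-art LP performance but are inefficient because of the high redundancy between subgraphs. This paper analyzes the components of subgraph GNN (SGNN) methods for link prediction and, based on that analysis, proposes ELPH (Efficient Link Prediction with Hashing), a full-graph GNN that passes subgraph sketches as messages to approximate the key components of SGNNs without constructing subgraphs explicitly. ELPH is provably more expressive than message passing GNNs (MPNNs) and outperforms existing SGNN models on many standard LP benchmarks while being orders of magnitude faster. It shares the common GNN limitation of being efficient only when the dataset fits in GPU memory, so the paper also develops BUDDY, a highly scalable model that uses feature precomputation to get around this limitation without sacrificing predictive performance. In the experiments BUDDY also outperforms SGNNs on standard LP benchmarks while being highly scalable and faster than ELPH.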

Many Graph Neural Networks (GNNs) perform poorly compared to simple heuristics on Link Prediction (LP) tasks. This is due to limitations in expressive power such as the inability to count triangles (the backbone of most LP heuristics) and because they can not distinguish automorphic nodes (those having identical structural roles). Both expressiveness issues can be alleviated by learning link (rather than node) representations and incorporating structural features such as triangle counts. Since explicit link representations are often prohibitively expensive, recent works resorted to subgraph-based methods, which have achieved state-of-the-art performance for LP, but suffer from poor efficiency due to high levels of redundancy between subgraphs. We analyze the components of subgraph GNN (SGNN) methods for link prediction. Based on our analysis, we propose a novel full-graph GNN called ELPH (Efficient Link Prediction with Hashing) that passes subgraph sketches as messages to approximate the key components of SGNNs without explicit subgraph construction. ELPH is provably more expressive than Message Passing GNNs (MPNNs). It outperforms existing SGNN models on many standard LP benchmarks while being orders of magnitude faster. However, it shares the common GNN limitation that it is only efficient when the dataset fits in GPU memory. Accordingly, we develop a highly scalable model, called BUDDY, which uses feature precomputation to circumvent this limitation without sacrificing predictive performance. Our experiments show that BUDDY also outperforms SGNNs on standard LP benchmarks while being highly scalable and faster than ELPH.

https://arxiv.org/abs/2209.15486
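
To give a feel for what "passing sketches as messages" can mean, the Python sketch below uses MinHash, one standard set-sketching technique, to summarize node neighborhoods. The abstract only says that ELPH uses hashing and subgraph sketches, so the choice of MinHash, the function names, and the toy graph are all assumptions for illustration. Two properties matter: the overlap of two neighborhoods can be estimated from their fixed-size signatures, and the signature of a union is the element-wise minimum of the signatures, so sketches can be merged without revisiting the underlying sets.

```python
import hashlib

def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(items, num_perm=128):
    """MinHash signature: for each seed, the smallest hash value over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_perm)]

def merge(sig_a, sig_b):
    """Signature of the union of two sets, computed from the signatures alone."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical toy graph: neighbor sets of two candidate link endpoints.
neighbors_u = {"a", "b", "c", "d"}
neighbors_v = {"b", "c", "d", "e"}

sig_u = minhash(neighbors_u)
sig_v = minhash(neighbors_v)

exact = len(neighbors_u & neighbors_v) / len(neighbors_u | neighbors_v)
estimate = estimate_jaccard(sig_u, sig_v)
print(f"exact Jaccard: {exact:.2f}, MinHash estimate: {estimate:.2f}")

# Merging signatures matches sketching the union directly.
print(merge(sig_u, sig_v) == minhash(neighbors_u | neighbors_v))
```

The signature length (128 here) trades accuracy for memory; the estimate is noisy for short signatures.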

A few more papers worth noting:

[RO] PyPose: A Library for Robot Learning with Physics-based Optimization

C Wang, D Gao, K Xu…
[CMU & MIT & ETH Zürich & …]
https://arxiv.org/abs/2209.15428

[CV] Implicit Warping for Animation with Image Sets

A Mallya, T Wang, M Liu
[NVIDIA]
https://arxiv.org/abs/2210.01794

[CV] Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods

S Koppula, Y Li, E Shelhamer, A Jaegle, N Parthasarathy, R Arandjelovic, J Carreira, O Hénaff
[DeepMind]
https://arxiv.org/abs/2209.15589

[CL] ThinkSum: Probabilistic reasoning over sets using large language models

B Ozturkler, N Malkin, Z Wang, N Jojic
[Stanford University & Mila & Ohio State University & Microsoft Research]
https://arxiv.org/abs/2210.01293
