LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics
Reposted from 爱可可爱生活
Summary: scaling Vision Transformers to gigapixel images via hierarchical self-supervised learning; torsional diffusion for molecular conformer generation; on the advance of making language models better reasoners; renderable neural codes for improved camera pose estimation; training GANs with diffusion; extreme compression for pre-trained Transformers made simple and efficient; is a modular architecture enough; neural volumetric object selection; blended latent diffusion
1、[CV] Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning
R J. Chen, C Chen, Y Li, T Y. Chen, A D. Trister...
[Harvard & Bill & Melinda Gates Foundation & University of Toronto]
Scaling Vision Transformers to gigapixel images via hierarchical self-supervised learning. Vision Transformers (ViTs) and their multi-scale and hierarchical variants have been successful at capturing image representations, but their use has generally been studied at low resolutions (e.g., 256×256, 384×384). In computational pathology, gigapixel whole-slide images (WSIs) can be as large as 150,000×150,000 pixels at 20× magnification and exhibit a hierarchy of visual tokens across resolutions: from 16×16 images capturing individual cells to 4096×4096 images characterizing interactions within the tissue microenvironment. The paper proposes a new ViT architecture, the Hierarchical Image Pyramid Transformer (HIPT), which leverages the natural hierarchical structure inherent in WSIs and two levels of self-supervised learning to learn high-resolution image representations. HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096×4096 images, and 104M 256×256 images. Benchmarking HIPT representations on 9 slide-level tasks shows that: 1) HIPT with hierarchical pretraining outperforms current state-of-the-art methods for cancer subtyping and survival prediction, and 2) self-supervised ViTs can model important inductive biases about the hierarchical structure of phenotypes in the tumor microenvironment.
Vision Transformers (ViTs) and their multi-scale and hierarchical variations have been successful at capturing image representations but their use has been generally studied for low-resolution images (e.g. 256 × 256, 384 × 384). For gigapixel whole-slide imaging (WSI) in computational pathology, WSIs can be as large as 150000×150000 pixels at 20× magnification and exhibit a hierarchical structure of visual tokens across varying resolutions: from 16 × 16 images capturing individual cells, to 4096×4096 images characterizing interactions within the tissue microenvironment. We introduce a new ViT architecture called the Hierarchical Image Pyramid Transformer (HIPT), which leverages the natural hierarchical structure inherent in WSIs using two levels of self-supervised learning to learn high-resolution image representations. HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096×4096 images, and 104M 256 × 256 images. We benchmark HIPT representations on 9 slide-level tasks, and demonstrate that: 1) HIPT with hierarchical pretraining outperforms current state-of-the-art methods for cancer subtyping and survival prediction, 2) self-supervised ViTs are able to model important inductive biases about the hierarchical structure of phenotypes in the tumor microenvironment.
https://arxiv.org/abs/2206.02647
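To make the two-level aggregation concrete, below is a minimal PyTorch sketch of HIPT-style hierarchical pooling: precomputed embeddings of 256×256 patches are pooled by a region-level Transformer into one token per 4096×4096 region, and a slide-level Transformer pools the region tokens into a single slide representation. The dimensions, depths, and the `TinyTransformer` stand-in are illustrative assumptions, not the paper's exact architecture, and the two levels of self-supervised pretraining are omitted.

```python
# A minimal sketch of HIPT-style hierarchical aggregation (dimensions
# and depths are illustrative assumptions, not the paper's settings).
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Stand-in for a ViT stage: pools a token sequence via a [CLS] token."""
    def __init__(self, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens):            # tokens: (B, N, dim)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return out[:, 0]                  # pooled [CLS] token, (B, dim)

class HierarchicalWSIEncoder(nn.Module):
    """256px patch embeddings -> 4096px region tokens -> slide vector."""
    def __init__(self, patch_dim=384, region_dim=192):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, region_dim)
        self.region_vit = TinyTransformer(region_dim)   # region level
        self.slide_vit = TinyTransformer(region_dim)    # slide level

    def forward(self, patch_embs):
        # patch_embs: (num_regions, 256, patch_dim) -- a 16x16 grid of
        # precomputed 256x256-patch embeddings per 4096x4096 region.
        region_tokens = self.region_vit(self.patch_proj(patch_embs))
        return self.slide_vit(region_tokens.unsqueeze(0))  # (1, region_dim)

enc = HierarchicalWSIEncoder()
slide_vec = enc(torch.randn(12, 256, 384))  # 12 regions from one slide
print(slide_vec.shape)                      # torch.Size([1, 192])
```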
2、[LG] Torsional Diffusion for Molecular Conformer Generation
B Jing, G Corso, J Chang, R Barzilay, T Jaakkola
[MIT & Harvard University]
Torsional diffusion for molecular conformer generation. Molecular conformer generation is a fundamental task in computational chemistry. Several machine learning approaches have been developed, but none have outperformed state-of-the-art cheminformatics methods. The paper proposes torsional diffusion, a novel diffusion framework that operates on the space of torsion angles via a diffusion process on the hypertorus and an extrinsic-to-intrinsic score model. On a standard benchmark of drug-like molecules, torsional diffusion generates superior conformer ensembles compared to machine learning and cheminformatics methods in terms of both RMSD and chemical properties, and is orders of magnitude faster than previous diffusion-based models. Moreover, the model provides exact likelihoods, which are used to build the first generalizable Boltzmann generator.
Molecular conformer generation is a fundamental task in computational chemistry. Several machine learning approaches have been developed, but none have outperformed state-of-the-art cheminformatics methods. We propose torsional diffusion, a novel diffusion framework that operates on the space of torsion angles via a diffusion process on the hypertorus and an extrinsic-to-intrinsic score model. On a standard benchmark of drug-like molecules, torsional diffusion generates superior conformer ensembles compared to machine learning and cheminformatics methods in terms of both RMSD and chemical properties, and is orders of magnitude faster than previous diffusion-based models. Moreover, our model provides exact likelihoods, which we employ to build the first generalizable Boltzmann generator. Code is available at https://github.com/gcorso/torsional-diffusion.
https://arxiv.org/abs/2206.01729
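The core idea, diffusion over torsion angles on a hypertorus, can be illustrated with a small NumPy sketch: noise added to angles is wrapped back to (-π, π], the score of a wrapped Gaussian is computed from a truncated sum over its 2πk periodic images, and samples are drawn by annealed Langevin dynamics from the uniform prior on the torus. The noise schedule and the analytic score (a stand-in for the paper's learned extrinsic-to-intrinsic score model) are illustrative assumptions.

```python
# A minimal sketch of score-based diffusion on torsion angles (the
# hypertorus). The analytic wrapped-Gaussian score stands in for the
# paper's learned score network.
import numpy as np

def wrap(theta):
    """Map angles to (-pi, pi], the canonical torus coordinates."""
    return (theta + np.pi) % (2 * np.pi) - np.pi

def forward_diffuse(torsions, sigma):
    """Add wrapped Gaussian noise to a vector of torsion angles."""
    return wrap(torsions + sigma * np.random.randn(*torsions.shape))

def wrapped_score(theta, sigma, n_terms=5):
    """Score of a wrapped normal centered at 0, via a truncated sum
    over the 2*pi*k periodic images of each angle."""
    ks = np.arange(-n_terms, n_terms + 1)
    shifted = theta[..., None] + 2 * np.pi * ks          # (..., 2n+1)
    w = np.exp(-shifted**2 / (2 * sigma**2))
    return (w * (-shifted / sigma**2)).sum(-1) / (w.sum(-1) + 1e-12)

def reverse_sample(n_torsions, sigmas, steps_per_level=10, lr=0.1):
    """Annealed Langevin dynamics over a decreasing noise ladder."""
    theta = np.random.uniform(-np.pi, np.pi, n_torsions)  # uniform prior
    for sigma in sigmas:                                   # high -> low noise
        step = lr * sigma**2
        for _ in range(steps_per_level):
            g = wrapped_score(theta, sigma)   # stand-in for the learned score
            theta = wrap(theta + step * g
                         + np.sqrt(2 * step) * np.random.randn(n_torsions))
    return theta

sample = reverse_sample(7, sigmas=np.geomspace(np.pi, 0.01, 20))
print(sample)  # 7 torsion angles, concentrated near the score's mode
```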
3、[CL] On the Advance of Making Language Models Better Reasoners
Y Li, Z Lin, S Zhang, Q Fu, B Chen, J Lou, W Chen
[Peking University & Microsoft Corporation]
On the advance of making language models better reasoners. Large language models such as GPT-3 and PaLM have shown remarkable performance in few-shot learning, yet they still struggle with reasoning tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately guide the language model to generate a chain of reasoning steps before producing the final answer, boosting the GSM8K problem-solving rate from 17.9% to 58.1%. The paper proposes a new approach, DIVERSE (Diverse Verifier on Reasoning Step), to further advance reasoning capability. DIVERSE first explores different prompts to enhance the diversity of reasoning paths; second, it introduces a verifier to distinguish good answers from bad ones for better weighted voting; finally, it verifies the correctness of each individual step rather than all the steps as a whole. Extensive experiments with the latest language model, code-davinci-002, demonstrate that DIVERSE achieves new state-of-the-art performance on six of eight reasoning benchmarks (e.g., GSM8K 74.4% → 83.2%), outperforming the 540B-parameter PaLM model.
Large language models such as GPT-3 and PaLM have shown remarkable performance in few-shot learning. However, they still struggle with reasoning tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately guide the language model to generate a chain of reasoning steps before producing the final answer, successfully boosting the GSM8K benchmark from 17.9% to 58.1% in terms of problem solving rate. In this paper, we propose a new approach, DIVERSE (Diverse Verifier on Reasoning Step), to further advance their reasoning capability. DIVERSE first explores different prompts to enhance the diversity in reasoning paths. Second, DIVERSE introduces a verifier to distinguish good answers from bad answers for a better weighted voting. Finally, DIVERSE verifies the correctness of each single step rather than all the steps in a whole. We conduct extensive experiments using the latest language model code-davinci-002 and demonstrate that DIVERSE can achieve new state-of-the-art performance on six out of eight reasoning benchmarks (e.g., GSM8K 74.4% → 83.2%), outperforming the PaLM model with 540B parameters.
https://arxiv.org/abs/2206.02336
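The voting mechanism is straightforward to sketch: sample several reasoning paths, score each with a verifier, and sum verifier scores per final answer rather than counting raw votes. In the sketch below, the hand-written paths and `toy_verifier` (which checks each step's arithmetic, echoing DIVERSE's step-level verification) are hypothetical stand-ins for an LM sampler and a trained verifier.

```python
# A minimal sketch of verifier-weighted voting over sampled reasoning
# paths, in the spirit of DIVERSE. `toy_verifier` is a hypothetical
# placeholder for a trained step-level correctness scorer.
from collections import defaultdict

def weighted_vote(paths, verifier):
    """paths: list of (reasoning_steps, final_answer) candidates.
    verifier: maps a reasoning path to a score in [0, 1]."""
    scores = defaultdict(float)
    for steps, answer in paths:
        scores[answer] += verifier(steps)   # sum verifier mass per answer
    return max(scores, key=scores.get)

# Toy example: three sampled paths for one GSM8K-style question.
paths = [
    (["2*3=6", "6+4=10"], "10"),
    (["2*3=5", "5+4=9"],  "9"),    # bad arithmetic in step 1
    (["3+3=6", "6+4=10"], "10"),
]

def toy_verifier(steps):
    # Stand-in: penalize a path if any step's stated arithmetic is wrong,
    # mirroring DIVERSE's per-step (rather than whole-path) verification.
    ok = all(eval(lhs) == int(rhs)
             for lhs, rhs in (s.split("=") for s in steps))
    return 1.0 if ok else 0.1

print(weighted_vote(paths, toy_verifier))  # -> "10"
```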
4、[CV] Nerfels: Renderable Neural Codes for Improved Camera Pose Estimation
G Avraham, J Straub, T Shen, T Yang, H Germain, C Sweeney...
[Monash University & Ecole des Ponts & Facebook Reality Labs & Facebook AI Research]
Nerfels: renderable neural codes for improved camera pose estimation. The paper presents a framework that combines traditional keypoint-based camera pose optimization with an invertible neural rendering mechanism. The proposed 3D scene representation, Nerfels, is locally dense yet globally sparse. Unlike existing invertible neural rendering systems that overfit a model to the entire scene, a feature-driven approach is adopted that represents scene-agnostic local 3D patches with renderable codes. By modeling the scene only where local features are detected, the framework effectively generalizes to unseen local regions of the scene via an optimizable code-conditioning mechanism in the neural renderer, while maintaining the low memory footprint of a sparse 3D map representation. The model can be incorporated into existing state-of-the-art hand-crafted and learned local-feature pose estimators, yielding improved performance when evaluated on ScanNet in wide-camera-baseline scenarios.
This paper presents a framework that combines traditional keypoint-based camera pose optimization with an invertible neural rendering mechanism. Our proposed 3D scene representation, Nerfels, is locally dense yet globally sparse. As opposed to existing invertible neural rendering systems which overfit a model to the entire scene, we adopt a feature-driven approach for representing scene-agnostic, local 3D patches with renderable codes. By modelling a scene only where local features are detected, our framework effectively generalizes to unseen local regions in the scene via an optimizable code conditioning mechanism in the neural renderer, all while maintaining the low memory footprint of a sparse 3D map representation. Our model can be incorporated into existing state-of-the-art hand-crafted and learned local feature pose estimators, yielding improved performance when evaluating on ScanNet for wide camera baseline scenarios.
https://arxiv.org/abs/2206.01916
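A rough sketch of the code-conditioning idea, under loose assumptions: each detected keypoint carries an optimizable latent code, a shared MLP decodes camera-frame query points together with their codes into appearance, and a photometric loss ties the rendering to the pose (R, t), so gradients flow to pose, codes, and decoder alike. The network sizes, RGB output, and raw (R, t) parameterization are illustrative, not the paper's renderer.

```python
# A minimal sketch of a code-conditioned local renderer in the spirit
# of Nerfels. All dimensions and the pose parameterization are
# illustrative assumptions; a real system would use SE(3) parameters
# and sample an offset grid per patch rather than only its center.
import torch
import torch.nn as nn

class LocalPatchRenderer(nn.Module):
    """Shared decoder for scene-agnostic local 3D patches: maps a
    camera-frame query point plus a per-keypoint code to RGB."""
    def __init__(self, code_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x_cam, codes):            # (N,3), (N,code_dim)
        return torch.sigmoid(self.mlp(torch.cat([x_cam, codes], dim=-1)))

def photometric_loss(renderer, points_w, codes, R, t, observed_rgb):
    """Render keypoint patches under pose (R, t) and compare to the
    image; differentiable w.r.t. pose, codes, and the shared decoder."""
    points_c = points_w @ R.T + t               # world -> camera frame
    rendered = renderer(points_c, codes)
    return ((rendered - observed_rgb) ** 2).mean()

renderer = LocalPatchRenderer()
N = 50
R = torch.eye(3, requires_grad=True)            # toy pose parameters
t = torch.zeros(3, requires_grad=True)
loss = photometric_loss(renderer, torch.randn(N, 3), torch.randn(N, 32),
                        R, t, torch.rand(N, 3))
loss.backward()
print(loss.item(), t.grad.norm().item())        # pose receives gradients
```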
5、[LG] Diffusion-GAN: Training GANs with Diffusion
Z Wang, H Zheng, P He, W Chen, M Zhou
[The University of Texas at Austin & Microsoft Azure AI]
Diffusion-GAN: training GANs with diffusion. For stable training of generative adversarial networks (GANs), injecting instance noise into the discriminator's input is considered a theoretically sound solution that has nevertheless not yet delivered on its promise in practice. The paper proposes Diffusion-GAN, which employs a Gaussian mixture distribution, defined over all steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, diffused from observed or generated data, is fed to the discriminator as input. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step. Theoretical analysis verifies the soundness of Diffusion-GAN, which provides model- and domain-agnostic differentiable augmentation. Experiments on diverse datasets show that Diffusion-GAN delivers stable and data-efficient GAN training, with consistent performance improvements over strong GAN baselines for synthesizing photorealistic images.
For stable training of generative adversarial networks (GANs), injecting instance noise into the input of the discriminator is considered as a theoretically sound solution, which, however, has not yet delivered on its promise in practice. This paper introduces Diffusion-GAN that employs a Gaussian mixture distribution, defined over all the diffusion steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, which is diffused from observed or generated data, is fed as the input to the discriminator. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step. Theoretical analysis verifies the soundness of the proposed Diffusion-GAN, which provides model- and domain-agnostic differentiable augmentation. A rich set of experiments on diverse datasets show that Diffusion-GAN can provide stable and data-efficient GAN training, bringing consistent performance improvement over strong GAN baselines for synthesizing photorealistic images.
https://arxiv.org/abs/2206.02262
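The training loop is easy to sketch: draw a random timestep, push both real and generated samples through the reparameterized forward diffusion (so generator gradients flow through it), and feed the noised samples to a timestep-conditioned discriminator. In the sketch below, the toy networks and the fixed noise schedule are placeholder assumptions, and the paper's adaptive adjustment of the chain length is omitted.

```python
# A minimal sketch of Diffusion-GAN-style instance noise with a
# non-saturating GAN loss. Networks and the noise schedule are toy
# placeholders; the adaptive chain length T is kept fixed here.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_max = 50
betas = torch.linspace(1e-4, 2e-2, T_max)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha_bar_t

def diffuse(x, t):
    """Forward diffusion q(y | x, t), reparameterized so gradients
    flow back through x (and hence to the generator)."""
    ab = alpha_bars[t].view(-1, *([1] * (x.dim() - 1)))
    return ab.sqrt() * x + (1 - ab).sqrt() * torch.randn_like(x)

class TimestepDiscriminator(nn.Module):
    """Discriminator conditioned on the diffusion timestep t."""
    def __init__(self, data_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(data_dim + 1, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, y, t):
        tt = (t.float() / T_max).unsqueeze(-1)   # scalar timestep embedding
        return self.net(torch.cat([y, tt], dim=-1)).squeeze(-1)

def d_loss(D, G, x_real, z):
    t = torch.randint(0, T_max, (x_real.size(0),))
    y_real = diffuse(x_real, t)
    y_fake = diffuse(G(z).detach(), t)           # detach G for the D step
    return (F.softplus(-D(y_real, t)) + F.softplus(D(y_fake, t))).mean()

def g_loss(D, G, z):
    t = torch.randint(0, T_max, (z.size(0),))
    y_fake = diffuse(G(z), t)                    # grads pass through diffusion
    return F.softplus(-D(y_fake, t)).mean()

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = TimestepDiscriminator()
x_real, z = torch.randn(16, 2), torch.randn(16, 8)
print(d_loss(D, G, x_real, z).item(), g_loss(D, G, z).item())
```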
Several other papers worth noting:
[CL] Extreme Compression for Pre-trained Transformers Made Simple and Efficient
Making extreme compression of pre-trained Transformers simple and efficient
X Wu, Z Yao, M Zhang, C Li, Y He
[Microsoft]
https://arxiv.org/abs/2206.01859
[LG] Is a Modular Architecture Enough?
Is a modular architecture enough?
S Mittal, Y Bengio, G Lajoie
[Université de Montréal]
https://arxiv.org/abs/2206.02713
[CV] Neural Volumetric Object Selection
Neural volumetric object selection
Z Ren, A Agarwala, B Russell, A G. Schwing, O Wang
[University of Illinois at Urbana-Champaign & Adobe Research]
https://arxiv.org/abs/2205.14929
[CV] Blended Latent Diffusion
Blended latent diffusion
O Avrahami, O Fried, D Lischinski
[The Hebrew University of Jerusalem & Reichman University]
https://arxiv.org/abs/2206.02779