LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics   GR - Graphics




1. [CV] Scaling Language-Image Pre-training via Masking

Y Li, H Fan, R Hu, C Feichtenhofer, K He
[Meta AI]




We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
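The masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the patch count (196, as for a 14x14 grid), the 50% mask ratio, the toy 4-dimensional patch embeddings, and the helper names `random_patch_mask` and `drop_masked_patches` are all assumptions chosen for illustration. The point is only that masked patches are removed before encoding, so the encoder sees a shorter sequence.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return the sorted indices of the patches that stay visible.

    Hypothetical helper: a random subset of patches is kept and the
    rest are dropped entirely (not replaced by a mask token).
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


def drop_masked_patches(patches, keep_indices):
    """Keep only the visible patches, preserving their original order."""
    return [patches[i] for i in keep_indices]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Assumed toy input: 196 patches, each a 4-dimensional embedding.
patches = [[float(i)] * 4 for i in range(196)]
keep = random_patch_mask(len(patches), mask_ratio=0.5, rng=rng)
visible = drop_masked_patches(patches, keep)
print(len(patches), "->", len(visible))  # 196 -> 98
```

With half of the patches removed, the image encoder processes roughly half as many tokens per image, which is where the time and memory savings mentioned in the abstract would come from.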



2. [LG] A Self-Attention Ansatz for Ab-initio Quantum Chemistry

I v Glehn, J S. Spencer, D Pfau


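Summary: The Wavefunction Transformer (Psiformer) is a new neural network architecture that uses self-attention and can serve as an approximation (or Ansatz) for solving the many-electron Schrödinger equation, the fundamental equation of quantum chemistry and materials science, which can be solved from first principles without external training data. The Psiformer can be used as a replacement for other neural networks and often dramatically improves the accuracy of the calculations. On larger molecules, the ground-state energy can be improved by dozens of kcal/mol, a large improvement over previous methods. These results show that self-attention networks can learn complex quantum mechanical correlations between electrons and are a promising route to unprecedented accuracy in chemical calculations on larger systems. Ab-initio quantum chemistry is the branch of quantum chemistry that focuses on computing molecular and material properties from first principles, without any empirical input. It relies on the Schrödinger equation to describe the behavior of electrons and nuclei in a molecule or material. Ab-initio quantum chemistry is used to accurately predict properties such as energies, forces, and vibrations, to study quantum effects such as entanglement and superposition, and to understand molecular structure and reactivity.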


We present a novel neural network architecture using self-attention, the Wavefunction Transformer (Psiformer), which can be used as an approximation (or Ansatz) for solving the many-electron Schrödinger equation, the fundamental equation for quantum chemistry and material science. This equation can be solved from first principles, requiring no external training data. In recent years, deep neural networks like the FermiNet and PauliNet have been used to significantly improve the accuracy of these first-principle calculations, but they lack an attention-like mechanism for gating interactions between electrons. Here we show that the Psiformer can be used as a drop-in replacement for these other neural networks, often dramatically improving the accuracy of the calculations. On larger molecules especially, the ground state energy can be improved by dozens of kcal/mol, a qualitative leap over previous methods. This demonstrates that self-attention networks can learn complex quantum mechanical correlations between electrons, and are a promising route to reaching unprecedented accuracy in chemical calculations on larger systems.
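To make the idea of self-attention between electrons concrete, here is a minimal sketch of single-head dot-product self-attention over per-electron feature vectors. It is not the Psiformer architecture: the identity query/key/value projections (no learned weights), the three toy electrons, and their 3-dimensional features are assumptions made purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Single-head dot-product self-attention with identity projections.

    features: one feature vector (list of floats) per electron.
    Each output row is a weighted average of all input rows, with
    weights given by a softmax over scaled dot products.
    """
    d = len(features[0])
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in features]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, features))
                    for j in range(d)])
    return out


# Assumed toy input: three electrons with 3-dimensional features.
electrons = [[0.1, 0.0, -0.3], [1.2, 0.5, 0.0], [-0.7, 0.9, 0.4]]
mixed = self_attention(electrons)
```

Because no positional information is used, swapping two electrons in the input simply swaps the corresponding rows of the output, so the layer treats electrons as an unordered set.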



3. [CV] Testing GLOM's ability to infer wholes from ambiguous parts

L Culp, S Sabour, G E. Hinton
[Google AI]




The GLOM architecture proposed by Hinton [2021] is a recurrent neural network for parsing an image into a hierarchy of wholes and parts. When a part is ambiguous, GLOM assumes that the ambiguity can be resolved by allowing the part to make multi-modal predictions for the pose and identity of the whole to which it belongs and then using attention to similar predictions coming from other possibly ambiguous parts to settle on a common mode that is predicted by several different parts. In this study, we describe a highly simplified version of GLOM that allows us to assess the effectiveness of this way of dealing with ambiguity. Our results show that, with supervised training, GLOM is able to successfully form islands of very similar embedding vectors for all of the locations occupied by the same object and it is also robust to strong noise injections in the input and to out-of-distribution input transformations.



4. [LG] Improved Representation of Asymmetrical Distances with Interval Quasimetric Embeddings

T Wang, P Isola




Asymmetrical distance structures (quasimetrics) are ubiquitous in our lives and are gaining more attention in machine learning applications. Imposing such quasimetric structures in model representations has been shown to improve many tasks, including reinforcement learning (RL) and causal relation learning. In this work, we present four desirable properties in such quasimetric models, and show how prior works fail at them. We propose Interval Quasimetric Embedding (IQE), which is designed to satisfy all four criteria. On three quasimetric learning experiments, IQEs show strong approximation and generalization abilities, leading to better performance and improved efficiency over prior methods.
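The abstract does not spell out how the distance is computed, so the following is only a sketch of one interval-based construction consistent with the name, stated here as an assumption rather than as the paper's definition. Each embedding is split into components; for every component, the pairs of coordinates define intervals [u, max(u, v)], and the distance is the sum over components of the length of the union of those intervals. The function names and the toy embeddings are hypothetical.

```python
def union_length(intervals):
    """Total length covered by a list of closed intervals (lo, hi)."""
    total = 0.0
    cur_lo = cur_hi = None
    for lo, hi in sorted(intervals):
        if cur_hi is None or lo > cur_hi:
            # Start a new run; bank the length of the previous one.
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping or touching interval: extend the current run.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


def interval_quasimetric(u, v):
    """Assumed interval-based asymmetric distance between embeddings.

    u, v: lists of components, each component a list of floats.
    """
    return sum(
        union_length([(a, max(a, b)) for a, b in zip(u_c, v_c)])
        for u_c, v_c in zip(u, v)
    )


# Toy embeddings with two components of two coordinates each.
u = [[0.0, 2.0], [1.0, 1.0]]
v = [[1.0, 3.0], [0.0, 0.5]]
print(interval_quasimetric(u, v))  # 2.0
print(interval_quasimetric(v, u))  # 1.0
```

The two printed values differ, which is the asymmetry a quasimetric allows, while the distance from any point to itself is zero and the triangle inequality still holds for this construction.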



5. [CV] Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances

J Krantz, S Lee, J Malik, D Batra, D S Chaplot
[Meta AI & Oregon State University]



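Abstract: This paper considers the problem of embodied visual navigation given an image goal (ImageNav), in which an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image. Unlike related navigation tasks, ImageNav has no standardized task definition, which makes different methods hard to compare. In addition, existing formulations have two problems: (1) image goals are sampled from random locations, which can lead to ambiguity (e.g., looking at a wall), and (2) image goals match the camera specification and embodiment of the agent; this rigidity is limiting when considering user-driven downstream applications. The paper proposes the instance-specific image navigation task (InstanceImageNav) to address these limitations. Specifically, the goal image is 'focused' on a particular object instance in the scene and is taken with camera parameters independent of the agent. InstanceImageNav is instantiated in the Habitat simulator using scenes from the Habitat-Matterport3D dataset (HM3D), and a standardized benchmark is released to measure community progress.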

We consider the problem of embodied visual navigation given an image-goal (ImageNav) where an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image. Unlike related navigation tasks, ImageNav does not have a standardized task definition, which makes comparison across methods difficult. Further, existing formulations have two problematic properties: (1) image-goals are sampled from random locations, which can lead to ambiguity (e.g., looking at walls), and (2) image-goals match the camera specification and embodiment of the agent; this rigidity is limiting when considering user-driven downstream applications. We present the Instance-specific ImageNav task (InstanceImageNav) to address these limitations. Specifically, the goal image is 'focused' on some particular object instance in the scene and is taken with camera parameters independent of the agent. We instantiate InstanceImageNav in the Habitat Simulator using scenes from the Habitat-Matterport3D dataset (HM3D) and release a standardized benchmark to measure community progress.





[CV] If your data distribution shifts, use self-learning

E Rusak, S Schneider, G Pachitariu, L Eck, P Gehler, O Bringmann, W Brendel, M Bethge
[University of Tübingen & University of Oxford]





[CV] Mixed Neural Voxels for Fast Multi-view Video Synthesis

F Wang, S Tan, X Li, Z Tian, H Liu
[Tsinghua University & Hong Kong University of Science and Technology]





[CV] Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model

Y Wang, J Yu, J Zhang
[Peking University]





[LG] Linear Causal Disentanglement via Interventions

A Seigal, C Squires, C Uhler
[Broad Institute of MIT and Harvard]






