LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics
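Summary: open-vocabulary queryable scene representations for real-world planning; understanding and extending Subgraph GNNs by rethinking their symmetries; relaxed attention for Transformer models; extremely simple activation shaping for out-of-distribution detection; energy-saving attention with linear complexity; zero-shot text-driven HDR panorama generation; end-to-end RGB-D SLAM with neural implicit mapping and deep feature tracking; a generic minimax optimal learner and characterization; self-supervised learning with an information maximization criterion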
1. [RO] Open-vocabulary Queryable Scene Representations for Real World Planning
B Chen, F Xia, B Ichter, K Rao, K Gopalakrishnan, M S. Ryoo, A Stone, D Kappler
[MIT & Robotics at Google]
Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation to address this problem. NLMap serves as a framework to gather and integrate contextual information into LLM planners, allowing them to see and query available objects in the scene before generating a context-conditioned plan. NLMap first establishes a natural language queryable scene representation with Visual Language models (VLMs). An LLM-based object proposal module parses instructions and proposes involved objects to query the scene representation for object availability and location. An LLM planner then plans with such information about the scene. NLMap allows robots to operate without a fixed list of objects or executable options, enabling real robot operation unachievable by previous methods.
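The query step described above can be illustrated with a toy sketch. It assumes that each region of the scene is stored as a position plus a feature vector, and that a proposed object name is embedded in the same space, which is the role a VLM would play. The 3-dimensional vectors, the `query_scene` helper and the 0.8 threshold are all hypothetical stand-ins, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def query_scene(scene, text_embedding, threshold=0.8):
    """Return (position, score) for stored regions matching the query, best first."""
    hits = [(pos, cosine(feat, text_embedding)) for pos, feat in scene]
    hits = [(pos, score) for pos, score in hits if score >= threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical scene: an (x, y) position and a toy feature vector per region.
scene = [((1.0, 2.0), [0.9, 0.1, 0.0]),
         ((4.0, 0.5), [0.0, 1.0, 0.1]),
         ((2.5, 3.0), [0.8, 0.3, 0.1])]

# Toy embeddings standing in for VLM text features of proposed object names.
proposed = {"sponge": [1.0, 0.2, 0.0], "apple": [0.0, 0.0, 1.0]}

# Availability and location per proposed object; an empty list means "not found".
available = {name: query_scene(scene, emb) for name, emb in proposed.items()}
```

In this sketch, `available` is what a planner would be conditioned on: "sponge" resolves to two candidate locations, while "apple" resolves to none and could be dropped from the plan.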
2. [LG] Understanding and Extending Subgraph GNNs by Rethinking Their Symmetries
F Frasca, B Bevilacqua, M M. Bronstein, H Maron
[Imperial College London & Purdue University & University of Oxford & NVIDIA Research]
Subgraph GNNs are a recent class of expressive Graph Neural Networks (GNNs) which model graphs as collections of subgraphs. So far, the design space of possible Subgraph GNN architectures as well as their basic theoretical properties are still largely unexplored. In this paper, we study the most prominent form of subgraph methods, which employs node-based subgraph selection policies such as ego-networks or node marking and deletion. We address two central questions: (1) What is the upper bound of the expressive power of these methods? and (2) What is the family of equivariant message passing layers on these sets of subgraphs? Our first step in answering these questions is a novel symmetry analysis which shows that modelling the symmetries of node-based subgraph collections requires a significantly smaller symmetry group than the one adopted in previous works. This analysis is then used to establish a link between Subgraph GNNs and Invariant Graph Networks (IGNs). We answer the questions above by first bounding the expressive power of subgraph methods by 3-WL, and then proposing a general family of message-passing layers for subgraph methods that generalises all previous node-based Subgraph GNNs. Finally, we design a novel Subgraph GNN dubbed SUN, which theoretically unifies previous architectures while providing better empirical performance on multiple benchmarks.
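The node-based selection policies named in the abstract can be sketched in a few lines. The example below is a minimal illustration, assuming an undirected graph given as an adjacency list; the function names and the dictionary layout are hypothetical. Because such policies produce one subgraph per node, every feature in the collection is indexed by a (root, node) pair.

```python
from collections import deque

def node_marking(adj):
    """One subgraph per node: same edges, with a binary mark on the root node."""
    nodes = sorted(adj)
    return {root: {"adj": adj, "mark": {v: int(v == root) for v in nodes}}
            for root in nodes}

def ego_network(adj, root, hops):
    """Induced subgraph on the nodes within `hops` steps of `root` (BFS)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Keep only edges whose endpoints both lie inside the ego network.
    return {u: [v for v in adj[u] if v in dist] for u in dist}

# Path graph 0-1-2-3 as an adjacency list.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
bag = node_marking(path)                  # 4 marked copies of the graph
ego = ego_network(path, root=0, hops=1)   # {0: [1], 1: [0]}
```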
3. [LG] Relaxed Attention for Transformer Models
T Lohrenz, B Möller, Z Li, T Fingscheidt
[Technische Universität Braunschweig]
The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder, complicating the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks with clear improvement in combination with recent benchmark approaches. Specifically, we exceed the former state-of-the-art performance of 26.90% word error rate on the largest public lip-reading LRS3 benchmark with a word error rate of 26.31%, and we achieve a top-performing BLEU score of 37.67 on the IWSLT14 (DE→EN) machine translation task without external language models and virtually no additional model parameters. Code and models will be made publicly available.
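The abstract describes relaxed attention only as a smoothing of the attention weights. One simple way to smooth a distribution, assumed here for illustration, is to blend it with a uniform distribution over the same T positions: each weight a_i becomes (1 - gamma) * a_i + gamma / T. The coefficient name `gamma` and its value are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relax(weights, gamma):
    """Blend attention weights with a uniform distribution over the same positions."""
    t = len(weights)
    return [(1.0 - gamma) * w + gamma / t for w in weights]

weights = softmax([4.0, 1.0, 0.0, -1.0])  # sharply peaked attention row
relaxed = relax(weights, gamma=0.1)       # still sums to 1, but flatter
```

The blend is a convex combination of two distributions, so each relaxed row remains a valid distribution; the peak is lowered and every position receives at least gamma / T.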
4. [LG] Extremely Simple Activation Shaping for Out-of-Distribution Detection
A Djurisic, N Bozanic, A Ashok, R Liu
[ML Collective]
The separation between training and deployment of machine learning models implies that not all scenarios encountered in deployment can be anticipated during training, and therefore relying solely on advancements in training has its limits. Out-of-distribution (OOD) detection is an important area that stress-tests a model’s ability to handle unseen situations: Do models know when they don’t know? Existing OOD detection methods either incur extra training steps, require additional data, or make nontrivial modifications to the trained network. In contrast, in this work, we propose an extremely simple, post-hoc, on-the-fly activation shaping method, ASH, where a large portion (e.g. 90%) of a sample’s activation at a late layer is removed, and the rest (e.g. 10%) simplified or lightly adjusted. The shaping is applied at inference time, and does not require any statistics calculated from training data. Experiments show that such a simple treatment enhances in-distribution and out-of-distribution sample distinction so as to allow state-of-the-art OOD detection on ImageNet, and does not noticeably deteriorate the in-distribution accuracy. We release alongside the paper two calls for explanation and validation, encouraging collective participation to further validate and understand the discovery. Calls, video and code can be found at: https://andrijazz.github.io/ash.
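The shaping step can be sketched for a single sample as follows. This is a minimal illustration of the two operations the abstract mentions, removing most of the activation and optionally simplifying what remains. The function name, the parameters, and the specific simplification rule (one shared value that preserves the total activation) are assumptions, not necessarily the authors' exact variants.

```python
def shape_activation(acts, prune_fraction=0.9, simplify=False):
    """Zero the smallest `prune_fraction` of one sample's activations.

    If `simplify` is set, the surviving entries are replaced by one shared
    value chosen so that the total activation is unchanged (an assumption).
    """
    n = len(acts)
    keep = max(1, n - int(round(prune_fraction * n)))
    # Indices of the `keep` largest activations for this sample only;
    # no statistics from training data are involved.
    top = set(sorted(range(n), key=lambda i: acts[i], reverse=True)[:keep])
    if simplify:
        shared = sum(acts) / keep
        return [shared if i in top else 0.0 for i in range(n)]
    return [a if i in top else 0.0 for i, a in enumerate(acts)]

# Toy late-layer activation vector for one input.
acts = [0.1, 2.0, 0.0, 0.3, 5.0, 0.2, 0.0, 0.4, 0.1, 1.0]
pruned = shape_activation(acts, prune_fraction=0.8)
simplified = shape_activation(acts, prune_fraction=0.8, simplify=True)
```

The shaped vector would then be passed through the remaining layers as usual, with an OOD score computed from the resulting output.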
5. [CV] EcoFormer: Energy-Saving Attention with Linear Complexity
J Liu, Z Pan, H He, J Cai, B Zhuang
[Monash University]
Transformer is a transformative framework that models sequential data and has achieved remarkable performance on a wide range of tasks, but with high computational and energy cost. To improve its efficiency, a popular choice is to compress the models via binarization, which constrains floating-point values to binary ones and significantly reduces resource consumption owing to cheap bitwise operations. However, existing binarization methods only aim at minimizing the information loss for the input distribution statistically, while ignoring the pairwise similarity modeling at the core of the attention mechanism. To this end, we propose a new binarization paradigm customized to high-dimensional softmax attention via kernelized hashing, called EcoFormer, to map the original queries and keys into low-dimensional binary codes in Hamming space. The kernelized hash functions are learned to match the ground-truth similarity relations extracted from the attention map in a self-supervised way. Based on the equivalence between the inner product of binary codes and the Hamming distance as well as the associative property of matrix multiplication, we can approximate the attention in linear complexity by expressing it as a dot-product of binary codes. Moreover, the compact binary representations of queries and keys enable us to replace most of the expensive multiply-accumulate operations in attention with simple accumulations to save considerable on-chip energy footprint on edge devices. Extensive experiments on both vision and language tasks show that EcoFormer consistently achieves comparable performance with standard attention while consuming far fewer resources. For example, based on PVTv2-B0 and ImageNet-1K, EcoFormer achieves a 73% energy footprint reduction with only a 0.33% performance drop compared to the standard attention. Code is available at this https URL.
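The two facts the abstract relies on can be checked on a small example. For codes with entries in {-1, +1} and b bits, matching bits contribute +1 and differing bits contribute -1 to the inner product, so the inner product equals b - 2 * Hamming distance. Associativity then allows (Q K^T) V to be computed as Q (K^T V), which avoids the N x N score matrix. The sketch below uses hand-written codes rather than learned hash functions and omits attention normalization; all values are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two codes differ."""
    return sum(x != y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[dot(row, col) for col in zip(*B)] for row in A]

# Hypothetical 4-bit codes for 3 queries and 3 keys, entries in {-1, +1}.
Q = [[1, -1, 1, 1], [-1, -1, 1, -1], [1, 1, 1, 1]]
K = [[1, 1, 1, -1], [-1, -1, -1, -1], [1, -1, 1, 1]]
V = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

bits = len(Q[0])
# Inner product and Hamming distance carry the same information.
identity_holds = all(dot(q, k) == bits - 2 * hamming(q, k) for q in Q for k in K)

quadratic = matmul(matmul(Q, transpose(K)), V)  # builds the N x N score matrix
linear = matmul(Q, matmul(transpose(K), V))     # builds only a bits x d summary
```

With N tokens, b bits and value dimension d, the second ordering costs on the order of N * b * d operations instead of N * N * (b + d), which is where the linear complexity in sequence length comes from.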
[CV] Text2Light: Zero-Shot Text-Driven HDR Panorama Generation
Z Chen, G Wang, Z Liu
[Nanyang Technological University]
[RO] iDF-SLAM: End-to-End RGB-D SLAM with Neural Implicit Mapping and Deep Feature Tracking
Y Ming, W Ye, A Calway
[University of Bristol & Zhejiang University]
[LG] Adversarially Robust Learning: A Generic Minimax Optimal Learner and Characterization
O Montasser, S Hanneke, N Srebro
[Toyota Technological Institute at Chicago & Purdue University] https://arxiv.org/abs/2209.07369
[LG] Self-Supervised Learning with an Information Maximization Criterion
S Ozsoy, S Hamdan, S Ö. Arik, D Yuret, A T. Erdogan
[Koc University & Google Cloud AI Research] https://arxiv.org/abs/2209.07999