LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: automatic answering and generation of machine learning final exam questions; a generalizable attention-based NeRF transformer architecture; a universal approach to fast and stable neural network training; real-time target sound extraction; a general-purpose neural architecture for geospatial systems; redefining video matting for re-composition tasks; music mixing style transfer; BERT for long documents; injecting invisible backdoors into text-guided image generation models

 

1. [LG] Automatically Answering and Generating Machine Learning Final Exams

Automatically answering and generating machine learning final exams. Can a machine learn machine learning? The paper proposes answering this with the same criterion used for the analogous question: can a human learn machine learning? It automatically answers final exams from MIT's recent large machine learning course and generates new exam questions at a human level. Program synthesis and few-shot learning have recently solved university-level problem-set questions in mathematics and STEM courses at a human level; final-exam questions differ in several ways: they are longer, multi-part, more complex, and span a broader set of topics. The paper contributes a new dataset and benchmark of machine learning final-exam questions, plus code for automatically answering these questions and generating new ones. To make the dataset a reproducible benchmark, automatic checkers are used for multiple-choice questions, questions with numeric answers, and questions with expression answers; a large open language model, Meta's OPT, is evaluated and compared against OpenAI's GPT-3 and Codex. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written ones finds them indistinguishable across multiple aspects and suitable for final exams. Ablation studies on a range of machine learning questions compare zero-shot versus few-shot learning, chain-of-thought prompting, text-pretrained GPT-3 and OPT, and code-fine-tuned Codex, finding that few-shot learning methods perform best.

Can a machine learn machine learning? We propose to answer this question using the same criteria we use to answer a similar question: can a human learn machine learning? We automatically answer final exams in MIT's recent large machine learning course and generate new questions at a human level. Recently, program synthesis and few-shot learning solved university-level problem set questions in mathematics and STEM courses at a human level. In this work, we solve questions from final exams that differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We provide a new dataset and benchmark of questions from machine learning final exams and code for automatically answering these questions and generating new questions. To make our dataset a reproducible benchmark, we use automatic checkers for multiple choice questions, questions with numeric answers, and questions with expression answers, and evaluate a large free language model, Meta's OPT, and compare the results with OpenAI's GPT-3 and Codex. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning, chain-of-thought prompting, GPT-3 and OPT pre-trained on text, and Codex fine-tuned on code on a range of machine learning topics and find that few-shot learning methods perform best. We make our data and code publicly available for the machine learning community.

https://openreview.net/pdf?id=MT1Pcdo8sGG
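
To make the benchmark reproducible, the paper relies on automatic checkers for multiple-choice, numeric-answer, and expression-answer questions. Below is a minimal sketch of what such checkers might look like; the function names, tolerance, and use of SymPy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical answer checkers in the spirit of the paper's benchmark;
# names and details are assumptions for illustration only.
from sympy import simplify, sympify

def check_numeric(predicted: str, reference: str, tol: float = 1e-6) -> bool:
    """Accept a numeric answer if it is within an absolute tolerance."""
    try:
        return abs(float(predicted) - float(reference)) <= tol
    except ValueError:
        return False

def check_expression(predicted: str, reference: str) -> bool:
    """Accept a symbolic answer if the difference of the two expressions
    simplifies to zero, i.e. they are algebraically equivalent."""
    try:
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except Exception:
        return False

print(check_numeric("0.5", "0.5000000001"))        # True
print(check_expression("(x+1)**2", "x**2+2*x+1"))  # True
print(check_expression("x**2", "x**3"))            # False
```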

 

2. [CV] Is Attention All That NeRF Needs?

A generalizable attention-based NeRF transformer architecture. The paper presents Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. Whereas prior NeRF work optimizes a scene representation by inverting a handcrafted rendering equation, GNT achieves scene-generalizable neural representation and rendering with transformers at two stages. (1) A view transformer uses multi-view geometry as an inductive bias for attention-based scene representation and predicts coordinate-aligned features by aggregating information along epipolar lines in neighboring views. (2) A ray transformer renders novel views by using attention to decode the view transformer's features at the points sampled during ray marching. Experiments show that, optimized on a single scene, GNT successfully reconstructs a NeRF without an explicit rendering formula, thanks to the learned ray renderer; trained on multiple scenes, it consistently achieves state-of-the-art performance when transferring to unseen scenes, outperforming all other methods by roughly 10% on average. Analysis of the learned attention maps for inferring depth and occlusion suggests that attention learns a physically grounded form of rendering.

We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalizes across scenes using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views using attention to decode the features from the view transformer along the sampled points during ray marching. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula due to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes and outperforms all other methods by ~10% on average. Our analysis of the learned attention maps to infer depth and occlusion indicates that attention enables learning a physically-grounded rendering. Our results show the promise of transformers as a universal modelling tool for graphics.

https://openreview.net/forum?id=qpeAhwxTopw
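
As a concrete picture of stage (2), the sketch below shows a toy ray transformer in PyTorch: self-attention over the features of the points sampled along a ray stands in for the hand-crafted volume-rendering integral. Dimensions, pooling, and module choices are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyRayTransformer(nn.Module):
    """Toy stand-in for GNT's ray transformer: attend across the samples
    of each ray, then pool to a per-ray RGB. Illustrative only."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (num_rays, num_samples, dim), i.e. the
        # coordinate-aligned features a view transformer would produce.
        h = self.attn(point_feats)         # mix information along the ray
        return self.to_rgb(h.mean(dim=1))  # pool samples into per-ray RGB

rgb = ToyRayTransformer()(torch.randn(8, 64, 64))  # 8 rays, 64 samples each
print(rgb.shape)  # torch.Size([8, 3])
```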

 

3. [LG] Identical Initialization: A Universal Approach to Fast and Stable Training of Neural Networks

Identical Initialization: a universal approach to fast and stable training of neural networks. A well-conditioned initialization benefits deep network training, but existing schemes are not simultaneously robust and universal. The widely used Xavier and Kaiming initializations fit a variety of networks, yet they fail to train residual networks without Batch Normalization because they compute an inappropriate scale for the data flow. Conversely, stable initializations built on dynamical isometry, an efficient learning mechanism (e.g., Fixup and ReZero), are designed specifically for non-residual structures or residual blocks only, and some require extra auxiliary components, limiting their applicability. Intriguingly, the identity matrix turns out to be a feasible and universal solution to both problems, since it satisfies dynamical isometry while remaining applicable to a wide range of models. Motivated by this, the paper develops Identical Initialization (IDInit), a robust, universal, and fast-converging method based on the identity matrix. Empirical results on a variety of benchmarks show that IDInit generalizes across network types and is practically useful, with good performance and fast convergence.

A well-conditioned initialization is beneficial for training deep neural networks. However, existing initialization approaches do not simultaneously show robustness and universality. Specifically, even though the widely-used Xavier and Kaiming initialization approaches can generally fit a variety of networks, they fail to train residual networks without Batch Normalization, as they calculate an inappropriate scale on the data-flow. On the other hand, some literature designs stable initializations (e.g., Fixup and ReZero) based on dynamical isometry, an efficient learning mechanism. Nonetheless, these methods are specifically designed for either a non-residual structure or a residual block only, and some even include extra auxiliary components, limiting their applicable range. Intriguingly, we find that the identity matrix is a feasible and universal solution to the aforementioned problems, as it adheres to dynamical isometry while remaining applicable to a wide range of models. Motivated by this, we develop Identical Initialization (IDInit), a sufficiently robust, universal, and fast-converging approach based on the identity matrix. Empirical results on a variety of benchmarks show that IDInit is universal to various network types, and practically useful with good performance and fast convergence.

https://openreview.net/forum?id=qpeAhwxTopw
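
The core trick is easy to see in code: a layer whose weight is the identity starts out as the identity map, so activations pass through with their norms preserved (the dynamical-isometry property). The sketch below illustrates that idea for a square linear layer; IDInit's actual scheme for non-square and convolutional weights differs, so treat this as a simplified illustration rather than the paper's method.

```python
import torch
import torch.nn as nn

def identity_init_(linear: nn.Linear) -> None:
    """Simplified identity-style initialization (illustrative, not
    IDInit's exact scheme): zero the weight, place an identity block in
    the top-left corner, and zero the bias."""
    with torch.no_grad():
        linear.weight.zero_()
        n = min(linear.weight.shape)
        linear.weight[:n, :n] = torch.eye(n)
        if linear.bias is not None:
            linear.bias.zero_()

layer = nn.Linear(128, 128)
identity_init_(layer)
x = torch.randn(4, 128)
# The freshly initialized layer is exactly the identity map.
print(torch.allclose(layer(x), x))  # True
```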

 

4. [AS] Real-Time Target Sound Extraction

B Veluri, J Chan, M Itani, T Chen, T Yoshioka, S Gollakota
[University of Washington & Microsoft]
Real-time target sound extraction. The paper presents the first neural network model to achieve real-time, streaming target sound extraction. To this end it proposes Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder. This hybrid design uses dilated causal convolutions to process large receptive fields in a computationally efficient manner, while also benefiting from the performance of transformer-based architectures. Evaluations show a 2.2-3.3 dB improvement in SI-SNRi over prior models for this task, with a 1.2-4x smaller model size and a 1.5-2x lower runtime.

We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner, while also benefiting from the performance transformer-based architectures provide. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. Open-source code and datasets: this https URL

https://arxiv.org/abs/2211.02250
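
The streaming property comes from the encoder's dilated causal convolutions: left-only padding makes each output depend solely on past samples, and exponentially growing dilation covers a large receptive field at low cost. Here is a minimal PyTorch sketch of that building block, with channel counts and depth chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated causal 1-D convolution: pad only on the left so the
    output at time t never sees inputs after t (streaming-safe)."""
    def __init__(self, channels: int, kernel: int = 3, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.left_pad, 0)))

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially with depth while keeping per-sample compute modest.
encoder = nn.Sequential(*[CausalConv1d(32, dilation=2**i) for i in range(6)])
x = torch.randn(1, 32, 16000)  # one second of 16 kHz features
print(encoder(x).shape)        # torch.Size([1, 32, 16000])
```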

 

5. [LG] A General Purpose Neural Architecture for Geospatial Systems

N Rahaman, M Weiss, F Träuble...
[Mila & Max Planck Institute for Intelligent Systems & AWS AI Service & Now Research]
A general-purpose neural architecture for geospatial systems. Geospatial Information Systems support a wide variety of important applications for researchers and Humanitarian Assistance and Disaster Response (HADR) practitioners, but collaboration between these actors is difficult because geospatial data modalities are heterogeneous (e.g., multi-spectral imagery at various resolutions, time series, weather data) and tasks are diverse (e.g., regressing human-activity indicators or detecting forest fires). The paper lays out a roadmap toward a general-purpose neural architecture (GPNA) with a geospatial inductive bias, pre-trained on large amounts of unlabelled earth-observation data in a self-supervised manner, and envisions how such a model could facilitate cooperation among members of the community. Preliminary results on the roadmap's first step instantiate an architecture that can process a wide variety of geospatial data modalities, and show it achieves performance competitive with domain-specific architectures on tasks related to the U.N.'s Sustainable Development Goals.

Geospatial Information Systems are used by researchers and Humanitarian Assistance and Disaster Response (HADR) practitioners to support a wide variety of important applications. However, collaboration between these actors is difficult due to the heterogeneous nature of geospatial data modalities (e.g., multi-spectral images of various resolutions, timeseries, weather data) and diversity of tasks (e.g., regression of human activity indicators or detecting forest fires). In this work, we present a roadmap towards the construction of a general-purpose neural architecture (GPNA) with a geospatial inductive bias, pre-trained on large amounts of unlabelled earth observation data in a self-supervised manner. We envision how such a model may facilitate cooperation between members of the community. We show preliminary results on the first step of the roadmap, where we instantiate an architecture that can process a wide variety of geospatial data modalities and demonstrate that it can achieve competitive performance with domain-specific architectures on tasks relating to the U.N.'s Sustainable Development Goals.

https://arxiv.org/abs/2211.02348
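
A recurring pattern for this kind of architecture is to map each heterogeneous modality into a shared token space and let a single backbone attend across all tokens. The sketch below illustrates that general pattern; the modality names, dimensions, and modules are hypothetical assumptions and not the GPNA design itself.

```python
import torch
import torch.nn as nn

class MultiModalGeoSketch(nn.Module):
    """Hypothetical illustration: per-modality linear tokenizers feed a
    shared transformer, so imagery bands and weather series can be
    processed jointly. Not the paper's actual architecture."""
    def __init__(self, modality_dims: dict[str, int], dim: int = 64):
        super().__init__()
        self.tokenizers = nn.ModuleDict(
            {name: nn.Linear(d, dim) for name, d in modality_dims.items()}
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # inputs[name]: (batch, num_tokens, modality_dims[name])
        tokens = torch.cat(
            [self.tokenizers[name](x) for name, x in inputs.items()], dim=1
        )
        return self.backbone(tokens)  # (batch, total_tokens, dim)

model = MultiModalGeoSketch({"multispectral": 12, "weather": 5})
out = model({
    "multispectral": torch.randn(2, 16, 12),  # 16 pixel tokens, 12 bands
    "weather": torch.randn(2, 8, 5),          # 8 time steps, 5 variables
})
print(out.shape)  # torch.Size([2, 24, 64])
```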

 

A few more papers worth noting:

 

[CV] FactorMatte: Redefining Video Matting for Re-Composition Tasks

Z Gu, W Xian, N Snavely, A Davis
[Cornell Tech]
https://arxiv.org/abs/2211.02145

[AS] Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

J Koo, M A. Martinez-Ramirez, W Liao, S Uhlich, K Lee, Y Mitsufuji
[Sony Group Corporation & Seoul National University]
https://arxiv.org/abs/2211.02247

 

[CL] BERT for Long Documents: A Case Study of Automated ICD Coding

A Afkanpour, S Adeel, H Bassani, A Epshteyn...
[Google]
https://arxiv.org/abs/2211.02519

 

[LG] Rickrolling the Artist: Injecting Invisible Backdoors into Text-Guided Image Generation Models

L Struppek, D Hintersdorf, K Kersting
[Technical University of Darmstadt]
https://arxiv.org/abs/2211.02408

 

 

 
