LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: emergent abilities of large language models; real-time 3D plane detection and reconstruction from posed monocular videos; fast 3D-aware image synthesis with sparse voxel grids; memory-based model editing at scale; variable bitrate neural fields; wide Bayesian neural networks have a simple weight posterior; the case for a single model that can both generate continuations and fill in the blank; a unified sequence interface for vision tasks; building modular encoder-decoder models.
1. [CL] Emergent Abilities of Large Language Models
J Wei, Y Tay, R Bommasani, C Raffel, B Zoph, S Borgeaud, D Yogatama, M Bosma, D Zhou...
[Google Research & Stanford University & UNC Chapel Hill & DeepMind]
Emergent abilities of large language models. Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon called emergent abilities of large language models: an ability is considered emergent if it is not present in smaller models but is present in larger ones, and therefore cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models. Emergent abilities appear across a variety of language models, task types, and experimental settings. They are a recently discovered consequence of scaling, and how they arise, and whether further scaling will bring more of them, looks like an important direction for future NLP research. (A toy numerical sketch of the extrapolation point follows this entry.)
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
https://arxiv.org/abs/2206.07682
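To make the "cannot be predicted by extrapolation" claim concrete, here is a toy numerical sketch with entirely made-up accuracy numbers (none of them from the paper): a linear fit in log-compute to near-chance small-model scores predicts only a marginal gain at a much larger scale, whereas an emergent ability would appear as a jump the fit gives no hint of.

```python
# Toy illustration (hypothetical numbers, not from the paper): why extrapolating
# small-model accuracy can miss an emergent jump at larger scale.
import numpy as np

# Hypothetical few-shot accuracy of "small" models at increasing training compute.
flops = np.array([1e19, 1e20, 1e21, 1e22])
acc = np.array([0.25, 0.26, 0.25, 0.27])   # hovering near the 25% chance level

# Linear extrapolation in log-compute from the small models...
slope, intercept = np.polyfit(np.log10(flops), acc, 1)
predicted_at_1e24 = slope * 24 + intercept
print(f"extrapolated accuracy at 1e24 FLOPs: {predicted_at_1e24:.2f}")   # ~0.28

# ...whereas the observed accuracy at 1e24 FLOPs might jump to, say, 0.70 --
# an emergent ability that the small-model trend gave no hint of.
```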
2. [CV] PlanarRecon: Real-time 3D Plane Detection and Reconstruction from Posed Monocular Videos
Y Xie, M Gadelha, F Yang...
[Northeastern University & Adobe Research & The Pennsylvania State University & Zhejiang University]
PlanarRecon: real-time 3D plane detection and reconstruction from posed monocular videos. PlanarRecon is a new framework for globally coherent detection and reconstruction of 3D planes from a posed monocular video. Unlike previous work that detects 2D planes from a single image, PlanarRecon uses neural networks to incrementally detect 3D planes from a volumetric representation of the scene for each video fragment, which consists of a set of keyframes. A learning-based tracking and fusion module merges planes from previous fragments into a coherent global plane reconstruction. The key idea is to incrementally detect, match, and fuse 3D planes per video fragment using a volumetric scene representation together with the learned tracking-and-fusion module. This design lets PlanarRecon integrate multiple views within each fragment and temporal information across fragments, producing an accurate and coherent low-polygon reconstruction of the scene abstraction. Experiments show that the approach achieves state-of-the-art performance on the ScanNet dataset while running in real time. (A toy sketch of the detect-then-fuse loop follows this entry.)
We present PlanarRecon – a novel framework for globally coherent detection and reconstruction of 3D planes from a posed monocular video. Unlike previous works that detect planes in 2D from a single image, PlanarRecon incrementally detects planes in 3D for each video fragment, which consists of a set of key frames, from a volumetric representation of the scene using neural networks. A learning-based tracking and fusion module is designed to merge planes from previous fragments to form a coherent global plane reconstruction. Such design allows PlanarRecon to integrate observations from multiple views within each fragment and temporal information across different ones, resulting in an accurate and coherent reconstruction of the scene abstraction with low-polygonal geometry. Experiments show that the proposed approach achieves state-of-the-art performances on the ScanNet dataset while being real-time. Code is available at the project page: https://neu-vi.github.io/planarrecon/.
https://arxiv.org/abs/2206.07710
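As a structural illustration of the PlanarRecon entry above, here is a minimal toy sketch (the editor's own, not the authors' code) of the per-fragment detect-then-track-and-fuse loop described in the abstract. `Plane`, `detect_planes`, and `fuse` are hypothetical stand-ins: in the actual system the detector is a neural network operating on a volumetric scene representation, and tracking/fusion is learned rather than the simple geometric matching used here.

```python
# Toy sketch of an incremental detect -> track -> fuse loop over keyframe fragments.
from dataclasses import dataclass
import numpy as np

@dataclass
class Plane:
    normal: np.ndarray   # unit normal n of the plane n·x = d
    offset: float        # offset d

def detect_planes(fragment_keyframes):
    """Hypothetical stand-in for the learned per-fragment 3D plane detector
    (in the paper it runs on a volumetric feature representation)."""
    return [Plane(np.array([0.0, 0.0, 1.0]), 1.0)]

def fuse(global_planes, new_planes, angle_thresh=0.95, dist_thresh=0.1):
    """Toy geometric stand-in for the learned tracking-and-fusion module:
    a new plane is merged into an existing one if normals and offsets agree."""
    for p in new_planes:
        for q in global_planes:
            if p.normal @ q.normal > angle_thresh and abs(p.offset - q.offset) < dist_thresh:
                # Merge by averaging parameters (the paper learns this step instead).
                merged = q.normal + p.normal
                q.normal = merged / np.linalg.norm(merged)
                q.offset = 0.5 * (q.offset + p.offset)
                break
        else:
            global_planes.append(p)   # unmatched plane starts a new global plane
    return global_planes

# Incremental reconstruction over a posed video split into keyframe fragments.
global_planes = []
for fragment in [["kf0", "kf1"], ["kf2", "kf3"]]:   # hypothetical fragments
    global_planes = fuse(global_planes, detect_planes(fragment))
print(len(global_planes), "global plane(s)")
```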
3. [CV] VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids
K Schwarz, A Sauer, M Niemeyer, Y Liao, A Geiger
[University of Tübingen & Zhejiang University]
VoxGRAF: fast 3D-aware image synthesis with sparse voxel grids. State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While they produce impressive results, querying an MLP for every sample along each ray makes rendering slow, so existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Although efficient, such neural rendering tends to entangle viewpoint and content, so changing the camera pose can cause unwanted changes in geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, this paper investigates the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling. The results show that monolithic MLPs can indeed be replaced by 3D convolutions when sparse voxel grids are combined with progressive growing, free-space pruning, and appropriate regularization. To obtain a compact representation of the scene and allow scaling to higher voxel resolutions, the model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, the method needs only a single forward pass to generate a full 3D scene, enabling efficient rendering from arbitrary viewpoints while producing 3D-consistent results with high visual fidelity. (A toy sketch of the pruning and compositing ideas follows this entry.)
State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While demonstrating impressive results, querying an MLP for every sample along each ray leads to slow rendering. Therefore, existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Albeit efficient, neural rendering often entangles viewpoint and content such that changing the camera pose results in unwanted changes of geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, we investigate the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling in this paper. Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, our method requires only a single forward pass to generate a full 3D scene. It hence allows for efficient rendering from arbitrary viewpoints while yielding 3D consistent results with high visual fidelity.
https://arxiv.org/abs/2206.07695
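The sketch below is a toy construction by the editor, not the authors' implementation; it only illustrates two ideas named in the abstract: pruning free space so that only occupied voxels of a density grid are stored, and compositing a 3D-modeled foreground over a 2D-modeled background with per-pixel alpha. All array sizes and contents are made up.

```python
# Toy sketch: free-space pruning of a voxel grid + foreground/background compositing.
import numpy as np

# Hypothetical dense density grid (in VoxGRAF this comes from a 3D-convolutional generator).
res = 64
density = np.random.rand(res, res, res) ** 8      # most values are close to zero (free space)

# Free-space pruning: keep only voxels with non-negligible density.
occupied = density > 0.1
sparse_coords = np.argwhere(occupied)             # (N, 3) indices of kept voxels
sparse_density = density[occupied]                # densities for the kept voxels
print(f"kept {sparse_coords.shape[0]} of {res ** 3} voxels")

# Foreground/background disentanglement: composite the rendered foreground
# (RGB + alpha from the voxel grid) over a 2D background image.
H = W = 32
fg_rgb = np.random.rand(H, W, 3)                  # hypothetical rendered foreground colors
fg_alpha = np.random.rand(H, W, 1)                # foreground opacity per pixel
bg_rgb = np.random.rand(H, W, 3)                  # 2D-modeled background
image = fg_alpha * fg_rgb + (1.0 - fg_alpha) * bg_rgb   # alpha compositing
print("composited image shape:", image.shape)
```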
4. [CL] Memory-Based Model Editing at Scale
E Mitchell, C Lin, A Bosselut, C D. Manning, C Finn
[Stanford University & EPFL]
Memory-based model editing at scale. Even the largest neural networks make errors, and once-correct predictions can become invalid as the world changes. Model editors make local updates to the behavior of base (pre-trained) models to inject updated knowledge or correct undesirable behavior. Existing model editors show promise but suffer from insufficient expressiveness: they struggle to accurately model an edit's intended scope (the examples affected by the edit), leading to inaccurate predictions for test inputs loosely related to the edit, and they often fail altogether after many edits. As a higher-capacity alternative, this paper proposes Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model (SERAC), which stores edits in an explicit external memory and learns to reason over them to modulate the base model's predictions as needed. To enable more rigorous evaluation of model editors, the paper introduces three challenging language model editing problems based on question answering, fact-checking, and dialogue generation. Experiments find that only SERAC achieves high performance on all three problems, consistently outperforming existing model editing approaches by a significant margin. (A toy routing sketch of the memory-based idea follows this entry.)
Even the largest neural networks make errors, and once-correct predictions can become invalid as the world changes. Model editors make local updates to the behavior of base (pre-trained) models to inject updated knowledge or correct undesirable behaviors. Existing model editors have shown promise, but also suffer from insufficient expressiveness: they struggle to accurately model an edit’s intended scope (examples affected by the edit), leading to inaccurate predictions for test inputs loosely related to the edit, and they often fail altogether after many edits. As a higher-capacity alternative, we propose Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model (SERAC), which stores edits in an explicit memory and learns to reason over them to modulate the base model’s predictions as needed. To enable more rigorous evaluation of model editors, we introduce three challenging language model editing problems based on question answering, fact-checking, and dialogue generation. We find that only SERAC achieves high performance on all three problems, consistently outperforming existing approaches to model editing by a significant margin. Code, data, and additional project information will be made available at https://sites.google.com/view/serac-editing.
https://arxiv.org/abs/2206.06520
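The toy routing sketch below reflects the editor's reading of the abstract (it is not the released SERAC code): edits live in an explicit memory; if an input falls inside the scope of a stored edit, the answer comes from an edit-conditioned "counterfactual" model, otherwise the base model is left untouched. `in_scope`, `counterfactual_model`, and `base_model` are hypothetical stand-ins for the learned components.

```python
# Toy control flow for memory-based editing: route in-scope inputs to an
# edit-conditioned model, leave everything else to the unchanged base model.
from typing import Callable, List, Tuple

Edit = Tuple[str, str]   # (edit input, desired output)

def edited_predict(x: str,
                   memory: List[Edit],
                   in_scope: Callable[[str, str], bool],
                   counterfactual_model: Callable[[str, Edit], str],
                   base_model: Callable[[str], str]) -> str:
    for edit in memory:                          # retrieve a relevant stored edit, if any
        if in_scope(x, edit[0]):
            return counterfactual_model(x, edit)   # modulate the prediction
    return base_model(x)                         # out of scope: base behavior unchanged

# Hypothetical usage with trivial stand-in components.
memory = [("example edit input", "updated answer")]
in_scope = lambda x, edit_input: x.lower() == edit_input.lower()
counterfactual_model = lambda x, edit: edit[1]
base_model = lambda x: "(base model answer)"

print(edited_predict("Example edit input", memory, in_scope, counterfactual_model, base_model))
print(edited_predict("an unrelated query", memory, in_scope, counterfactual_model, base_model))
```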
5. [CV] Variable Bitrate Neural Fields
T Takikawa, A Evans, J Tremblay, T Müller, M McGuire, A Jacobson, S Fidler
[NVIDIA & University of Waterloo & Adobe Research]
Variable bitrate neural fields. Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning the neural approximation on lookups from trainable feature grids, which take on part of the learning task and allow smaller, more efficient neural networks. Unfortunately, such feature grids usually come at the cost of significantly increased memory consumption compared with stand-alone neural network models. This paper presents a dictionary method for compressing these feature grids, reducing their memory consumption by up to 100x and permitting a multiresolution representation that is useful for out-of-core streaming. The dictionary optimization is formulated as a vector-quantized auto-decoder problem, which makes it possible to learn end-to-end discrete neural representations in a space without direct supervision and with dynamic topology and structure. (A back-of-the-envelope sketch of the dictionary idea follows this entry.)
Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids that take on part of the learning task and allow for smaller, more efficient neural networks. Unfortunately, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. We present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100x and permitting a multiresolution representation which can be useful for out-of-core streaming. We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure. Our source code will be available at this https URL.
https://arxiv.org/abs/2206.07707
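As a back-of-the-envelope illustration of the dictionary idea (the editor's own made-up sizes, not the paper's configuration): if each grid cell stores a small integer index into a learned codebook instead of a full float32 feature vector, storage shrinks roughly by the ratio of index bits to feature bytes, plus the comparatively tiny codebook itself.

```python
# Toy size estimate for a vector-quantized feature grid (all numbers hypothetical).
import numpy as np

res, feat_dim = 64, 16            # hypothetical grid resolution and feature width
codebook_size = 2 ** 12           # 4096 codebook entries -> 12-bit indices
n_cells = res ** 3

dense_bytes = n_cells * feat_dim * 4                          # float32 feature per cell
vq_bytes = n_cells * 12 / 8 + codebook_size * feat_dim * 4    # indices + codebook
print(f"dense grid: {dense_bytes / 1e6:.1f} MB, "
      f"vector-quantized: {vq_bytes / 1e6:.1f} MB "
      f"(~{dense_bytes / vq_bytes:.0f}x smaller)")

# Decoding at lookup time: a cell's feature vector is simply its codebook row.
codebook = np.random.randn(codebook_size, feat_dim).astype(np.float32)
indices = np.random.randint(0, codebook_size, size=n_cells)   # learned with the VQ auto-decoder
features = codebook[indices]                                  # (n_cells, feat_dim)
print("decoded feature array:", features.shape)
```

With these made-up sizes the grid shrinks by roughly 26x; the paper reports up to 100x on its actual representations.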
A few other papers worth noting:
[LG] Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling
J Hron, R Novak, J Pennington, J Sohl-Dickstein
[Google Research & University of Cambridge]
https://arxiv.org/abs/2206.07673
[CL] The Case for a Single Model that can Both Generate Continuations and Fill in the Blank
D Ippolito, L Dugan, E Reif, A Yuan, A Coenen, C Callison-Burch
[University of Pennsylvania & Google Research]
https://arxiv.org/abs/2206.04812
[CV] A Unified Sequence Interface for Vision Tasks
T Chen, S Saxena, L Li, T Lin, D J. Fleet, G Hinton
[Google Research]
https://arxiv.org/abs/2206.07669
[CL] LegoNN: Building Modular Encoder-Decoder Models
S Dalmia, D Okhonko, M Lewis, S Edunov, S Watanabe, F Metze, L Zettlemoyer, A Mohamed
[Meta AI]
https://arxiv.org/abs/2206.03318