爱可可 AI Frontier Picks (12.3)

LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics

1. [LG] The signature and cusp geometry of hyperbolic knots

A Davies, A Juhász, M Lackenby, N Tomasev

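This paper introduces a new real-valued invariant of hyperbolic knots in the 3-sphere, the natural slope, defined in terms of cusp geometry. Twice the knot signature and the natural slope differ by at most a constant times the hyperbolic volume divided by the cube of the injectivity radius. The inequality was discovered by using machine learning to detect relationships between knot invariants, and it has applications to Dehn surgery and the 4-ball genus. The paper also gives a refined version in which the upper bound is a linear function of the volume and the slope is corrected by terms corresponding to short geodesics that link the knot an odd number of times.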

We introduce a new real-valued invariant called the natural slope of a hyperbolic knot in the 3-sphere, which is defined in terms of its cusp geometry. We show that twice the knot signature and the natural slope differ by at most a constant times the hyperbolic volume divided by the cube of the injectivity radius. This inequality was discovered using machine learning to detect relationships between various knot invariants. It has applications to Dehn surgery and to 4-ball genus. We also show a refined version of the inequality where the upper bound is a linear function of the volume, and the slope is corrected by terms corresponding to short geodesics that link the knot an odd number of times.
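The main inequality described above can be written compactly as follows. The notation is chosen here for illustration and is an assumption, not quoted from the abstract: $\sigma(K)$ is the signature of the knot $K$, $\operatorname{slope}(K)$ its natural slope, $\operatorname{vol}(K)$ its hyperbolic volume, $\operatorname{inj}(K)$ the injectivity radius, and $c$ an unspecified constant.

```latex
\[
  \bigl|\, 2\sigma(K) - \operatorname{slope}(K) \,\bigr|
  \;\le\;
  c \cdot \frac{\operatorname{vol}(K)}{\operatorname{inj}(K)^{3}}
\]
```

Read this way, the bound says the natural slope approximates twice the signature well whenever the volume is small relative to the cube of the injectivity radius.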

https://weibo.com/1402400261/L4h2r3wv8

2. [LG] Donut: Document Understanding Transformer without OCR

G Kim, T Hong, M Yim, J Park, J Yim, W Hwang, S Yun, D Han, S Park

[Clova AI Research, NAVER Corp.]

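Donut is a document understanding Transformer that works without OCR. Understanding document images such as invoices is a long-standing research topic with many applications in document processing automation. Following recent advances in deep-learning-based optical character recognition (OCR), current visual document understanding (VDU) systems are built on top of OCR. These systems promise reasonable performance, but they inherit two problems from the OCR stage: (1) high computational cost and (2) performance degradation caused by OCR error propagation. The paper proposes a new VDU model that can be trained end to end without an OCR framework, together with a new pre-training task and a synthetic document image generator that reduce the dependence on large-scale real document images. The method achieves state-of-the-art performance on a range of document understanding tasks, on public benchmarks as well as private commercial service datasets, and extensive experiments and analysis show its effectiveness with real-world applications in mind.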

Understanding document images (e.g., invoices) has been an important research topic and has many applications in document processing automation. Through the latest advances in deep learning-based Optical Character Recognition (OCR), current Visual Document Understanding (VDU) systems have come to be designed based on OCR. Although such OCR-based approaches promise reasonable performance, they suffer from critical problems induced by the OCR, e.g., (1) expensive computational costs and (2) performance degradation due to OCR error propagation. In this paper, we propose a novel VDU model that is end-to-end trainable without an underlying OCR framework. To this end, we propose a new task and a synthetic document image generator to pre-train the model and mitigate the dependence on large-scale real document images. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed model, especially with real-world applications in mind.

https://weibo.com/1402400261/L4h6v5Gpo

3. [CV] RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs

M Niemeyer, J T. Barron, B Mildenhall, M S. M. Sajjadi, A Geiger, N Radwan

[Max Planck Institute for Intelligent Systems & University of Tübingen & Google Research]

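RegNeRF regularizes neural radiance fields for view synthesis from sparse inputs. Neural Radiance Fields (NeRF) have become a powerful representation for novel view synthesis thanks to their simplicity and state-of-the-art performance. NeRF produces photorealistic renderings of unseen viewpoints when many input views are available, but its performance drops markedly as the number of views shrinks. Most artifacts in the sparse-input setting come from errors in the estimated scene geometry and from divergent behavior at the start of training. The paper addresses this by regularizing the geometry and appearance of patches rendered from unobserved viewpoints and by annealing the ray sampling space during training. It also uses a normalizing flow model to regularize the color of unobserved viewpoints. The resulting model outperforms other methods that optimize over a single scene and, in many cases, also conditional models that are extensively pre-trained on large multi-view datasets.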

Neural Radiance Fields (NeRF) have emerged as a powerful representation for the task of novel view synthesis due to their simplicity and state-of-the-art performance. Though NeRF can produce photorealistic renderings of unseen viewpoints when many input views are available, its performance drops significantly when this number is reduced. We observe that the majority of artifacts in sparse input scenarios are caused by errors in the estimated scene geometry, and by divergent behavior at the start of training. We address this by regularizing the geometry and appearance of patches rendered from unobserved viewpoints, and annealing the ray sampling space during training. We additionally use a normalizing flow model to regularize the color of unobserved viewpoints. Our model outperforms not only other methods that optimize over a single scene, but in many cases also conditional models that are extensively pre-trained on large multi-view datasets.
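To make "annealing the ray sampling space" concrete, here is a minimal sketch of one way such a schedule could look: sampling starts in a narrow interval around the midpoint of the scene bounds and widens linearly to the full interval. The function name, the linear schedule, and the parameters `anneal_steps` and `start_fraction` are assumptions made for illustration; the abstract does not specify them.

```python
def annealed_bounds(step, near, far, anneal_steps=1000, start_fraction=0.5):
    """Return the (near, far) ray sampling interval to use at a training step.

    Hypothetical schedule: the interval is centered on the midpoint of
    [near, far] and grows linearly from `start_fraction` of the full width
    to the full width over `anneal_steps` steps.
    """
    if anneal_steps <= 0:
        raise ValueError("anneal_steps must be positive")
    mid = 0.5 * (near + far)
    # Fraction of the full interval in use, clamped to [start_fraction, 1].
    eta = min(max(step / anneal_steps, start_fraction), 1.0)
    return mid + (near - mid) * eta, mid + (far - mid) * eta


if __name__ == "__main__":
    for step in (0, 750, 1000, 5000):
        print(step, annealed_bounds(step, near=2.0, far=6.0))
```

Restricting early samples to a small region is one plausible way to limit the divergent behavior at the start of training that the abstract mentions; the specific numbers above are placeholders.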

https://weibo.com/1402400261/L4h9U0aAN

4. [LG] Show Your Work: Scratchpads for Intermediate Computation with Language Models

M Nye, A J Andreassen, G Gur-Ari, H Michalewski, J Austin, D Bieber, D Dohan, A Lewkowycz, M Bosma, D Luan, C Sutton, A Odena

[MIT & Google Research]

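Show Your Work proposes scratchpads for intermediate computation in language models. Large pre-trained language models perform very well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs, but they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. The paper finds that the same models can carry out complex multi-step computations, even in the few-shot regime, when asked to perform the operation "step by step" and show the results of intermediate computations. Transformers are trained to perform multi-step computation by emitting the intermediate steps into a "scratchpad". Across a series of increasingly complex tasks, from long addition to the execution of arbitrary programs, scratchpads greatly improve the ability of language models to perform multi-step computation.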

Large pre-trained language models perform remarkably well on tasks that can be done “in one pass”, such as generating realistic text (Brown et al., 2020) or synthesizing computer programs (Chen et al., 2021; Austin et al., 2021). However, they struggle with tasks that require unbounded multi-step computation, such as adding integers (Brown et al., 2020) or executing programs (Austin et al., 2021). Surprisingly, we find that these same models are able to perform complex multi-step computations, even in the few-shot regime, when asked to perform the operation “step by step”, showing the results of intermediate computations. In particular, we train Transformers to perform multi-step computations by asking them to emit intermediate computation steps into a “scratchpad”. On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
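As an illustration of the scratchpad idea for long addition, the sketch below builds a training target that spells out each digit-wise step before the final answer. The `<scratch>` delimiters, the wording of each step, and the function name `addition_scratchpad` are hypothetical choices made here, not necessarily the format used in the paper.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Build a text target that shows digit-by-digit addition, then the sum."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")
    xs, ys = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    lines = ["<scratch>"]
    digits = []
    carry = 0
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        # One line per column: the intermediate result the model must emit.
        lines.append(
            f"{x} + {y} + carry {carry} = {total}, "
            f"write {total % 10}, carry {total // 10}"
        )
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    lines.append("</scratch>")
    lines.append("".join(reversed(digits)))  # final answer after the scratchpad
    return "\n".join(lines)


if __name__ == "__main__":
    print(addition_scratchpad(29, 57))
```

In a setup like this, a model would be trained on the pair (prompt "29 + 57", target returned above), so that it learns to emit the intermediate lines before the answer rather than the answer alone.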

https://weibo.com/1402400261/L4hcGDsCh

5. [CV] Hallucinated Neural Radiance Fields in the Wild

X Chen, Q Zhang, X Li, Y Chen, Y Feng, X Wang, J Wang

[Xi’an Jiaotong University & Tencent AI Lab]

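Neural Radiance Fields (NeRF) have recently become popular for their impressive novel view synthesis. This paper studies the hallucinated NeRF problem: recovering a realistic NeRF at a different time of day from a collection of tourism images. Existing solutions use NeRF with a controllable appearance embedding to render novel views under various conditions, but they cannot render view-consistent images with an unseen appearance. To address this, the paper proposes Ha-NeRF, an end-to-end framework for constructing a hallucinated NeRF. An appearance hallucination module handles time-varying appearances and transfers them to novel views. Because tourism images contain complex occlusions, an anti-occlusion module is introduced to accurately decompose the static subjects for visibility. Experiments on synthetic data and real tourism photo collections show that the method can hallucinate the desired appearance and also render occlusion-free images from different viewpoints.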

Neural Radiance Fields (NeRF) have recently gained popularity for their impressive novel view synthesis ability. This paper studies the problem of hallucinated NeRF, i.e., recovering a realistic NeRF at a different time of day from a group of tourism images. Existing solutions adopt NeRF with a controllable appearance embedding to render novel views under various conditions, but cannot render view-consistent images with an unseen appearance. To solve this problem, we present an end-to-end framework for constructing a hallucinated NeRF, dubbed Ha-NeRF. Specifically, we propose an appearance hallucination module to handle time-varying appearances and transfer them to novel views. Considering the complex occlusions of tourism images, an anti-occlusion module is introduced to accurately decompose the static subjects for visibility. Experimental results on synthetic data and real tourism photo collections demonstrate that our method can not only hallucinate the desired appearances, but also render occlusion-free images from different views. The project and supplementary materials are available at https://rover-xingyu.github.io/Ha-NeRF/.

https://weibo.com/1402400261/L4hhaDCfv

Other papers worth noting:

[CV] Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation

W Shin, G Lee, J Lee, J Lee, E Choi

[KAIST & Google Research]

https://weibo.com/1402400261/L4hkuDjrU

[CV] Unsupervised Domain Adaptation: A Reality Check

K Musgrave, S Belongie, S Lim

[Cornell Tech & University of Copenhagen & Meta AI]

https://weibo.com/1402400261/L4hmcnhhq

[CV] CLIPstyler: Image Style Transfer with a Single Text Condition

G Kwon, J C Ye

[KAIST]

https://weibo.com/1402400261/L4hoekvvG

[CV] VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion

N Stier, A Rich, P Sen, T Höllerer

[University of California, Santa Barbara]

https://weibo.com/1402400261/L4hpFxChF
