LG - Machine Learning · CV - Computer Vision · CL - Computation and Language · AS - Audio and Speech · RO - Robotics
Reposted from 爱可可爱生活
1. [LG] Restoring and attributing ancient texts using deep neural networks
Y Assael, T Sommerschield, B Shillingford, M Bordbar, J Pavlopoulos, M Chatzipanagiotou, I Androutsopoulos, J Prag, N de Freitas
[DeepMind & Ca' Foscari University of Venice & Athens University of Economics and Business & University of Oxford]
Restoring and attributing ancient texts using deep neural networks. Ancient history relies on disciplines such as epigraphy (the study of inscribed texts known as inscriptions) for evidence of the thought, language, society, and history of past civilizations. Over the centuries, however, many inscriptions have been damaged to the point of illegibility, transported far from their original locations, and their dates of writing remain steeped in uncertainty. This paper presents Ithaca, a deep neural network for the textual restoration, geographical attribution, and chronological attribution of ancient Greek inscriptions. It could transform the value of inscriptions as historical sources and help historians build a more comprehensive picture of the distribution and nature of epigraphic habits across the ancient world. Ithaca is designed to assist and extend the historian's workflow, with an architecture focused on collaboration, decision support, and interpretability. While Ithaca alone achieves 62% accuracy when restoring damaged texts, historians using Ithaca improved their accuracy from 25% to 72%, confirming the synergistic effect of the research tool. Ithaca can attribute inscriptions to their original location with 71% accuracy and can date them to within 30 years of their ground-truth ranges; it has redated key texts of Classical Athens, contributing to topical debates in ancient history. This work shows that models like Ithaca can unlock the cooperative potential between artificial intelligence and historians, with a transformative impact on how we study and write about one of the most important periods in human history.
Ancient history relies on disciplines such as epigraphy—the study of inscribed texts known as inscriptions—for evidence of the thought, language, society and history of past civilizations. However, over the centuries, many inscriptions have been damaged to the point of illegibility, transported far from their original location and their date of writing is steeped in uncertainty. Here we present Ithaca, a deep neural network for the textual restoration, geographical attribution and chronological attribution of ancient Greek inscriptions. Ithaca is designed to assist and expand the historian’s workflow. The architecture of Ithaca focuses on collaboration, decision support and interpretability. While Ithaca alone achieves 62% accuracy when restoring damaged texts, the use of Ithaca by historians improved their accuracy from 25% to 72%, confirming the synergistic effect of this research tool. Ithaca can attribute inscriptions to their original location with an accuracy of 71% and can date them to less than 30 years of their ground-truth ranges, redating key texts of Classical Athens and contributing to topical debates in ancient history. This research shows how models such as Ithaca can unlock the cooperative potential between artificial intelligence and historians, transformationally impacting the way that we study and write about one of the most important periods in human history.
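To make the multi-task setup concrete, here is a minimal PyTorch sketch (not DeepMind's released code) of one encoder over a damaged character sequence with separate heads for restoration, geographical attribution, and chronological attribution. All module names and sizes are illustrative assumptions; the actual Ithaca architecture is a much larger transformer over character and word inputs.

```python
# Illustrative sketch only: a toy multi-task model in the spirit of the
# abstract. Vocabulary size, region count, and date bins are assumptions.
import torch
import torch.nn as nn

class InscriptionModel(nn.Module):
    def __init__(self, vocab_size=128, n_regions=84, n_date_bins=160, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)  # mask ids for damaged characters live in the vocab
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.restore_head = nn.Linear(d, vocab_size)   # per-position character logits
        self.region_head = nn.Linear(d, n_regions)     # geographical attribution
        self.date_head = nn.Linear(d, n_date_bins)     # distribution over discretized dates

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))           # (B, T, d)
        pooled = h.mean(dim=1)                         # (B, d) summary for the attribution heads
        return self.restore_head(h), self.region_head(pooled), self.date_head(pooled)

model = InscriptionModel()
tokens = torch.randint(0, 128, (1, 60))                # one damaged inscription, 60 characters
char_logits, region_logits, date_logits = model(tokens)
```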
2. [CV] StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis
J Gu, L Liu, P Wang, C Theobalt
[Facebook AI & Max Planck Institute for Informatics & The University of Hong Kong]
StyleNeRF: a style-based 3D-aware generator for high-resolution image synthesis. This paper proposes StyleNeRF, a 3D-aware generative model for photo-realistic, high-resolution image synthesis with strong multi-view consistency that can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or produce noticeably 3D-inconsistent artifacts, and many also lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates a neural radiance field (NeRF) into a style-based generator to tackle these challenges, improving rendering efficiency and 3D consistency for high-resolution image generation. Volume rendering is performed only to produce a low-resolution feature map, and upsampling is applied progressively in 2D to address the efficiency issue. To mitigate the inconsistencies introduced by 2D upsampling, several designs are proposed, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera pose and of styles at different levels, generalizes to unseen views, and supports challenging tasks including zoom-in and zoom-out, style mixing, inversion, and semantic editing.
We propose StyleNeRF, a 3D-aware generative model for photo-realistic high-resolution image synthesis with high multi-view consistency, which can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or yield noticeable 3D-inconsistent artifacts. In addition, many of them lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates the neural radiance field (NeRF) into a style-based generator to tackle the aforementioned challenges, i.e., improving rendering efficiency and 3D consistency for high-resolution image generation. We perform volume rendering only to produce a low-resolution feature map and progressively apply upsampling in 2D to address the first issue. To mitigate the inconsistencies caused by 2D upsampling, we propose multiple designs, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera poses and different levels of styles, which can generalize to unseen views. It also supports challenging tasks, including zoom-in and -out, style mixing, inversion, and semantic editing.
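A minimal sketch of the hybrid rendering idea: volume rendering happens only over a low-resolution ray grid, and over a feature field rather than RGB, so that learned 2D upsampling can take over from there. Everything below (module names, sizes, ray-marching details) is our own illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch: a style-conditioned NeRF that emits features, plus
# the alpha-compositing integral, evaluated on a 32x32 ray grid only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureField(nn.Module):
    def __init__(self, d_style=64, d_feat=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + d_style, 128), nn.ReLU(),
                                 nn.Linear(128, d_feat + 1))

    def forward(self, pts, w):
        # pts: (N, S, 3) sample points along N rays; w: (d_style,) style code
        w = w.expand(*pts.shape[:-1], -1)
        out = self.mlp(torch.cat([pts, w], dim=-1))
        return out[..., :-1], F.softplus(out[..., -1])      # features, density

def volume_render_features(field, rays_o, rays_d, w, n_samples=16):
    # march each ray and alpha-composite per-sample features into one vector
    t = torch.linspace(0.5, 4.0, n_samples)
    pts = rays_o[:, None] + rays_d[:, None] * t[None, :, None]   # (N, S, 3)
    feats, sigma = field(pts, w)
    alpha = 1 - torch.exp(-sigma * (t[1] - t[0]))
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1 - alpha[:, :-1] + 1e-10], dim=1), dim=1)
    return ((alpha * trans)[..., None] * feats).sum(dim=1)       # (N, d_feat)

field, w = FeatureField(), torch.randn(64)
rays_o, rays_d = torch.zeros(32 * 32, 3), torch.randn(32 * 32, 3)
feat = volume_render_features(field, rays_o, rays_d, w)
feat_map = feat.reshape(32, 32, -1).permute(2, 0, 1)[None]       # (1, 32, 32, 32)
```

From here, the paper's progressive 2D upsampler (paired with its regularization loss to guard 3D consistency) would map the 32×32 feature map up to the output resolution.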
3. [CV] On the surprising tradeoff between ImageNet accuracy and perceptual similarity
M Kumar, N Houlsby, N Kalchbrenner, E D. Cubuk
[Google Research]
The surprising inverse correlation between ImageNet accuracy and perceptual similarity. Perceptual distances between images, measured in the feature space of pre-trained deep networks, have outperformed earlier low-level, pixel-based metrics for assessing image similarity. While the ability of older, less accurate models such as AlexNet and VGG to capture perceptual similarity is well known, modern, more accurate models have received little study. This paper observes a surprising inverse correlation between ImageNet accuracy and Perceptual Score for modern networks such as ResNets, EfficientNets, and Vision Transformers: better classifiers achieve worse Perceptual Scores. A large-scale study examines the ImageNet accuracy/Perceptual Score relationship while varying depth, width, number of training steps, weight decay, label smoothing, and dropout. Higher accuracy improves Perceptual Score up to a point, but in the mid-to-high accuracy regime there is a Pareto frontier between accuracy and Perceptual Score. Probing this relationship further with distortion invariance, spatial frequency sensitivity, and alternative perceptual functions, the authors find shallow ResNets, trained for fewer than 5 epochs on ImageNet alone, whose emergent Perceptual Score matches that of the prior best networks trained directly on supervised human perceptual judgements.
Perceptual distances between images, as measured in the space of pre-trained deep features, have outperformed prior low-level, pixel-based metrics on assessing image similarity. While the capabilities of older and less accurate models such as AlexNet and VGG to capture perceptual similarity are well known, modern and more accurate models are less studied. First, we observe a surprising inverse correlation between ImageNet accuracy and Perceptual Scores of modern networks such as ResNets, EfficientNets, and Vision Transformers: that is, better classifiers achieve worse Perceptual Scores. Then, we perform a large-scale study and examine the ImageNet accuracy/Perceptual Score relationship on varying the depth, width, number of training steps, weight decay, label smoothing, and dropout. Higher accuracy improves Perceptual Score up to a certain point, but we uncover a Pareto frontier between accuracies and Perceptual Score in the mid-to-high accuracy regime. We explore this relationship further using distortion invariance, spatial frequency sensitivity, and alternative perceptual functions. Interestingly, we discover shallow ResNets, trained for less than 5 epochs only on ImageNet, whose emergent Perceptual Score matches the prior best networks trained directly on supervised human perceptual judgements.
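For context, "perceptual distance in pre-trained feature space" typically means comparing unit-normalized activations of a fixed network at several layers, LPIPS-style; the Perceptual Score then measures how well such a distance agrees with human similarity judgements. A minimal unweighted sketch, assuming torchvision's VGG16 and illustrative layer choices:

```python
# Unweighted LPIPS-style distance; the layer taps are illustrative choices,
# not the paper's exact configuration.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

def perceptual_distance(x, y, taps=(3, 8, 15, 22)):    # relu1_2 ... relu4_3
    """x, y: (1, 3, H, W) images, normalized as torchvision expects."""
    d = 0.0
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x, y = layer(x), layer(y)
            if i in taps:
                # compare channel-unit-normalized activations at this layer
                xn = x / (x.norm(dim=1, keepdim=True) + 1e-8)
                yn = y / (y.norm(dim=1, keepdim=True) + 1e-8)
                d = d + ((xn - yn) ** 2).mean()
    return d

x, y = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
print(float(perceptual_distance(x, y)))
```

The paper's finding, in these terms, is that swapping a more accurate backbone into such a metric tends to make it agree less with human judgements.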
4. [CL] GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records
X Yang, N PourNejatian, H C Shin, K E Smith, C Parisien, C Compas, C Martin, M G Flores, Y Zhang, T Magoc, C A Harle, G Lipori, D A Mitchell, W R Hogan, E A Shenkman, J Bian, Y Wu
[University of Florida & NVIDIA]
GatorTron: a large clinical language model to unlock patient information from unstructured electronic health records. There is growing interest in developing massive deep learning models for natural language processing (NLP), the key technology for extracting patient information from unstructured electronic health records (EHRs). However, studies exploring large language models in the clinical domain are limited; the largest clinical NLP model to date was trained with 110 million parameters (compared with 175 billion in the general domain), and it remains unclear how large NLP models can help machines understand patients' clinical information in unstructured EHRs. This paper develops GatorTron, a large clinical transformer model trained on more than 90 billion words of text, and evaluates it on 5 clinical NLP tasks: clinical concept extraction, relation extraction, semantic textual similarity, natural language inference, and medical question answering. GatorTron is now the largest transformer model in the clinical domain, scaled up from the previous 110 million to 8.9 billion parameters, and achieves state-of-the-art performance on all 5 tasks, which target a range of healthcare information documented in EHRs. GatorTron models are better at understanding and using patient information from clinical narratives, in ways that can be applied to improving healthcare delivery and patient outcomes.
There is an increasing interest in developing massive-size deep learning models in natural language processing (NLP), the key technology to extract patient information from unstructured electronic health records (EHRs). However, there are limited studies exploring large language models in the clinical domain; the current largest clinical NLP model was trained with 110 million parameters (compared with 175 billion parameters in the general domain). It is not clear how large-size NLP models can help machines understand patients’ clinical information from unstructured EHRs. In this study, we developed a large clinical transformer model – GatorTron – using >90 billion words of text and evaluated it on 5 clinical NLP tasks including clinical concept extraction, relation extraction, semantic textual similarity, natural language inference, and medical question answering. GatorTron is now the largest transformer model in the clinical domain that scaled up from the previous 110 million to 8.9 billion parameters and achieved state-of-the-art performance on the 5 clinical NLP tasks targeting various healthcare information documented in EHRs. GatorTron models perform better in understanding and utilizing patient information from clinical narratives in ways that can be applied to improvements in healthcare delivery and patient outcomes.
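As a usage illustration, clinical concept extraction (one of the five tasks) is typically framed as token classification on top of the pretrained encoder. A hedged sketch with the Hugging Face transformers API; the checkpoint name and label set below are placeholders, so substitute whatever GatorTron weights and annotation scheme you actually have:

```python
# Clinical concept extraction as token classification; the classification
# head is randomly initialized here, so fine-tune on annotated notes first.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

ckpt = "UFNLP/gatortron-base"   # assumption: an available GatorTron checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(ckpt, num_labels=5)
# labels (illustrative): O, B-Problem, I-Problem, B-Drug, I-Drug

note = "Patient reports chest pain; started on aspirin 81 mg daily."
inputs = tok(note, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)
pred = logits.argmax(dim=-1)[0]
print(list(zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), pred.tolist())))
```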
5. [CV] Learning Multi-Object Dynamics with Compositional Neural Radiance Fields
D Driess, Z Huang, Y Li, R Tedrake, M Toussaint
[Technical University of Berlin & University of California, San Diego & MIT]
Learning multi-object dynamics with compositional neural radiance fields. This paper presents a method for learning compositional predictive models from image observations, based on implicit object encoders, neural radiance fields (NeRFs), and graph neural networks. Thanks to their strong 3D prior, NeRFs have become a popular choice for representing scenes. However, most NeRF approaches are trained on a single scene and represent the whole scene with one global model, making generalization to novel scenes containing different numbers of objects challenging. This paper instead proposes a compositional, object-centric auto-encoder framework that maps multiple views of a scene to a set of latent vectors, one per object. The latent vectors parameterize individual NeRF models from which the scene can be reconstructed and rendered from novel viewpoints. A graph neural network dynamics model is trained in the latent space to achieve compositionality in dynamics prediction. A key feature of the approach is that the 3D scene information learned through the NeRF model makes it possible to incorporate structural priors into the dynamics model, yielding more stable long-term predictions. For planning, RRTs are used in the learned latent space, where the model and the implicit object encoder can be exploited to make sampling of the latent space more informative and more efficient. Experiments show the model outperforms several baselines on a pushing task involving many objects.
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks. NeRFs have become a popular choice for representing scenes due to their strong 3D prior. However, most NeRF approaches are trained on a single scene, representing the whole scene with a global model, making generalization to novel scenes, containing different numbers of objects, challenging. Instead, we present a compositional, object-centric auto-encoder framework that maps multiple views of the scene to a set of latent vectors representing each object separately. The latent vectors parameterize individual NeRF models from which the scene can be reconstructed and rendered from novel viewpoints. We train a graph neural network dynamics model in the latent space to achieve compositionality for dynamics prediction. A key feature of our approach is that the learned 3D information of the scene through the NeRF model enables us to incorporate structural priors in learning the dynamics models, making long-term predictions more stable. For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient. In the experiments, we show that the model outperforms several baselines on a pushing task containing many objects.
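The latent GNN dynamics step described above can be sketched as message passing over a fully connected object graph, with the action appended at each node. Module names and sizes below are our own illustrative assumptions, not the authors' code:

```python
# Illustrative sketch: each object is a latent vector from the NeRF
# auto-encoder; a GNN predicts residual next-step latents given an action.
import torch
import torch.nn as nn

class LatentGNNDynamics(nn.Module):
    def __init__(self, d=64, d_act=4):
        super().__init__()
        self.edge = nn.Sequential(nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, d))
        self.node = nn.Sequential(nn.Linear(2 * d + d_act, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, z, action):
        # z: (n_objects, d) per-object latents; action: (1, d_act)
        n = z.shape[0]
        src = z[:, None].expand(n, n, -1)            # src[i, j] = z_i
        dst = z[None, :].expand(n, n, -1)            # dst[i, j] = z_j
        msg = self.edge(torch.cat([src, dst], -1)).sum(dim=0)  # incoming messages per object
        a = action.expand(n, -1)
        return z + self.node(torch.cat([z, msg, a], -1))       # residual next-step latents

dyn = LatentGNNDynamics()
z_next = dyn(torch.randn(3, 64), torch.randn(1, 4))  # 3 objects, one pushing action
```

Rolling this step forward from encoded latents gives the long-horizon predictions that the RRT planner searches over.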
A few more papers worth noting:
[LG] Temporal Difference Learning for Model Predictive Control
N Hansen, X Wang, H Su
[UC San Diego]
[LG] Learning Optimal Conformal Classifiers
D Stutz, K (Dj) Dvijotham, A T Cemgil, A Doucet
[DeepMind]
[CV] ZippyPoint: Fast Interest Point Detection, Description, and Matching through Mixed Precision Discretization
S Maurer, M Kanakis, M Spallanzani, A Chhatkuli, L V Gool
[ETH Zurich]
[CL] DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning
G Lin, Y Chuang, H Chung, S Yang, H Chen, S Li, A Mohamed, H Lee, L Lee
[National Taiwan University & MIT & Meta AI]