LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
1、[CL] Measuring Attribution in Natural Language Generation Models
H Rashkin, V Nikolaev, M Lamm, M Collins, D Das, S Petrov, G S Tomar, I Turc, D Reitter
[Google Research]
Measuring attribution in natural language generation models. With recent improvements in natural language generation (NLG) models across a variety of applications, it has become essential to be able to identify and evaluate whether NLG output shares only verifiable information about the external world. This paper proposes a new evaluation framework, Attributable to Identified Sources (AIS), for assessing the output of natural language generation models when that output pertains to the external world. The authors first define AIS and introduce a two-stage annotation pipeline that lets annotators evaluate model output appropriately according to AIS guidelines. The approach is empirically validated through human evaluation studies on three generation datasets (two in the conversational QA domain and one in summarization), showing that AIS can serve as a common framework for measuring whether model-generated statements are supported by underlying sources.
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on three generation datasets (two in the conversational QA domain and one in summarization) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
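As a rough illustration of the gated two-stage judgment described above, here is a minimal Python sketch: stage 2 (attribution) only counts when stage 1 (interpretability) passes, and annotator votes are aggregated into a score. The field names, the voting aggregation, and the example values are assumptions for illustration, not the paper's annotation interface.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class Annotation:
    # Stage 1: is the output interpretable on its own? (hypothetical field name)
    interpretable: bool
    # Stage 2: is all information attributable to the identified source?
    attributable: bool

def ais_score(annotations: List[Annotation]) -> float:
    """Fraction of annotators judging the output as AIS.

    Stage 2 is only consulted when stage 1 passes, mirroring the
    gated two-stage pipeline described in the abstract.
    """
    votes = [a.interpretable and a.attributable for a in annotations]
    return mean(votes)

# Example: 2 of 3 annotators judge the output attributable.
print(ais_score([Annotation(True, True),
                 Annotation(True, True),
                 Annotation(True, False)]))  # -> 0.666...
```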
2、[CV] Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations
M S. M. Sajjadi, H Meyer, E Pot, U Bergmann, K Greff, N Radwan, S Vora, M Lucic, D Duckworth, A Dosovitskiy, J Uszkoreit, T Funkhouser, A Tagliasacchi
[Google Research]
Scene Representation Transformer: geometry-free novel view synthesis through set-latent scene representations. A classical problem in computer vision is to infer a 3D scene representation from a few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations such as textured meshes, or implicit representations such as radiance fields, and typically requires input images with precise camera poses and long processing times for each new scene. This paper proposes the Scene Representation Transformer (SRT), which processes posed or unposed RGB images of a new region, infers a "set-latent scene representation", and synthesizes novel views, all in a single feed-forward pass. To compute the scene representation, the paper proposes a generalization of the Vision Transformer to sets of images, enabling global information integration and hence 3D reasoning. An efficient decoder parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing the novel-view reconstruction error. The method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper, and SRT is shown to scale to interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
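A minimal PyTorch sketch of the data flow described in the abstract: a shared patch embedding over a set of input images, a transformer encoder producing set-latent tokens, and a decoder that attends into those tokens with per-ray queries to produce colors. All layer sizes, the stride-8 convolution, and the 6-dimensional ray parameterization are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinySRT(nn.Module):
    """Toy sketch of the SRT data flow: images -> set-latent tokens -> ray-queried colors."""

    def __init__(self, dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # Patch embedding shared across all input images (stride-8 conv as a stand-in).
        self.patchify = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Decoder side: a query ray (origin + direction) attends into the set-latent tokens.
        self.ray_embed = nn.Linear(6, dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.to_rgb = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, images, rays):
        # images: (B, N_views, 3, H, W); rays: (B, N_rays, 6)
        b, n, c, h, w = images.shape
        tokens = self.patchify(images.flatten(0, 1))       # (B*N, dim, h', w')
        tokens = tokens.flatten(2).transpose(1, 2)          # (B*N, patches, dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1])    # pool all views into one set
        scene = self.encoder(tokens)                        # set-latent scene representation
        q = self.ray_embed(rays)                            # (B, N_rays, dim)
        feat, _ = self.cross_attn(q, scene, scene)          # attend into the scene tokens
        return self.to_rgb(feat)                            # per-ray RGB

imgs = torch.randn(1, 5, 3, 64, 64)   # five unposed input views
rays = torch.randn(1, 1024, 6)        # query rays for a novel view
print(TinySRT()(imgs, rays).shape)    # torch.Size([1, 1024, 3])
```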
3、[CV] SegDiff: Image Segmentation with Diffusion Probabilistic Models
T Amit, E Nachmani, T Shaharbany, L Wolf
[Tel-Aviv University & Facebook AI Research]
SegDiff: image segmentation with diffusion probabilistic models. Diffusion probabilistic methods are used for state-of-the-art image generation. This paper proposes a way to extend such models to perform image segmentation. The method is learned end-to-end and does not rely on a pre-trained backbone. Information from the input image and from the current estimate of the segmentation map is merged by summing the outputs of two encoders. Additional encoding layers and a decoder then iteratively refine the segmentation map with a diffusion model. Since the diffusion model is probabilistic, it is applied multiple times and the results are merged into a final segmentation map. The new method obtains state-of-the-art results on the Cityscapes validation set, the Vaihingen building segmentation benchmark, and the MoNuSeg dataset.
Diffusion Probabilistic Methods are employed for state-of-the-art image generation. In this work, we present a method for extending such models for performing image segmentation. The method learns end-to-end, without relying on a pre-trained backbone. The information in the input image and in the current estimation of the segmentation map is merged by summing the output of two encoders. Additional encoding layers and a decoder are then used to iteratively refine the segmentation map using a diffusion model. Since the diffusion model is probabilistic, it is applied multiple times and the results are merged into a final segmentation map. The new method obtains state-of-the-art results on the Cityscapes validation set, the Vaihingen building segmentation benchmark, and the MoNuSeg dataset.
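A toy sketch of the fusion and refinement loop described above: the image and the current segmentation estimate are encoded separately, fused by summation, decoded into a refined map, and the procedure is run several times with the results averaged. The tiny conv stacks and the simplified update rule (no explicit noise schedule) are assumptions for brevity, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SegDiffSketch(nn.Module):
    """Illustrative sketch: image encoder and segmentation-estimate encoder
    are fused by summation, then decoded into a refined segmentation map."""

    def __init__(self, ch=32, n_classes=1):
        super().__init__()
        self.image_enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.seg_enc = nn.Sequential(nn.Conv2d(n_classes, ch, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(ch, n_classes, 3, padding=1))

    def forward(self, image, seg_estimate):
        fused = self.image_enc(image) + self.seg_enc(seg_estimate)  # merge by summation
        return self.decoder(fused)                                  # refined estimate

@torch.no_grad()
def segment(model, image, steps=10, samples=3):
    """Run the iterative refinement several times and average the results,
    mirroring the 'apply multiple times and merge' step in the abstract."""
    outs = []
    for _ in range(samples):
        seg = torch.randn(image.shape[0], 1, *image.shape[2:])  # start from noise
        for _ in range(steps):
            seg = model(image, seg)                              # one refinement step
        outs.append(torch.sigmoid(seg))
    return torch.stack(outs).mean(dim=0)                         # merged segmentation map

mask = segment(SegDiffSketch(), torch.randn(1, 3, 64, 64))
print(mask.shape)  # torch.Size([1, 1, 64, 64])
```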
4、[CV] DenseCLIP: Extract Free Dense Labels from CLIP
C Zhou, C C Loy, B Dai
[Nanyang Technological University]
DenseCLIP: extracting free dense labels from CLIP. Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage pre-trained CLIP models for image-level classification and manipulation. This paper further explores the potential of CLIP for pixel-level dense prediction, specifically semantic segmentation. The proposed method, DenseCLIP, yields reasonable segmentation results on open concepts across various datasets without any annotation or fine-tuning. By adding pseudo labeling and self-training, DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins; for example, the mIoU of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff improves from 35.6/20.7/30.3 to 86.1/66.7/54.7. The robustness of DenseCLIP under input corruption is also tested, and its ability to discriminate fine-grained objects and novel concepts is evaluated. The experiments suggest that DenseCLIP can serve as a new, reliable source of supervision for dense prediction tasks, enabling annotation-free segmentation.
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we further explore the potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. Our method, DenseCLIP, in the absence of annotations and fine-tuning, yields reasonable segmentation results on open concepts across various datasets. By adding pseudo labeling and self-training, DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of DenseCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our findings suggest that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation.
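A self-contained sketch of the pixel-level matching idea behind this kind of approach: per-pixel image features are compared against text embeddings of class names, each pixel takes the best-matching class, and the soft scores could serve as pseudo labels for self-training. Random tensors stand in for real CLIP features so the snippet runs without the CLIP package, and the temperature value and feature dimension are assumptions.

```python
import torch
import torch.nn.functional as F

def dense_label_from_features(pixel_feats, text_feats, temperature=0.01):
    """Assign each pixel the class whose text embedding it matches best.

    pixel_feats: (H, W, D) per-pixel image features (in DenseCLIP these come
                 from a modified CLIP image encoder; random stand-ins here).
    text_feats:  (C, D) embeddings of class-name prompts from the text encoder.
    Returns a (H, W) map of class indices and a (H, W, C) soft label map.
    """
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = pixel_feats @ text_feats.t() / temperature   # cosine similarity per pixel
    soft_labels = logits.softmax(dim=-1)                  # pseudo labels for self-training
    return logits.argmax(dim=-1), soft_labels

# Demo with random stand-ins for CLIP features (D=512, 3 classes).
hard, soft = dense_label_from_features(torch.randn(32, 32, 512), torch.randn(3, 512))
print(hard.shape, soft.shape)  # torch.Size([32, 32]) torch.Size([32, 32, 3])
```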
5、[CL] Learning to Recombine and Resample Data for Compositional Generalization
E Akyürek, A F Akyürek, J Andreas
[MIT CSAIL & Boston University]
Learning to recombine and resample data for compositional generalization. Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data, particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (such as grammars or automata) essential in these settings. This paper proposes R&R, a learned data augmentation scheme that enables a large class of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model, and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization on two language processing problems, instruction following and morphological analysis, where R&R enables learning of new constructions and tenses from as few as eight initial examples.
Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data—particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems—instruction following (SCAN) and morphological analysis (SIGMORPHON 2018)—where R&R enables learning of new constructions and tenses from as few as eight initial examples.
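A toy sketch of the recombine-and-resample idea on SCAN-style (instruction, action) pairs: recombination swaps aligned lexical fragments between training examples to produce new compositions, and resampling upweights augmented examples containing rare output tokens. The hand-written lexicon and the rarity heuristic are crude stand-ins for the paper's prototype-based generative model, not its actual procedure.

```python
import random
from collections import Counter

# Toy SCAN-style training pairs (instruction -> action sequence).
train = [
    ("jump twice", "JUMP JUMP"),
    ("walk twice", "WALK WALK"),
    ("jump left", "LTURN JUMP"),
    ("walk left", "LTURN WALK"),
]

# Hypothetical fragment lexicon aligning input tokens to output tokens.
lexicon = {"jump": "JUMP", "walk": "WALK", "run": "RUN"}

def recombine(example, n=20):
    """Recombine: substitute an aligned fragment of a 'prototype' example with
    another lexical item, yielding compositions unseen in the original data."""
    out = []
    for _ in range(n):
        src, tgt = example
        old = random.choice([w for w in src.split() if w in lexicon])
        new = random.choice(list(lexicon))
        out.append((src.replace(old, new), tgt.replace(lexicon[old], lexicon[new])))
    return out

def resample(examples, k=10):
    """Resample: upweight examples containing rare output tokens to encourage
    extrapolation beyond the original data distribution."""
    counts = Counter(tok for _, tgt in examples for tok in tgt.split())
    weights = [1.0 / min(counts[t] for t in tgt.split()) for _, tgt in examples]
    return random.choices(examples, weights=weights, k=k)

augmented = [ex for proto in train for ex in recombine(proto)]
print(resample(augmented)[:3])  # augmented pairs such as ('run twice', 'RUN RUN')
```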
Other papers worth noting:
[CV] Learning Neural Light Fields with Ray-Space Embedding Networks
B Attal, J Huang, M Zollhoefer, J Kopf, C Kim
[CMU & Meta & Reality Labs Research]
[LG] Improving Nonparametric Classification via Local Radial Regression with an Application to Stock Prediction
R Cao, A Okuno, K Nakagawa, H Shimodaira
[Kyoto University & The Institute of Statistical Mathematics & NOMURA Asset Management Co., Ltd]
[RO] Fast and Feature-Complete Differentiable Physics for Articulated Rigid Bodies with Contact
K Werling, D Omens, J Lee, I Exarchos, C K Liu
[Stanford University & Robotics AI, Amazon]
[LG] Dynamic Environments with Deformable Objects
R Antonova, P Shi, H Yin, Z Weng, D K Jensfelt
[Stanford University & KTH]