LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: the Beyond the Imitation Game benchmark (BIG-bench); efficient self-supervised vision pretraining with local masked reconstruction; layered depth refinement with mask guidance; separable self-attention for mobile vision transformers; volumetric disentanglement for 3D scene manipulation; generating aligned samples across multiple domains with learned morph maps; effective image generation with a contextual RQ-Transformer; accelerating score-based generative models for high-resolution image synthesis; explicit regularization in overparametrized models via noise injection.

 

1. [CL] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

A Srivastava, A Rastogi, A Rao, A A M Shoeb, A Abid, A Fisch...

Beyond the Imitation Game benchmark (BIG-bench): quantifying and extrapolating the capabilities of language models. As language models scale up, they improve quantitatively and acquire new qualitative capabilities, yet these capabilities remain poorly characterized; understanding present and near-future abilities and limitations matters for guiding research, preparing for disruptive new capabilities, and mitigating social harms. BIG-bench comprises 204 tasks contributed by 442 authors across 132 institutions, drawing on linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and more, with a focus on problems believed to be beyond current language models. The paper evaluates OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers at scales from millions to hundreds of billions of parameters, with a team of human expert raters providing a strong baseline. Findings: performance and calibration improve with scale but remain poor in absolute terms and relative to the raters; behavior is remarkably similar across model classes, though sparsity helps; tasks that improve gradually and predictably tend to hinge on knowledge or memorization, whereas tasks showing "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; and social bias typically grows with scale in ambiguous contexts but can be reduced through prompting.

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI’s GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit “breakthrough” behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

https://arxiv.org/abs/2206.04615

 

2. [CV] Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

J Chen, M Hu, B Li, M Elhoseiny

[King Abdullah University of Science and Technology (KAUST) & Nanyang Technological University]

Efficient self-supervised vision pretraining with local masked reconstruction. Self-supervised learning has greatly improved downstream vision tasks such as image classification, semantic segmentation, and object detection, and generative approaches such as MAE and BEiT perform particularly well, but their global masked-reconstruction mechanism is computationally demanding. LoMaR (local masked reconstruction) instead reconstructs masked content within small windows of 7×7 patches on a plain Transformer encoder, giving a better efficiency-accuracy trade-off than reconstructing over the entire image. LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, 0.5% above MAE, and 85.4% after finetuning on 384×384 images, 0.6% above MAE; on MS COCO it outperforms MAE by 0.5 AP on object detection and 0.5 AP on instance segmentation. It is especially efficient for high-resolution pretraining, e.g. 3.1× faster than MAE with 0.2% higher classification accuracy when pretraining on 448×448 images, and the local reconstruction mechanism can easily be integrated into other generative self-supervised methods.

Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7×7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning the pretrained LoMaR on 384×384 images, it can reach 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 AP on object detection and 0.5 AP on instance segmentation. LoMaR is especially more computation-efficient on pretraining high-resolution images, e.g., it is 3.1× faster than MAE with 0.2% higher classification accuracy on pretraining 448×448 images. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code will be publicly available.

https://arxiv.org/abs/2206.00790
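
The local reconstruction idea can be made concrete with a short sketch. The following is a minimal, self-contained PyTorch illustration of masking and reconstructing patches inside a single 7×7-patch window with a small Transformer encoder; the single sampled window (the paper uses several), masking ratio, encoder size, and loss details are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class LocalMaskedReconstruction(nn.Module):
    def __init__(self, patch=16, window=7, dim=192, depth=4, heads=3, mask_ratio=0.8):
        super().__init__()
        self.patch, self.window, self.mask_ratio = patch, window, mask_ratio
        self.embed = nn.Linear(3 * patch * patch, dim)           # patch pixels -> token
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # learned [MASK] embedding
        self.pos = nn.Parameter(torch.zeros(1, window * window, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)            # token -> patch pixels

    def forward(self, imgs):
        B, _, H, W = imgs.shape
        p, w = self.patch, self.window
        # Split the image into non-overlapping patches: (B, n_patches, 3*p*p).
        patches = imgs.unfold(2, p, p).unfold(3, p, p)           # B, 3, H/p, W/p, p, p
        gh, gw = patches.shape[2], patches.shape[3]
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, gh * gw, -1)
        # Sample one random w-by-w window of patches (the paper samples several).
        top = torch.randint(0, gh - w + 1, (1,)).item()
        left = torch.randint(0, gw - w + 1, (1,)).item()
        idx = (torch.arange(top, top + w)[:, None] * gw +
               torch.arange(left, left + w)[None, :]).reshape(-1)
        local = patches[:, idx]                                  # B, w*w, 3*p*p
        # Randomly mask a subset of patches inside the window only.
        masked = torch.zeros(w * w, dtype=torch.bool)
        masked[torch.randperm(w * w)[: int(self.mask_ratio * w * w)]] = True
        tokens = self.embed(local) + self.pos
        # Masked positions are replaced by the mask token but stay in the encoder
        # input; the window is small, so this is cheap compared to a global pass.
        tokens = torch.where(masked[None, :, None], self.mask_token, tokens)
        pred = self.head(self.encoder(tokens))
        # Pixel reconstruction loss on the masked patches only.
        return ((pred[:, masked] - local[:, masked]) ** 2).mean()

loss = LocalMaskedReconstruction()(torch.randn(2, 3, 224, 224))
loss.backward()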

 

3. [CV] Layered Depth Refinement with Mask Guidance

S Y Kim, J Zhang, S Niklaus, Y Fan, S Chen, Z Lin, M Kim

[KAIST & Adobe]

Layered depth refinement with mask guidance. Depth maps are used widely, from 3D rendering to 2D image effects such as bokeh, but those predicted by single-image depth estimation (SIDE) models often miss isolated holes in objects and have inaccurate boundary regions, whereas high-quality masks are far easier to obtain via commercial auto-masking tools, off-the-shelf segmentation and matting methods, or manual editing. The paper formulates a new problem, mask-guided depth refinement, in which a generic mask is used to refine a SIDE model's depth prediction. The framework performs layered refinement and inpainting/outpainting, decomposing the depth map into two separate layers signified by the mask and the inverse mask. Because datasets with both depth and mask annotations are scarce, a self-supervised training scheme uses arbitrary masks together with RGB-D datasets. Experiments show the method is robust to different mask types and initial depth predictions and accurately refines depth values in inner and outer mask boundary regions; an ablation study and results on real applications are also provided.

Depth maps are used in a wide range of applications from 3D rendering to 2D image effects such as Bokeh. However, those predicted by single image depth estimation (SIDE) models often fail to capture isolated holes in objects and/or have inaccurate boundary regions. Meanwhile, high-quality masks are much easier to obtain, using commercial automasking tools or off-the-shelf methods of segmentation and matting or even by manual editing. Hence, in this paper, we formulate a novel problem of mask-guided depth refinement that utilizes a generic mask to refine the depth prediction of SIDE models. Our framework performs layered refinement and inpainting/outpainting, decomposing the depth map into two separate layers signified by the mask and the inverse mask. As datasets with both depth and mask annotations are scarce, we propose a self-supervised learning scheme that uses arbitrary masks and RGB-D datasets. We empirically show that our method is robust to different types of masks and initial depth predictions, accurately refining depth values in inner and outer mask boundary regions. We further analyze our model with an ablation study and demonstrate results on real applications. More information can be found on our project page.

https://arxiv.org/abs/2206.03048
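
A structural sketch of the decompose-complete-recompose idea described above follows: the initial depth is split into a foreground layer and a background layer with the mask, each layer is completed by a network, and the refined depth is recomposed at the mask boundary. The tiny convolutional network and its inputs are placeholders, not the authors' architecture or training losses.

import torch
import torch.nn as nn

class LayerCompletion(nn.Module):
    """Placeholder network: takes RGB, one depth layer, and that layer's mask."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1 + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, rgb, depth_layer, layer_mask):
        return self.net(torch.cat([rgb, depth_layer, layer_mask], dim=1))

def refine_depth(rgb, init_depth, mask, fg_net, bg_net):
    """rgb: (B,3,H,W), init_depth: (B,1,H,W), mask: (B,1,H,W) in {0,1}."""
    fg_layer = init_depth * mask            # depth restricted to the masked object
    bg_layer = init_depth * (1 - mask)      # depth restricted to the background
    fg_full = fg_net(rgb, fg_layer, mask)       # inpaint/outpaint the foreground layer
    bg_full = bg_net(rgb, bg_layer, 1 - mask)   # inpaint/outpaint the background layer
    # Recompose: keep each completed layer on its own side of the mask boundary.
    return mask * fg_full + (1 - mask) * bg_full

rgb = torch.rand(1, 3, 128, 128)
depth = torch.rand(1, 1, 128, 128)
mask = (torch.rand(1, 1, 128, 128) > 0.5).float()
refined = refine_depth(rgb, depth, mask, LayerCompletion(), LayerCompletion())
print(refined.shape)  # torch.Size([1, 1, 128, 128])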

 

4. [CV] Separable Self-attention for Mobile Vision Transformers

S Mehta, M Rastegari

[Apple]

Separable self-attention for mobile vision transformers. MobileViT achieves state-of-the-art performance on several mobile vision tasks, including classification and detection, but despite having fewer parameters it has higher latency than convolutional models. The main bottleneck is the multi-head self-attention (MHA) in the transformer blocks, which costs O(k²) time with respect to the number of tokens (or patches) k and relies on expensive operations such as batch-wise matrix multiplication, hurting latency on resource-constrained devices. The paper introduces a separable self-attention with linear complexity, O(k), that computes attention using element-wise operations, making it a good fit for such devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection; with about three million parameters it reaches 75.6% top-1 accuracy on ImageNet, about 1% better than MobileViT, while running 3.2× faster on a mobile device.

Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires O(k²) time complexity with respect to the number of tokens (or patches) k. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. O(k). A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running 3.2× faster on a mobile device. Our source code is available at: https://github.com/apple/ml-cvnets

https://arxiv.org/abs/2206.02680
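
A compact PyTorch sketch of the separable self-attention described in the abstract: each token receives a scalar context score, the score-weighted sum of keys forms a single context vector, and that vector is broadcast element-wise over ReLU-gated values, so the cost grows linearly in the token count. Plain Linear layers on a (batch, tokens, dim) tensor are used here for clarity; the released ml-cvnets code linked above is authoritative and may differ in details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)     # per-token context score
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, k, d)
        # Context scores: softmax over the k tokens -> (B, k, 1). O(k) work.
        scores = F.softmax(self.to_scores(x), dim=1)
        # Context vector: score-weighted sum of keys -> (B, 1, d).
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)
        # Broadcast the global context over ReLU-gated values (element-wise, O(k)).
        out = F.relu(self.to_value(x)) * context
        return self.out(out)

attn = SeparableSelfAttention(64)
y = attn(torch.randn(2, 256, 64))              # (2, 256, 64); cost is linear in k=256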

 

5. [CV] Volumetric Disentanglement for 3D Scene Manipulation

S Benaim, F Warburg, P E Christensen, S Belongie

[University of Copenhagen & Technical University of Denmark]

Volumetric disentanglement for 3D scene manipulation. Advances in differentiable volumetric rendering have enabled photo-realistic, finely detailed reconstruction of complex 3D scenes, which is key for many virtual-reality applications; in augmented reality, however, one may also want to semantically manipulate or augment objects within a scene. The proposed volumetric framework (i) disentangles the volumetric representation of a given foreground object from the background and (ii) semantically manipulates the foreground object as well as the background. It takes as input a set of 2D masks specifying the desired foreground object in the training views, along with the associated 2D views and poses, and produces a foreground-background disentanglement that respects surrounding illumination, reflections, and partial occlusions and can be applied to both training and novel views. The method enables separate control of pixel color and depth, as well as 3D similarity transformations, for both the foreground and background objects, and is demonstrated on downstream manipulation tasks including object camouflage, non-negative 3D object inpainting, 3D object translation, 3D object inpainting, and 3D text-based object manipulation.

Recently, advances in differential volumetric rendering enabled significant breakthroughs in the photo-realistic and fine-detailed reconstruction of complex 3D scenes, which is key for many virtual reality applications. However, in the context of augmented reality, one may also wish to effect semantic manipulations or augmentations of objects within a scene. To this end, we propose a volumetric framework for (i) disentangling or separating, the volumetric representation of a given foreground object from the background, and (ii) semantically manipulating the foreground object, as well as the background. Our framework takes as input a set of 2D masks specifying the desired foreground object for training views, together with the associated 2D views and poses, and produces a foreground-background disentanglement that respects the surrounding illumination, reflections, and partial occlusions, which can be applied to both training and novel views. Our method enables the separate control of pixel color and depth as well as 3D similarity transformations of both the foreground and background objects. We subsequently demonstrate the applicability of our framework on a number of downstream manipulation tasks including object camouflage, non-negative 3D object inpainting, 3D object translation, 3D object inpainting, and 3D text-based object manipulation. Full results are given in our project webpage at https://sagiebenaim.github.io/volumetric-disentanglement/

https://arxiv.org/abs/2206.02776  
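
To illustrate why a disentangled foreground volume can be edited or removed independently, here is a generic NeRF-style sketch that composites a foreground field and a background field along one camera ray. It shows only the general mechanism of volumetric composition and is not claimed to be the paper's exact formulation.

import torch

def composite_ray(sigma_fg, rgb_fg, sigma_bg, rgb_bg, deltas):
    """sigma_*: (N,) densities, rgb_*: (N,3) colors, deltas: (N,) sample spacings."""
    sigma = sigma_fg + sigma_bg                          # densities add along the ray
    # Color at each sample: density-weighted mix of the two fields.
    w_fg = sigma_fg / (sigma + 1e-8)
    rgb = w_fg[:, None] * rgb_fg + (1 - w_fg)[:, None] * rgb_bg
    alpha = 1 - torch.exp(-sigma * deltas)               # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)           # final pixel color

N = 64
deltas = torch.full((N,), 0.05)
fg_sigma, fg_rgb = torch.rand(N), torch.rand(N, 3)
bg_sigma, bg_rgb = torch.rand(N), torch.rand(N, 3)
pixel_with_fg = composite_ray(fg_sigma, fg_rgb, bg_sigma, bg_rgb, deltas)
# Removing (or transforming) the object only touches the foreground field:
pixel_no_fg = composite_ray(torch.zeros(N), torch.zeros(N, 3), bg_sigma, bg_rgb, deltas)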

 

A few more papers worth noting:

 

[CV] Polymorphic-GAN: Generating Aligned Samples across Multiple Domains with Learned Morph Maps

S W Kim, K Kreis, D Li, A Torralba, S Fidler

[NVIDIA & MIT]

https://arxiv.org/abs/2206.02903

 

[CV] Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer

D Lee, C Kim, S Kim, M Cho, W Han

[POSTECH & Kakao Brain]

https://arxiv.org/abs/2206.04452

 

[CV] Accelerating Score-based Generative Models for High-Resolution Image Synthesis

H Ma, L Zhang, X Zhu, J Zhang, J Feng

[Fudan University & University of Surrey & RIKEN]

https://arxiv.org/abs/2206.04029

 

[LG] Explicit Regularization in Overparametrized Models via Noise Injection

A Orvieto, A Raj, H Kersting, F Bach

[ETH Zürich & University of Illinois Urbana-Champaign & PSL Research University]

https://arxiv.org/abs/2206.04613

 

 

If any images included in this content raise copyright concerns, please contact us promptly so they can be removed.