LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics

Reposted from 爱可可爱生活 (Weibo)

1. [CV] Advances in Neural Rendering

A Tewari, J Thies...

[MPI for Informatics & MPI for Intelligent Systems & Google Research & ETH Zurich & Reality Labs Research & MIT & Technical University of Munich & Stanford University]

A survey of advances in neural rendering. Neural rendering combines ideas from classical computer graphics and machine learning to synthesize images from real-world observations; this state-of-the-art report focuses on methods that pair classical rendering principles with learned 3D scene representations (now often called neural scene representations), which are 3D-consistent by design and enable applications such as novel-view synthesis of captured scenes. Beyond static scenes, the report covers representations for non-rigidly deforming objects, scene editing and composition, and techniques that generalize across object classes for generative tasks, together with an overview of fundamental concepts and a discussion of open challenges and social implications.

Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take specifically defined representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects and scene editing and composition. While most of these approaches are scene-specific, we also discuss techniques that generalize across object classes and can be used for generative tasks. In addition to reviewing these state-of-the-art methods, we provide an overview of fundamental concepts and definitions used in the current literature. We conclude with a discussion on open challenges and social implications.
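Many of the neural scene representations the report covers are rendered with a differentiable volumetric quadrature. The sketch below is our own illustration (not code from the report) of this NeRF-style compositing step; the function name and toy inputs are placeholders.

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """NeRF-style quadrature: composite per-sample densities and colors
    along a ray into a single pixel color.

    sigmas: (N,) non-negative volume densities at the N ray samples
    colors: (N, 3) RGB values predicted at those samples
    deltas: (N,) distances between consecutive samples
    """
    # Opacity contributed by each sample segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans                       # per-sample compositing weights
    rgb = (weights[:, None] * colors).sum(axis=0)  # expected color along the ray
    return rgb, weights

# Toy usage: 64 samples along one ray.
rng = np.random.default_rng(0)
sigmas = rng.uniform(0.0, 2.0, 64)
colors = rng.uniform(0.0, 1.0, (64, 3))
deltas = np.full(64, 1.0 / 64)
rgb, _ = volume_render(sigmas, colors, deltas)
print(rgb)
```

Because every operation here is differentiable, a photometric loss on the rendered pixel can be backpropagated into the scene representation, which is exactly the inverse-rendering setup the report describes.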

https://weibo.com/1402400261/L150db9qH

2. [LG] Gradients are Not All You Need

L Metz, C D Freeman, S S Schoenholz, T Kachman

[Google Research & Radboud University]

Gradients are not all you need. Differentiable programming is powerful but has limits: this short report describes a common chaos-based failure mode that appears across differentiable settings, from recurrent neural networks and numerical physics simulation to training learned optimizers. The failure is traced to the spectrum of the Jacobian of the system under study, and the authors give criteria for when practitioners should expect it to spoil their gradient-based optimization.

Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos-based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation-based optimization algorithms.
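To make the failure mode concrete, here is a minimal sketch (our own, not from the paper) of differentiating through an unrolled iterated map. The logistic map's per-step Jacobian is r(1 - 2x); when the dynamics are chaotic, the product of these Jacobians, i.e. the gradient of the final state with respect to the initial state, grows exponentially with the number of unrolled steps.

```python
import numpy as np

def unrolled_gradient(x0, r, steps):
    """Iterate the logistic map x_{t+1} = r * x_t * (1 - x_t) and accumulate
    d x_T / d x_0 by the chain rule (product of per-step Jacobians)."""
    x, grad = x0, 1.0
    for _ in range(steps):
        grad *= r * (1.0 - 2.0 * x)   # local derivative d x_{t+1} / d x_t
        x = r * x * (1.0 - x)
    return x, grad

for steps in (10, 50, 100):
    _, g_stable  = unrolled_gradient(0.3, r=2.5, steps=steps)  # converges to a fixed point
    _, g_chaotic = unrolled_gradient(0.3, r=3.9, steps=steps)  # chaotic regime
    print(f"T={steps:4d}  stable grad={g_stable:.3e}  chaotic grad={g_chaotic:.3e}")
```

In the stable setting (r=2.5) the accumulated gradient shrinks toward zero, while in the chaotic setting (r=3.9) its magnitude explodes with the horizon, which is the behaviour the authors tie to the spectrum of the Jacobian.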

https://weibo.com/1402400261/L154fE24C

3. [CV] Palette: Image-to-Image Diffusion Models

C Saharia, W Chan, H Chang, C A Lee, J Ho, T Salimans, D J Fleet, M Norouzi

[Google Research]

Palette: image-to-image diffusion models. Palette is a simple, general framework for image-to-image translation with conditional diffusion models; on four challenging tasks (colorization, inpainting, uncropping, and JPEG decompression) it outperforms strong GAN and regression baselines and sets a new state of the art, without task-specific hyper-parameter tuning, architecture customization, or auxiliary losses. The paper analyzes how the choice of L2 vs. L1 loss in the denoising diffusion objective affects sample diversity, shows the importance of self-attention through architecture ablations, and advocates a unified ImageNet-based evaluation protocol reporting FID, Inception Score, classification accuracy of a pre-trained ResNet-50, and perceptual distance to reference images. A single generalist Palette model trained on three tasks (colorization, inpainting, JPEG decompression) performs as well as or better than task-specific specialists.

We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models. On four challenging image-to-image translation tasks (colorization, inpainting, uncropping, and JPEG decompression), Palette outperforms strong GAN and regression baselines, and establishes a new state of the art. This is accomplished without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss, demonstrating a desirable degree of generality and flexibility. We uncover the impact of using L2 vs. L1 loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention through empirical architecture studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, and report several sample quality scores including FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against reference images for various baselines. We expect this standardized evaluation protocol to play a critical role in advancing image-to-image translation research. Finally, we show that a single generalist Palette model trained on 3 tasks (colorization, inpainting, JPEG decompression) performs as well or better than task-specific specialist counterparts. Check out https://bit.ly/palette-diffusion for more details.
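For orientation, the sketch below shows a generic conditional denoising-diffusion training loss of the kind Palette builds on; it is not Palette's implementation. The tiny Conv2d standing in for the U-Net denoiser, the omission of timestep embeddings, and the schedule values are simplifications for illustration, while the p argument switches between the L1 and L2 objectives whose effect on sample diversity the paper studies.

```python
import torch
import torch.nn as nn

# Hypothetical tiny denoiser standing in for a U-Net; it takes the noisy target
# concatenated with the conditioning image and predicts the injected noise.
# (A real model would also receive a timestep embedding, omitted here.)
denoiser = nn.Conv2d(6, 3, kernel_size=3, padding=1)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative noise schedule

def diffusion_loss(x0, cond, p=1):
    """One step of a conditional denoising-diffusion objective:
    predict the injected noise from (noisy image, conditioning image).
    p=1 gives the L1 variant, p=2 the L2 variant."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    a = alpha_bars[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps        # forward (noising) process
    eps_hat = denoiser(torch.cat([x_t, cond], dim=1))   # conditional noise prediction
    return (eps_hat - eps).abs().pow(p).mean()

x0 = torch.randn(4, 3, 32, 32)    # target images
cond = torch.randn(4, 3, 32, 32)  # conditioning inputs (e.g., grayscale or masked image)
print(diffusion_loss(x0, cond, p=1).item())
```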

https://weibo.com/1402400261/L157OqiPS

4. [CV] Are Transformers More Robust Than CNNs?

Y Bai, J Mei, A Yuille, C Xie

[Johns Hopkins University & University of California, Santa Cruz]

Are Transformers more robust than CNNs? Recent work argues that Transformers are much more robust than convolutional neural networks, but those conclusions come from unfair experimental settings in which Transformers and CNNs are compared at different scales and trained with different frameworks. Using a unified training setup, this paper provides the first fair, in-depth robustness comparison: CNNs can be just as robust as Transformers against adversarial attacks if they properly adopt the Transformers' training recipes, and for out-of-distribution generalization, pre-training on (external) large-scale datasets is not a fundamental requirement for Transformers to outperform CNNs. Ablations suggest the stronger generalization stems mainly from the Transformer's self-attention-like architecture itself rather than from other training setups.

Transformer emerges as a powerful tool for visual recognition. In addition to demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutional Neural Networks (CNNs). Nonetheless, surprisingly, we find these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and are applied with distinct training frameworks. In this paper, we aim to provide the first fair & in-depth comparisons between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can easily be as robust as Transformers on defending against adversarial attacks, if they properly adopt Transformers' training recipes. Regarding generalization on out-of-distribution samples, we show pretraining on (external) large-scale datasets is not a fundamental requirement for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest such stronger generalization largely benefits from the Transformer's self-attention-like architectures per se, rather than from other training setups. We hope this work can help the community better understand and benchmark the robustness of Transformers and CNNs. The code and models are publicly available at https://github.com/ytongbai/ViTs-vs-CNNs.
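The adversarial-robustness comparison relies on standard white-box attacks. Below is a minimal PGD (projected gradient descent) evaluation sketch under an L-infinity budget; it is a generic illustration rather than the paper's exact protocol, and the stand-in linear classifier, epsilon, and step sizes are placeholders where a ViT or CNN and the paper's settings would go.

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=4/255, alpha=1/255, steps=10):
    """Projected gradient descent under an L-infinity budget: the standard
    white-box attack used to measure adversarial robustness."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)   # random start in the eps-ball
    x_adv = x_adv.clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()       # gradient-ascent step
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                          # keep valid pixel range
    return x_adv.detach()

# Toy usage with a stand-in classifier (a ViT or a CNN would plug in here).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
x_adv = pgd_attack(model, x, y)
robust_acc = (model(x_adv).argmax(1) == y).float().mean()
print(f"robust accuracy on this batch: {robust_acc:.2f}")
```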

https://weibo.com/1402400261/L15aBh2cH

5. [CL] Prune Once for All: Sparse Pre-Trained Language Models

O Zafrir, A Larey, G Boudoukh, H Shen, M Wasserblat

[Intel Labs]

Prune once for all: sparse pre-trained language models. The paper presents a method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation; the resulting sparse pre-trained models can be used for transfer learning on a wide range of tasks while keeping their sparsity pattern. Demonstrated on BERT-Base, BERT-Large, and DistilBERT, the compressed sparse models transfer their knowledge to five downstream natural language tasks with minimal accuracy loss, and quantization-aware training further compresses the weights to 8-bit precision. For example, a sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8 bits achieves a 40X compression ratio for the encoder with less than 1% accuracy loss.

Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
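As a rough illustration of the pruning side only (not the paper's actual pipeline, which integrates pruning into pre-training together with distillation), the sketch below applies one-shot magnitude pruning and then re-applies the resulting binary masks after each optimizer step so that fine-tuning preserves the sparsity pattern; the function names and the 90% sparsity level are illustrative.

```python
import torch
import torch.nn as nn

def magnitude_prune(module, sparsity=0.9):
    """Zero out the smallest-magnitude weights and return the binary masks.
    The masks define the sparsity pattern kept fixed during downstream training."""
    masks = {}
    for name, p in module.named_parameters():
        if p.dim() < 2:          # skip biases / LayerNorm parameters
            continue
        k = int(p.numel() * sparsity)
        threshold = p.abs().flatten().kthvalue(k).values
        mask = (p.abs() > threshold).float()
        p.data.mul_(mask)        # prune in place
        masks[name] = mask
    return masks

def reapply_masks(module, masks):
    """Call after every optimizer step so updates cannot revive pruned weights
    (i.e., the sparsity pattern is preserved during fine-tuning)."""
    with torch.no_grad():
        for name, p in module.named_parameters():
            if name in masks:
                p.mul_(masks[name])

# Toy usage on a stand-in encoder weight matrix.
layer = nn.Linear(768, 768)
masks = magnitude_prune(layer, sparsity=0.9)
print("remaining nonzero fraction:", (layer.weight != 0).float().mean().item())
```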

https://weibo.com/1402400261/L15dgtXtC

A few more papers worth noting:

 

[CV] Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation


C Lin, Y Jiang, J Cai, L Qu, G Haffari, Z Yuan

[Monash University & ByteDance Inc]

https://weibo.com/1402400261/L15hypshI

[CV] CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval


Z Gao, J Liu, S Chen, D Chang, H Zhang, J Yuan

[Tencent PCG]

https://weibo.com/1402400261/L15iKgNcC

[CV] Structure from Silence: Learning Scene Structure from Ambient Sound


Z Chen, X Hu, A Owens

[University of Michigan]

https://weibo.com/1402400261/L15kcsDMz

[LG] Active Sampling for Linear Regression Beyond the L2 Norm


C Musco, C Musco, D P Woodruff, T Yasuda

[UMass Amherst & NYU & CMU]

https://weibo.com/1402400261/L15mg4EeI

 

If any images included in this content raise copyright concerns, please contact us promptly so they can be removed.