LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
Reposted from 爱可可爱生活
Summary: half-inverse gradients for physical deep learning; unsupervised sentence-pair modelling through self- and mutual-distillation; gradient inversion of vision transformers; unsupervised semantic segmentation of urban scenes via cross-modal distillation; fusing radiance fields for large-scale scene reconstruction; efficient classification of long documents using Transformers; open-vocabulary DETR with conditional matching; natural image generation with direct patch distributions matching; a large-scale multi-modal dataset for pre-training models
1、[LG] Half-Inverse Gradients for Physical Deep Learning
P Schnell, P Holl, N Thuerey
[Technical University of Munich]
Recent work in deep learning has shown that integrating differentiable physics simulators into the training process can greatly improve the quality of results. Although this combination poses a more complex optimization task than supervised neural-network training, the same gradient-based optimizers are typically used to minimize the loss function. However, the integrated physics solvers have a profound effect on the gradient flow, since rescaling magnitudes and directions is an inherent property of many physical processes. The gradient flow is therefore often highly unbalanced, creating a setting in which existing gradient-based optimizers perform poorly. This paper analyzes the characteristics of both physical and neural-network optimization and derives a new method that does not suffer from this phenomenon: it is based on a half-inversion of the Jacobian and combines principles of classical network optimizers and physics optimizers to solve the combined optimization task. Derived from an analysis of smooth transitions between gradient descent and the Gauss-Newton method, the new approach learns physical modes more effectively without overstraining the network with large weight updates, minimizing the learning objective faster and more accurately. Compared with state-of-the-art neural-network optimizers, it converges more quickly and yields better solutions, as demonstrated on three complex learning problems involving nonlinear oscillators, the Schrödinger equation, and the Poisson problem.
Recent works in deep learning have shown that integrating differentiable physics simulators into the training process can greatly improve the quality of results. Although this combination represents a more complex optimization task than supervised neural network training, the same gradient-based optimizers are typically employed to minimize the loss function. However, the integrated physics solvers have a profound effect on the gradient flow as manipulating scales in magnitude and direction is an inherent property of many physical processes. Consequently, the gradient flow is often highly unbalanced and creates an environment in which existing gradient-based optimizers perform poorly. In this work, we analyze the characteristics of both physical and neural network optimizations to derive a new method that does not suffer from this phenomenon. Our method is based on a half-inversion of the Jacobian and combines principles of both classical network and physics optimizers to solve the combined optimization task. Compared to state-of-the-art neural network optimizers, our method converges more quickly and yields better solutions, which we demonstrate on three complex learning problems involving nonlinear oscillators, the Schrödinger equation and the Poisson problem.
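To make the "half-inverse of the Jacobian" idea concrete, below is a minimal NumPy sketch (not the paper's implementation): the singular values of the batch Jacobian are raised to an exponent kappa, where kappa = +1 recovers the plain gradient-descent direction J^T r, kappa = -1 a Gauss-Newton / pseudo-inverse step, and kappa = -1/2 is the "half-inverse" compromise assumed here. The learning rate, truncation threshold, and toy least-squares problem are illustrative choices only.

```python
import numpy as np

def half_inverse_update(jacobian, residuals, lr=1.0, kappa=-0.5, eps=1e-6):
    """Compute a parameter update from the Jacobian's 'half-inverse'.

    jacobian : (m, n) array, d(residuals)/d(parameters) for the batch
    residuals: (m,) array, current loss residuals
    kappa    : exponent applied to the singular values;
               +1 recovers plain gradient descent (J^T r),
               -1 recovers a Gauss-Newton / pseudo-inverse step,
               -0.5 is the 'half-inverse' compromise (assumed here).
    """
    U, s, Vt = np.linalg.svd(jacobian, full_matrices=False)
    s = np.where(s > eps, s, eps)                 # guard tiny singular values
    J_kappa = Vt.T @ np.diag(s ** kappa) @ U.T    # (n, m) "half-inverted" Jacobian
    return -lr * J_kappa @ residuals              # update applied to the weights

# toy usage: linear residuals r(theta) = A @ theta - b
rng = np.random.default_rng(0)
A, b = rng.normal(size=(8, 3)), rng.normal(size=8)
theta = np.zeros(3)
for _ in range(50):
    theta += half_inverse_update(A, A @ theta - b, lr=0.5)
print("residual norm:", np.linalg.norm(A @ theta - b))
```

The exponent acts as a smooth interpolation knob between the two classical regimes the abstract mentions; in the real setting the Jacobian covers both the network and the differentiable physics solver.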
2、[CL] Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations
F Liu, Y Jiao, J Massiah, E Yilmaz, S Havrylov
[University of Cambridge & Amazon]
In NLP, a large number of tasks involve pairwise comparison between two sequences (e.g., sentence similarity and paraphrase identification). Sentence-pair tasks mainly come in two formulations: bi-encoders and cross-encoders. Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient, but they usually underperform cross-encoders. Cross-encoders can use their attention heads to exploit inter-sentence interactions for better performance, but they require task-specific fine-tuning and are computationally more expensive. This paper proposes TRANS-ENCODER, a fully unsupervised sentence-pair model that combines the two learning paradigms in an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders. Starting from a pre-trained language model (PLM), it is first converted into an unsupervised bi-encoder, and training then alternates between the bi-encoder and cross-encoder task formulations. In each alternation, one formulation produces pseudo-labels that serve as learning signals for the other. An extension runs this self-distillation on multiple PLMs in parallel and uses the average of their pseudo-labels for mutual distillation. To the best of the authors' knowledge, TRANS-ENCODER yields the first fully unsupervised cross-encoder and a state-of-the-art unsupervised bi-encoder for sentence similarity. Both the bi-encoder and cross-encoder formulations of TRANS-ENCODER outperform recently proposed state-of-the-art unsupervised sentence encoders such as Mirror-BERT and SimCSE by up to 5% on sentence-similarity benchmarks.
In NLP, a large volume of tasks involve pairwise comparison between two sequences (e.g., sentence similarity and paraphrase identification). Predominantly, two formulations are used for sentence-pair tasks: bi-encoders and cross-encoders. Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient, however, they usually underperform cross-encoders. Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance but they require task finetuning and are computationally more expensive. In this paper, we present a completely unsupervised sentence-pair model termed as TRANS-ENCODER that combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders. Specifically, on top of a pre-trained language model (PLM), we start with converting it to an unsupervised bi-encoder, and then alternate between the bi- and cross-encoder task formulations. In each alternation, one task formulation will produce pseudo-labels which are used as learning signals for the other task formulation. We then propose an extension to conduct such self-distillation approach on multiple PLMs in parallel and use the average of their pseudo-labels for mutual-distillation. TRANS-ENCODER creates, to the best of our knowledge, the first completely unsupervised cross-encoder and also a state-of-the-art unsupervised bi-encoder for sentence similarity. Both the bi-encoder and cross-encoder formulations of TRANS-ENCODER outperform recently proposed state-of-the-art unsupervised sentence encoders such as Mirror-BERT (Liu et al., 2021) and SimCSE (Gao et al., 2021) by up to 5% on the sentence similarity benchmarks. Code and models are released at https://github.com/amzn/trans-encoder.
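Below is a schematic, self-contained sketch of the alternating self-distillation loop described above. Tiny random stand-ins (a linear bi-encoder, a small MLP cross-encoder, random "sentence" features) replace the pre-trained language model; the MSE distillation loss and all names are assumptions rather than the paper's exact recipe, and the mutual-distillation extension (averaging pseudo-labels across several PLMs) is omitted.

```python
import torch
import torch.nn as nn

# Stand-in encoders: the real method starts from a pre-trained LM converted to a
# bi-encoder; here small random modules over fixed features keep the sketch runnable.
DIM = 32
bi_encoder = nn.Linear(DIM, DIM)                       # sentence feature -> embedding
cross_encoder = nn.Sequential(nn.Linear(2 * DIM, DIM), nn.ReLU(), nn.Linear(DIM, 1))

def bi_score(a, b):
    ea, eb = bi_encoder(a), bi_encoder(b)
    return torch.cosine_similarity(ea, eb, dim=-1)     # bi-encoder pair similarity

def cross_score(a, b):
    return cross_encoder(torch.cat([a, b], dim=-1)).squeeze(-1)  # cross-encoder score

pairs = [(torch.randn(DIM), torch.randn(DIM)) for _ in range(256)]  # unlabeled pairs
loader = torch.utils.data.DataLoader(pairs, batch_size=32)

for cycle in range(3):  # alternate between the two task formulations
    # 1) bi-encoder labels the pairs -> train the cross-encoder on the pseudo-labels
    opt = torch.optim.Adam(cross_encoder.parameters(), lr=1e-3)
    for a, b in loader:
        with torch.no_grad():
            pseudo = bi_score(a, b)
        loss = nn.functional.mse_loss(cross_score(a, b), pseudo)
        opt.zero_grad(); loss.backward(); opt.step()
    # 2) cross-encoder labels the pairs -> distil back into the bi-encoder
    opt = torch.optim.Adam(bi_encoder.parameters(), lr=1e-3)
    for a, b in loader:
        with torch.no_grad():
            pseudo = cross_score(a, b)
        loss = nn.functional.mse_loss(bi_score(a, b), pseudo)
        opt.zero_grad(); loss.backward(); opt.step()
```

Each half-cycle treats one formulation as a frozen teacher producing pseudo-labels and the other as the student, which is the core of the iterative joint framework.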
3、[CV] GradViT: Gradient Inversion of Vision Transformers
A Hatamizadeh, H Yin, H Roth, W Li, J Kautz, D Xu, P Molchanov
[NVIDIA]
This work demonstrates the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks, in which the original data batch is reconstructed given the model weights and the corresponding gradients. It introduces GradViT, a method that optimizes random noise into natural-looking images through an iterative process. The optimization objective consists of (i) a loss that matches the gradients, (ii) an image prior in the form of the distance to the batch-normalization statistics of a pretrained CNN model, and (iii) a total-variation regularization on patches to guide recovery to the correct locations. A dedicated loss-scheduling function is proposed to overcome local minima during optimization. GradViT is evaluated on the ImageNet1K and MS-Celeb-1M datasets, showing unprecedentedly high fidelity and closeness to the original (hidden) data. The analysis finds that, due to the attention mechanism, vision transformers are significantly more vulnerable than previously studied CNNs. The method sets a new state of the art for gradient inversion in both qualitative and quantitative metrics.
In this work we demonstrate the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks. During this attack, the original data batch is reconstructed given model weights and the corresponding gradients. We introduce a method, named GradViT, that optimizes random noise into naturally looking images via an iterative process. The optimization objective consists of (i) a loss on matching the gradients, (ii) image prior in the form of distance to batch-normalization statistics of a pretrained CNN model, and (iii) a total variation regularization on patches to guide correct recovery locations. We propose a unique loss scheduling function to overcome local minima during optimization. We evaluate GradViT on ImageNet1K and MS-Celeb-1M datasets, and observe unprecedentedly high fidelity and closeness to the original (hidden) data. During the analysis we find that vision transformers are significantly more vulnerable than previously studied CNNs due to the presence of the attention mechanism. Our method demonstrates new state-of-the-art results for gradient inversion in both qualitative and quantitative metrics. Project page at https://gradvit.github.io/.
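A hedged sketch of the three-term reconstruction objective described above, with toy stand-ins: a linear "victim" model instead of a ViT, a freshly initialized conv + BatchNorm layer instead of a pretrained CNN prior, image-level total variation instead of the paper's patch-wise term, and labels assumed known (the real attack also has to handle unknown labels and uses a dedicated loss-scheduling function). Loss weights and all names are arbitrary.

```python
import torch
import torch.nn as nn

# Victim model stand-in and an "observed" gradient from a hidden batch (toy sizes).
victim = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
hidden = torch.randn(4, 3, 32, 32)
target = torch.randint(0, 10, (4,))                    # assumed known for simplicity
loss_fn = nn.CrossEntropyLoss()
observed = torch.autograd.grad(loss_fn(victim(hidden), target), victim.parameters())

# Stand-in for a pretrained CNN whose BatchNorm statistics act as an image prior
# (here the BN buffers are just the defaults of a freshly initialized layer).
prior_cnn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
prior_cnn.eval()

def total_variation(x):
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

dummy = torch.randn(4, 3, 32, 32, requires_grad=True)  # noise optimized into images
opt = torch.optim.Adam([dummy], lr=0.1)

for step in range(200):
    grads = torch.autograd.grad(loss_fn(victim(dummy), target),
                                victim.parameters(), create_graph=True)
    l_grad = sum((g - o).pow(2).sum() for g, o in zip(grads, observed))  # (i) gradient matching
    feats = prior_cnn[0](dummy)                                          # conv features
    bn = prior_cnn[1]
    l_prior = (feats.mean((0, 2, 3)) - bn.running_mean).pow(2).sum() + \
              (feats.var((0, 2, 3), unbiased=False) - bn.running_var).pow(2).sum()  # (ii) BN prior
    l_tv = total_variation(dummy)                                        # (iii) smoothness term
    loss = l_grad + 1e-2 * l_prior + 1e-4 * l_tv                         # arbitrary weights
    opt.zero_grad(); loss.backward(); opt.step()
```

The structure mirrors the stated objective: only the input tensor is optimized, while the victim weights and the observed gradients stay fixed.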
4、[CV] Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation
A Vobecky, D Hurych, O Siméoni, S Gidaris, A Bursuc, P Pérez, J Sivic
[CTU in Prague & valeo.ai]
This work studies learning pixel-wise semantic image segmentation of urban scenes without any manual annotation, using only raw, non-curated data collected by cars equipped with cameras and LiDAR sensors as they drive around a city. The contributions are threefold. First, a novel method is proposed for cross-modal unsupervised learning of semantic image segmentation that leverages synchronized LiDAR and image data; its key ingredient is an object-proposal module that analyzes the LiDAR point cloud to obtain spatially consistent object proposals. Second, these 3D object proposals can be aligned with the input images and reliably clustered into semantically meaningful pseudo-classes. Third, a cross-modal distillation approach is developed that uses image data partially annotated with the resulting pseudo-classes to train a transformer-based model for image semantic segmentation. Evaluations on four different test datasets (Cityscapes, Dark Zurich, Nighttime Driving and ACDC), without any fine-tuning, demonstrate the generalization ability of the method and significant improvements over the current state of the art on this problem.
This work investigates learning pixel-wise semantic image segmentation in urban scenes without any manual annotation, just from the raw non-curated data collected by cars which, equipped with cameras and LiDAR sensors, drive around a city. Our contributions are threefold. First, we propose a novel method for cross-modal unsupervised learning of semantic image segmentation by leveraging synchronized LiDAR and image data. The key ingredient of our method is the use of an object proposal module that analyzes the LiDAR point cloud to obtain proposals for spatially consistent objects. Second, we show that these 3D object proposals can be aligned with the input images and reliably clustered into semantically meaningful pseudo-classes. Finally, we develop a cross-modal distillation approach that leverages image data partially annotated with the resulting pseudo-classes to train a transformer-based model for image semantic segmentation. We show the generalization capabilities of our method by testing on four different testing datasets (Cityscapes, Dark Zurich, Nighttime Driving and ACDC) without any finetuning, and demonstrate significant improvements compared to the current state of the art on this problem.
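To illustrate the cross-modal part of the pipeline, here is a toy sketch of two steps only: projecting LiDAR points (already carrying pseudo-class labels from the 3D proposal and clustering stages, which are omitted) into the image to obtain sparse pixel pseudo-labels, and distilling those labels into a segmentation network. The camera intrinsics, the random stand-in data, and the small conv net used in place of the paper's transformer segmenter are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy camera intrinsics and a random LiDAR sweep already expressed in camera coordinates.
H, W = 64, 96
K = torch.tensor([[60., 0., 48.], [0., 60., 32.], [0., 0., 1.]])
points = torch.rand(500, 3) * torch.tensor([8., 4., 10.]) + torch.tensor([-4., -2., 1.])
pseudo_class = torch.randint(0, 5, (500,))   # per-point pseudo-class from 3D proposals + clustering

# Project the 3D points into the image to align pseudo-labels with pixels.
uvw = (K @ points.T).T
u = (uvw[:, 0] / uvw[:, 2]).long()
v = (uvw[:, 1] / uvw[:, 2]).long()
valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)

label_map = torch.full((H, W), -1, dtype=torch.long)      # -1 = unlabeled pixel
label_map[v[valid], u[valid]] = pseudo_class[valid]

# Distil the sparse pseudo-labels into a (toy) image segmentation network.
segnet = nn.Conv2d(3, 5, 3, padding=1)                    # stands in for the transformer model
image = torch.rand(1, 3, H, W)
opt = torch.optim.Adam(segnet.parameters(), lr=1e-3)
for _ in range(100):
    logits = segnet(image)                                # (1, 5, H, W)
    loss = nn.functional.cross_entropy(logits, label_map.unsqueeze(0), ignore_index=-1)
    opt.zero_grad(); loss.backward(); opt.step()
```

The `ignore_index` mechanism is one simple way to realize "partially annotated" supervision: only pixels hit by a projected, pseudo-labeled LiDAR point contribute to the distillation loss.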
5、[CV] NeRFusion: Fusing Radiance Fields for Large-Scale Scene Reconstruction
X Zhang, S Bi, K Sunkavalli, H Su, Z Xu
[University of California, San Diego & Adobe Research]
While NeRF has achieved great success in neural reconstruction and rendering, its limited MLP capacity and long per-scene optimization times make it challenging to model large-scale indoor scenes. In contrast, classical 3D reconstruction methods can handle large-scale scenes but do not produce realistic renderings. This paper proposes NeRFusion, which combines the advantages of NeRF and TSDF-based fusion techniques to achieve efficient large-scale reconstruction and photo-realistic rendering. The input image sequence is processed to predict per-frame local radiance fields via direct network inference; these fields are then fused by a novel recurrent network that incrementally reconstructs a global, sparse scene representation in real time at 22 fps. The global volume can be further fine-tuned to boost rendering quality. Experiments show that NeRFusion achieves state-of-the-art quality on both large-scale indoor scenes and small-scale object scenes, with substantially faster reconstruction than NeRF and other recent methods.
While NeRF [28] has shown great success for neural reconstruction and rendering, its limited MLP capacity and long per-scene optimization times make it challenging to model large-scale indoor scenes. In contrast, classical 3D reconstruction methods can handle large-scale scenes but do not produce realistic renderings. We propose NeRFusion, a method that combines the advantages of NeRF and TSDF-based fusion techniques to achieve efficient large-scale reconstruction and photo-realistic rendering. We process the input image sequence to predict per-frame local radiance fields via direct network inference. These are then fused using a novel recurrent neural network that incrementally reconstructs a global, sparse scene representation in real-time at 22 fps. This global volume can be further finetuned to boost rendering quality. We demonstrate that NeRFusion achieves state-of-the-art quality on both large-scale indoor and small-scale object scenes, with substantially faster reconstruction than NeRF and other recent methods.
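A schematic sketch of the incremental fusion step described above: per-frame local feature volumes are merged into a running global volume by a recurrent cell, updating only the voxels observed in the current frame. The dense toy grid, the GRUCell, the random "local radiance field" features, and the absence of any training are assumptions; the actual method maintains a sparse global volume, predicts the local fields with a learned network, and decodes the fused features into density and color for rendering.

```python
import torch
import torch.nn as nn

# Global scene volume, kept dense here for simplicity: (features, X, Y, Z).
F, G = 8, 16
global_volume = torch.zeros(F, G, G, G)

# GRU-style fusion cell: merges a new local observation into the running global state.
fusion = nn.GRUCell(input_size=F, hidden_size=F)

def fuse_frame(global_volume, local_volume, mask):
    """Update only the voxels observed by the current frame (mask: bool (G, G, G))."""
    h = global_volume[:, mask].T                 # (n_observed, F) current global features
    x = local_volume[:, mask].T                  # (n_observed, F) per-frame local features
    global_volume[:, mask] = fusion(x, h).T.detach()
    return global_volume

for frame in range(10):                          # incremental, per-frame reconstruction
    local = torch.randn(F, G, G, G)              # stands in for the per-frame radiance field
    observed = torch.rand(G, G, G) > 0.7         # voxels seen by this frame
    global_volume = fuse_frame(global_volume, local, observed)

# A volume renderer / MLP decoder would then map fused voxel features to density and color,
# and the global volume could be further fine-tuned against the input images.
```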
A few more papers worth noting:
[CL] Efficient Classification of Long Documents Using Transformers
H H Park, Y Vyas, K Shah
[University of Illinois & AWS AI Labs & Microsoft]
[CV] Open-Vocabulary DETR with Conditional Matching
Y Zang, W Li, K Zhou, C Huang, C C Loy
[Nanyang Technological University & CMU]
[CV] Generating natural images with direct Patch Distributions Matching
A Elnekave, Y Weiss
[Hebrew University Jerusalem]
[CV] WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
S Yuan, Z Shuai, L Jiahong, X Zhao, Z Hanyu, T Jie
[Beijing Academy of Artificial Intelligence & Tsinghua University]