LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics

Reposted from 爱可可爱生活

Summary: pre-training for Document AI with unified text and image masking; a temporally efficient vision Transformer for video instance segmentation; stretching sentence-pair NLI models to reason over long documents and clusters; an extendable, efficient and effective Transformer-based object detector; planting undetectable backdoors in machine learning models; the universal approximation property of invertible neural networks; a chatbot AI for task-oriented dialogue with offline reinforcement learning; multilingual language model adaptive fine-tuning, a study on African languages; benchmarking generalization via in-context instructions on 1,600+ language tasks

 

1、[CL] LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Y Huang, T Lv, L Cui, Y Lu, F Wei

[Sun Yat-sen University & Microsoft Research]

LayoutLMv3: pre-training for Document AI with unified text and image masking. Self-supervised pre-training techniques have made remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in their pre-training objectives for the image modality. This discrepancy makes multimodal representation learning harder. This paper proposes LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. In addition, LayoutLMv3 is pre-trained with a word-patch alignment objective that learns cross-modal alignment by predicting whether the image patch corresponding to a text word is masked. LayoutLMv3 does not rely on a pre-trained CNN or Faster R-CNN backbone to extract visual features, which greatly saves parameters and removes the need for region annotations. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model suited to both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only on text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also on image-centric tasks such as document image classification and document layout analysis.

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.

https://arxiv.org/abs/2204.08387
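To make the word-patch alignment (WPA) idea concrete, below is a minimal sketch of such a prediction head, assuming a PyTorch setting. Class names, tensor shapes, and label handling are illustrative assumptions, not the authors' implementation (the paper's actual objective includes details, e.g. the treatment of masked words, that are not reflected here).

```python
# Minimal sketch of a word-patch alignment (WPA) head, assuming PyTorch;
# all names and shapes are illustrative, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordPatchAlignmentHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # binary classifier: is the image patch under this word masked or not?
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, text_hidden, patch_is_masked, word_to_patch):
        # text_hidden:     [batch, num_words, hidden]  encoder outputs for text tokens
        # patch_is_masked: [batch, num_patches] (0/1)   which image patches were masked
        # word_to_patch:   [batch, num_words]   (long)  index of the patch covering each word
        logits = self.classifier(text_hidden)                     # [batch, num_words, 2]
        labels = patch_is_masked.long().gather(1, word_to_patch)  # 1 if the word's patch is masked
        return F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
```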

 

2、[CV] Temporally Efficient Vision Transformer for Video Instance Segmentation

S Yang, X Wang, Y Li, Y Fang...

[Huazhong University of Science & Technology & Tencent PCG & International Digital Economy Academy (IDEA)]

Temporally Efficient Vision Transformer for video instance segmentation. Recently, vision Transformers have achieved tremendous success on image-level visual recognition tasks. To model the crucial temporal information within a video clip effectively and efficiently, this paper proposes a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous Transformer-based VIS methods, TeViT is nearly convolution-free and consists of a Transformer backbone and a query-based video instance segmentation head. In the backbone stage, a nearly parameter-free messenger shift mechanism is proposed for early temporal context fusion. In the head stage, a parameter-shared spatiotemporal query interaction mechanism is proposed to build one-to-one correspondence between video instances and queries. TeViT thus fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results while maintaining high inference speed, e.g., 46.6 AP at 68.9 FPS on YouTube-VIS-2019.

Recently vision transformer has achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free, which contains a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build the one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.

https://arxiv.org/abs/2204.08412
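The messenger shift idea can be pictured as exchanging information between neighboring frames by moving part of each frame's messenger tokens along the temporal axis, without introducing learned parameters. The snippet below is only a rough, hypothetical illustration of such a temporal shift, not the authors' implementation (see the linked repository for that).

```python
# Rough sketch of shifting per-frame "messenger" tokens along the temporal axis
# for early temporal fusion (illustrative only; not the official TeViT code).
import torch

def messenger_shift(messenger_tokens: torch.Tensor) -> torch.Tensor:
    # messenger_tokens: [T, M, C] -- T frames, M messenger tokens per frame, C channels
    t, m, c = messenger_tokens.shape
    shifted = messenger_tokens.clone()
    # shift one third of the channels forward in time, one third backward,
    # and leave the remaining channels untouched -- no learned parameters involved
    shifted[:, :, : c // 3] = torch.roll(messenger_tokens[:, :, : c // 3], shifts=1, dims=0)
    shifted[:, :, c // 3 : 2 * c // 3] = torch.roll(
        messenger_tokens[:, :, c // 3 : 2 * c // 3], shifts=-1, dims=0
    )
    return shifted
```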

 

3、[CL] Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters

T Schuster, S Chen, S Buthpitiya, A Fabrikant, D Metzler

[Google Research]

Stretching sentence-pair NLI models to reason over long documents and clusters. Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs. While early work identified certain biases in NLI models, recent advances in modeling and datasets have demonstrated promising performance. This paper further explores the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on. It analyzes the robustness of these models to longer and out-of-domain inputs, and develops new aggregation methods that allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset. NLI scores turn out to provide strong retrieval signals, leading to more relevant evidence extraction than common similarity-based methods. The work further investigates whole document clusters to identify both discrepancies and consensus among sources; in a test case, it uncovers real inconsistencies between Wikipedia pages in different languages about the same topic.

Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs. While early work identified certain biases in NLI models, recent advancements in modeling and datasets demonstrated promising performance. In this work, we further explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on. First, we analyze the robustness of these models to longer and out-of-domain inputs. Then, we develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset. Interestingly, we find NLI scores to provide strong retrieval signals, leading to more relevant evidence extractions compared to common similarity-based methods. Finally, we go further and investigate whole document clusters to identify both discrepancies and consensus among sources. In a test case, we find real inconsistencies between Wikipedia pages in different languages about the same topic.

https://arxiv.org/abs/2204.07447
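One simple way to lift a sentence-pair NLI model to a full document is to score the hypothesis against every document span and then aggregate, e.g. with a max rule, reusing the per-span entailment scores as a retrieval signal for evidence. The sketch below illustrates only that simple idea with a hypothetical nli_score callable; the paper studies several aggregation strategies and this is not its exact procedure.

```python
# Sketch of aggregating sentence-pair NLI scores over a whole document
# (a simple max-rule baseline; the nli_score callable is a placeholder).
from typing import Callable, Dict, List

def document_nli(premise_spans: List[str], hypothesis: str,
                 nli_score: Callable[[str, str], Dict[str, float]]) -> Dict[str, object]:
    """Score the hypothesis against every span of the document and aggregate."""
    scores = [nli_score(span, hypothesis) for span in premise_spans]
    entail = max(s["entail"] for s in scores)
    contradict = max(s["contradict"] for s in scores)
    # per-span entailment scores double as a retrieval signal for evidence selection
    evidence = max(zip(premise_spans, scores), key=lambda pair: pair[1]["entail"])[0]
    if contradict > max(entail, 0.5):
        label = "contradiction"
    elif entail > 0.5:
        label = "entailment"
    else:
        label = "neutral"
    return {"label": label, "entail": entail, "contradict": contradict, "evidence": evidence}
```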

 

4、[CV] An Extendable, Efficient and Effective Transformer-based Object Detector

H Song, D Sun, S Chun, V Jampani, D Han, B Heo, W Kim, M Yang

[NAVER AI Lab & Google]

An extendable, efficient and effective Transformer-based object detector. Transformers have been widely used in many vision problems, especially visual recognition and detection. Detection Transformers were the first fully end-to-end learning systems for object detection, while vision Transformers were the first fully Transformer-based architectures for image classification. This paper integrates Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer into a standalone object detector, followed by a computationally efficient Transformer decoder that exploits multi-scale features and auxiliary techniques to boost detection performance without much increase in computational load. The model is further extended to ViDT+ to support joint-task learning for object detection and instance segmentation, attaching an efficient multi-scale feature fusion layer and using two additional auxiliary training losses, an IoU-aware loss and a token labeling loss. Extensive evaluation on the Microsoft COCO benchmark shows that ViDT obtains the best AP and latency trade-off among existing fully Transformer-based object detectors, and its extension ViDT+ achieves 53.2 AP owing to its high scalability for large models.

Transformers have been widely used in numerous vision problems especially for visual recognition and detection. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. In addition, we extend it to ViDT+ to support joint-task learning for object detection and instance segmentation. Specifically, we attach an efficient multi-scale feature fusion layer and utilize two more auxiliary training losses, IoU-aware loss and token labeling loss. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and its extended ViDT+ achieves 53.2 AP owing to its high scalability for large models. The source code and trained models are available at https://github.com/naver-ai/vidt.

https://arxiv.org/abs/2204.07962
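As an illustration of one of the auxiliary objectives, below is a minimal sketch of an IoU-aware loss: an extra head is trained to predict the IoU between each matched predicted box and its ground-truth box. Function names, shapes, and loss details are assumptions for illustration, not the ViDT+ implementation.

```python
# Minimal sketch of an IoU-aware auxiliary loss (illustrative assumptions only).
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def iou_aware_loss(pred_iou_logits: torch.Tensor,
                   pred_boxes: torch.Tensor,
                   gt_boxes: torch.Tensor) -> torch.Tensor:
    # pred_iou_logits: [num_matched]    raw logits from the IoU prediction head
    # pred_boxes:      [num_matched, 4] matched predictions, (x1, y1, x2, y2)
    # gt_boxes:        [num_matched, 4] corresponding ground-truth boxes
    with torch.no_grad():
        target_iou = box_iou(pred_boxes, gt_boxes).diagonal()  # IoU of each matched pair
    # regress the predicted IoU toward the actual IoU of the matched pair
    return F.binary_cross_entropy_with_logits(pred_iou_logits, target_iou)
```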

 

5、[LG] Planting Undetectable Backdoors in Machine Learning Models

S Goldwasser, M P. Kim, V Vaikuntanathan, O Zamir

[UC Berkeley & MIT & IAS]

Planting undetectable backdoors in machine learning models. Given the computational cost and technical expertise required to train machine learning models, users may delegate the learning task to a service provider. Delegating learning has clear benefits but also raises serious trust concerns. This paper studies how an untrusted learning service provider could abuse its power, showing how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality the learner keeps a mechanism that can change the classification of any input with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally bounded observer. Two frameworks for planting undetectable backdoors are demonstrated, with incomparable guarantees.

- First, it shows how to plant a backdoor in any model using digital signature schemes. The construction guarantees that, given query access to the original model and the backdoored version, it is computationally infeasible to find even a single input on which they differ. This property implies that the backdoored model has generalization error comparable to the original model. Moreover, even a distinguisher that can request backdoored inputs of its choice cannot backdoor a new input on its own, a property referred to as non-replicability.

- Second, it shows how to insert undetectable backdoors into models trained with the Random Fourier Features (RFF) learning paradigm. In this construction, undetectability holds against powerful white-box distinguishers: given a complete description of the network and the training data, no efficient distinguisher can guess whether the model is "clean" or contains a backdoor. The backdooring algorithm faithfully executes the RFF algorithm on the given training data, tampering only with its random coins. This strong guarantee is proved under the hardness of the Continuous Learning With Errors problem. A similar white-box undetectable backdoor is shown for random ReLU networks, under the hardness of Sparse PCA.

The construction of undetectable backdoors also sheds light on the related question of robustness to adversarial examples. In particular, by constructing an undetectable backdoor for an "adversarially robust" learning algorithm, one can produce a classifier that is indistinguishable from a robust classifier but in which every input has an adversarial example; the existence of undetectable backdoors thus represents a significant theoretical roadblock to certifying adversarial robustness.

Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees. First, we show how to plant a backdoor in any model, using digital signature schemes. The construction guarantees that given black-box access to the original model and the backdoored version, it is computationally infeasible to find even a single input where they differ. This property implies that the backdoored model has generalization error comparable with the original model. Second, we demonstrate how to insert undetectable backdoors in models trained using the Random Fourier Features (RFF) learning paradigm or in Random ReLU networks. In this construction, undetectability holds against powerful white-box distinguishers: given a complete description of the network and the training data, no efficient distinguisher can guess whether the model is "clean" or contains a backdoor. Our construction of undetectable backdoors also sheds light on the related issue of robustness to adversarial examples. In particular, our construction can produce a classifier that is indistinguishable from an "adversarially robust" classifier, but where every input has an adversarial example! In summary, the existence of undetectable backdoors represents a significant theoretical roadblock to certifying adversarial robustness.

https://arxiv.org/abs/2204.06974
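The control flow of the signature-based construction can be caricatured as follows. In the paper, the verification circuit is embedded in the model's weights and the signature is hidden inside the input perturbation itself; this toy sketch instead verifies a separately supplied tag and uses HMAC purely as a stand-in for a public-key signature scheme, so it conveys only the high-level idea, not the actual construction.

```python
# Toy caricature of a signature-triggered backdoor: the wrapper behaves exactly
# like the original model unless the input carries a tag that verifies under a
# secret key, in which case the attacker-chosen label is returned.
# HMAC stands in for a real digital-signature scheme for illustration only.
import hmac
import hashlib
import numpy as np

SECRET_KEY = b"backdoor-key"  # known only to the malicious learner

def make_tag(x: np.ndarray) -> bytes:
    """Attacker-side: compute the tag that activates the backdoor for input x."""
    return hmac.new(SECRET_KEY, x.tobytes(), hashlib.sha256).digest()

def backdoored_predict(model, x: np.ndarray, embedded_tag: bytes) -> int:
    if hmac.compare_digest(embedded_tag, make_tag(x)):
        return 1  # attacker-chosen output for "signed" inputs
    return int(model(x))  # otherwise indistinguishable from the original model
```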

 

Several other papers worth noting:

 

[LG] Universal approximation property of invertible neural networks

Universal approximation property of invertible neural networks

I Ishikawa, T Teshima, K Tojo, K Oono, M Ikeda, M Sugiyama

[Ehime University & The University of Tokyo & RIKEN]

https://arxiv.org/abs/2204.07415

 

[CL] CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning

CHAI: a chatbot AI for task-oriented dialogue with offline reinforcement learning

S Verma, J Fu, M Yang, S Levine

[UC Berkeley]

https://arxiv.org/abs/2204.08426

 

[CL] Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages

Multilingual language model adaptive fine-tuning: a study on African languages

J O. Alabi, D I Adelani, M Mosbach, D Klakow

[Inria & Saarland University]

https://arxiv.org/abs/2204.06487

 

[CL] Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks

Benchmarking generalization via in-context instructions on 1,600+ language tasks

Y Wang, S Mishra...

[Allen Institute for AI & Univ. of Washington & Arizona State Univ. & Sharif Univ. of Tech...]

https://arxiv.org/abs/2204.07705

 
