LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics
Reposted from 爱可可爱生活
1、[CL] Self-Training with Weak Supervision
G Karamanolakis, S Mukherjee, G Zheng, A H Awadallah
[Columbia University & Microsoft Research]
Self-training with weak supervision. State-of-the-art deep neural networks require large-scale labeled training data, which is often expensive to obtain or simply unavailable for many tasks. In such settings, weak supervision in the form of domain-specific rules has proven useful for automatically generating weakly labeled training data. However, learning from weak rules is challenging because of their inherently heuristic and noisy nature. A further challenge is rule coverage and overlap: prior work on weak supervision considers only the instances covered by weak rules, ignoring valuable unlabeled data. This paper proposes a weak supervision framework, ASTRA, that leverages all available data for a given task. It exploits task-specific unlabeled data through self-training with a model (the student) that uses contextualized representations and predicts pseudo-labels for instances that may not be covered by any weak rule. A rule attention network (the teacher) further learns how to combine the student's pseudo-labels with weak-rule labels, conditioned on their fidelity and the underlying context of each instance. A semi-supervised learning objective is constructed for end-to-end training with unlabeled data, domain-specific rules, and a small amount of labeled data. Extensive experiments on six text classification benchmark datasets demonstrate the effectiveness of the approach, with clear improvements over state-of-the-art baselines.
State-of-the-art deep neural networks require large-scale labeled training data that is often expensive to obtain or not available for many tasks. Weak supervision in the form of domain-specific rules has been shown to be useful in such settings to automatically generate weakly labeled training data. However, learning with weak rules is challenging due to their inherent heuristic and noisy nature. An additional challenge is rule coverage and overlap, where prior work on weak supervision only considers instances that are covered by weak rules, thus leaving valuable unlabeled data behind. In this work, we develop a weak supervision framework (ASTRA) that leverages all the available data for a given task. To this end, we leverage task-specific unlabeled data through self-training with a model (student) that considers contextualized representations and predicts pseudo-labels for instances that may not be covered by weak rules. We further develop a rule attention network (teacher) that learns how to aggregate student pseudo-labels with weak rule labels, conditioned on their fidelity and the underlying context of an instance. Finally, we construct a semi-supervised learning objective for end-to-end training with unlabeled data, domain-specific rules, and a small amount of labeled data. Extensive experiments on six benchmark datasets for text classification demonstrate the effectiveness of our approach with significant improvements over state-of-the-art baselines.
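The teacher's aggregation step can be sketched in a few lines of numpy. Everything below (names, shapes, the uniform fidelity scores, and the omission of masking for non-firing rules) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def teacher_aggregate(rule_labels, student_probs, fidelity_scores):
    """Combine weak-rule votes with the student's pseudo-label distribution,
    weighted by attention over per-source fidelity scores.

    rule_labels:     (n_rules, n_classes) one-hot votes per firing rule.
    student_probs:   (n_classes,) student pseudo-label distribution.
    fidelity_scores: (n_rules + 1,) unnormalized scores; the last entry
                     belongs to the student.
    """
    sources = np.vstack([rule_labels, student_probs])  # (n_rules+1, n_classes)
    weights = softmax(fidelity_scores)                 # convex combination
    return weights @ sources                           # (n_classes,) soft label

# Two rules vote for class 0; the student leans toward class 1.
rules = np.array([[1.0, 0.0], [1.0, 0.0]])
student = np.array([0.2, 0.8])
soft_label = teacher_aggregate(rules, student, np.array([1.0, 1.0, 1.0]))
```

With uniform fidelity scores the soft label is just the average of the three sources, so the rule majority dominates while the student still contributes mass to class 1.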
2、[CV] Transfer Learning for Pose Estimation of Illustrated Characters
S Chen, M Zwicker
[University of Maryland]
Transfer learning for pose estimation of illustrated characters. Human pose information is a critical component of many downstream image processing tasks, such as activity recognition and motion tracking. Likewise, a pose estimator for the illustrated-character domain would provide a valuable prior for assistive content creation tasks, such as reference pose retrieval and automatic character animation. But while modern data-driven techniques have substantially improved pose estimation performance on natural images, little work has been done for illustrations. This paper bridges the domain gap by efficiently transfer-learning from both domain-specific and task-specific source models. The authors upgrade and expand an existing illustrated pose estimation dataset, introduce two new datasets for classification and segmentation subtasks, and apply the resulting state-of-the-art character pose estimator to the novel task of pose-guided illustration retrieval.
Human pose information is a critical component in many downstream image processing tasks, such as activity recognition and motion tracking. Likewise, a pose estimator for the illustrated character domain would provide a valuable prior for assistive content creation tasks, such as reference pose retrieval and automatic character animation. But while modern data-driven techniques have substantially improved pose estimation performance on natural images, little work has been done for illustrations. In our work, we bridge this domain gap by efficiently transfer-learning from both domain-specific and task-specific source models. Additionally, we upgrade and expand an existing illustrated pose estimation dataset, and introduce two new datasets for classification and segmentation subtasks. We then apply the resultant state-of-the-art character pose estimator to solve the novel task of pose-guided illustration retrieval. All data, models, and code will be made publicly available.
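The core transfer-learning recipe (freeze a source-domain backbone, train only a new target-domain head) can be sketched with numpy. The frozen backbone, the synthetic target task, and all names here are stand-ins, not the paper's actual networks or data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained source-domain backbone: a fixed (frozen)
# nonlinear feature extractor. In the paper this would be a pose or
# classification network trained on related source data.
W_frozen = rng.normal(scale=0.3, size=(32, 8))

def features(x):
    return np.tanh(x @ W_frozen)   # backbone weights are never updated

def train_head(F, y, lr=0.5, steps=500):
    """Train only the new task head (logistic regression) on target data."""
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w)))
        w -= lr * F.T @ (p - y) / len(y)
    return w

# Synthetic target-domain task, constructed to be solvable in feature space.
X = rng.normal(size=(64, 32))
F = features(X)
y = (F @ rng.normal(size=8) > 0).astype(float)

w_head = train_head(F, y)
acc = float(np.mean((F @ w_head > 0) == (y > 0.5)))
```

Only the 8 head weights are fit; the 256 backbone weights stay fixed, which is what makes transfer learning cheap when target-domain data (here, illustrations) is scarce.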
3、[LG] Feature-Attending Recurrent Modules for Generalization in Reinforcement Learning
W Carvalho, A Lampinen, K Nikiforou, F Hill, M Shanahan
[University of Michigan & DeepMind]
Feature-Attending Recurrent Modules for generalization in reinforcement learning. Deep reinforcement learning (deep RL) has recently seen significant progress in developing algorithms for generalization, but most algorithms target a single type of generalization setting. This paper studies generalization across three disparate task structures: (a) tasks composed of spatial and temporal compositions of regularly occurring object motions; (b) tasks composed of active perception of, and navigation towards, regularly occurring 3D objects; and (c) tasks composed of remembering goal information over sequences of regularly occurring object configurations. These diverse task structures share an underlying idea of compositionality: task completion always involves combining recurring segments of task-oriented perception and behavior. The hypothesis is that an agent can generalize within a task structure if it discovers representations that capture these recurring task segments. For the tasks studied, this corresponds to representations for recognizing individual object motions, for navigating towards 3D objects, and for navigating through object configurations. Taking inspiration from cognitive science, the paper terms representations of recurring segments of an agent's experience "perceptual schemas", and proposes Feature-Attending Recurrent Modules (FARM), which learn a state representation in which perceptual schemas are distributed across multiple relatively small recurrent modules. FARM is compared to recurrent architectures that leverage spatial attention, which reduces observation features to a weighted average over spatial positions. Experiments indicate that the feature-attention mechanism better enables FARM to generalize across the diverse object-centric domains studied.
Deep reinforcement learning (Deep RL) has recently seen significant progress in developing algorithms for generalization. However, most algorithms target a single type of generalization setting. In this work, we study generalization across three disparate task structures: (a) tasks composed of spatial and temporal compositions of regularly occurring object motions; (b) tasks composed of active perception of and navigation towards regularly occurring 3D objects; and (c) tasks composed of remembering goal-information over sequences of regularly occurring object-configurations. These diverse task structures all share an underlying idea of compositionality: task completion always involves combining recurring segments of task-oriented perception and behavior. We hypothesize that an agent can generalize within a task structure if it can discover representations that capture these recurring task-segments. For our tasks, this corresponds to representations for recognizing individual object motions, for navigation towards 3D objects, and for navigating through object-configurations. Taking inspiration from cognitive science, we term representations for recurring segments of an agent's experience, "perceptual schemas". We propose Feature Attending Recurrent Modules (FARM), which learns a state representation where perceptual schemas are distributed across multiple, relatively small recurrent modules. We compare FARM to recurrent architectures that leverage spatial attention, which reduces observation features to a weighted average over spatial positions. Our experiments indicate that our feature-attention mechanism better enables FARM to generalize across the diverse object-centric domains we study.
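The architectural idea, several small recurrent modules that each attend over feature channels rather than spatial positions, can be sketched in numpy. Shapes, the simple RNN update, and all names are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class FeatureAttnModule:
    """One relatively small recurrent module. It attends over observation
    feature channels (not spatial positions) and folds the attended
    read-out into its own hidden state."""
    def __init__(self, feat_dim, hidden_dim):
        self.Wq = rng.normal(scale=0.1, size=(feat_dim, hidden_dim))
        self.Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.Wx = rng.normal(scale=0.1, size=(hidden_dim, feat_dim))
        self.h = np.zeros(hidden_dim)

    def step(self, feats):                # feats: (n_channels, feat_dim)
        query = self.Wq @ self.h          # module-specific query
        attn = softmax(feats @ query)     # weights over feature channels
        read = attn @ feats               # (feat_dim,) attended read-out
        self.h = np.tanh(self.Wh @ self.h + self.Wx @ read)
        return self.h

class FARM:
    """State representation distributed across several small modules; the
    agent's state is the concatenation of the module hidden states."""
    def __init__(self, n_modules=4, feat_dim=16, hidden_dim=8):
        self.modules = [FeatureAttnModule(feat_dim, hidden_dim)
                        for _ in range(n_modules)]

    def step(self, feats):
        return np.concatenate([m.step(feats) for m in self.modules])

farm = FARM()
obs = rng.normal(size=(10, 16))           # 10 observation feature channels
state = farm.step(obs)                    # (4 * 8,) distributed state
```

Each module has its own query, so different modules can lock onto different recurring feature patterns, which is the intuition behind distributing perceptual schemas across modules.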
4、[CV] 3D Question Answering
S Ye, D Chen, S Han, J Liao
[City University of Hong Kong & Microsoft Cloud AI & University of California San Diego]
3D question answering. Visual question answering (VQA) has witnessed tremendous progress in recent years, but most efforts focus only on 2D image question answering. This paper attempts to extend VQA to the 3D domain, facilitating artificial intelligence's perception of real-world 3D scenes. Unlike image-based VQA, 3D question answering (3DQA) takes a colored point cloud as input and requires both appearance and 3D geometry comprehension to answer 3D-related questions. The paper proposes a novel Transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders for exploiting appearance and geometry information, respectively. The multi-modal information from appearance, geometry, and the linguistic question is ultimately fused via a 3D-Linguistic BERT to predict the target answers. To verify the effectiveness of the framework, the authors further develop the first 3DQA dataset, "ScanQA", which builds on the ScanNet dataset and contains ~6K questions and ~30K answers for 806 scenes. Extensive experiments on this dataset show that the proposed 3DQA framework clearly outperforms existing VQA frameworks.
Visual Question Answering (VQA) has witnessed tremendous progress in recent years. However, most efforts only focus on the 2D image question answering tasks. In this paper, we present the first attempt at extending VQA to the 3D domain, which can facilitate artificial intelligence’s perception of 3D real-world scenarios. Different from image based VQA, 3D Question Answering (3DQA) takes the color point cloud as input and requires both appearance and 3D geometry comprehension ability to answer the 3D-related questions. To this end, we propose a novel transformer-based 3DQA framework “3DQA-TR”, which consists of two encoders for exploiting the appearance and geometry information, respectively. The multi-modal information of appearance, geometry, and the linguistic question can finally attend to each other via a 3D-Linguistic BERT to predict the target answers. To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset “ScanQA”, which builds on the ScanNet dataset and contains ~6K questions and ~30K answers for 806 scenes. Extensive experiments on this dataset demonstrate the obvious superiority of our proposed 3DQA framework over existing VQA frameworks, and the effectiveness of our major designs. Our code and dataset will be made publicly available to facilitate the research in this direction.
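The two-encoder-plus-fusion layout can be sketched in numpy: separate geometry and appearance token encoders over a colored point cloud, fused with question tokens. The single cross-attention pass here is a hypothetical stand-in for the paper's 3D-Linguistic BERT, and all dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def encode(x, W):
    return np.tanh(x @ W)                 # per-point token embedding

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

# A colored point cloud: xyz geometry plus rgb appearance per point.
pts = rng.normal(size=(50, 3))
rgb = rng.uniform(size=(50, 3))
geo_tok = encode(pts, rng.normal(size=(3, d)))   # geometry encoder
app_tok = encode(rgb, rng.normal(size=(3, d)))   # appearance encoder

# Question tokens from a (hypothetical) language embedding.
q_tok = rng.normal(size=(6, d))

# Fusion, sketched as one cross-attention pass of question tokens over the
# joint sequence of geometry, appearance, and question tokens.
kv = np.vstack([geo_tok, app_tok, q_tok])
fused = attention(q_tok, kv, kv)                         # (6, d)
logits = fused.mean(axis=0) @ rng.normal(size=(d, 10))   # 10 candidate answers
answer = int(np.argmax(logits))
```

The point is the data flow: geometry and appearance are embedded separately, then the question attends jointly over both token streams before answer prediction.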
5、[CL] Boosted Dense Retriever
P Lewis, B Oğuz, W Xiong, F Petroni, W Yih, S Riedel
[Meta AI]
Boosted dense retriever. This paper proposes DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specializes by focusing only on the retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all component models, making it a drop-in replacement for a standard dense retriever at test time. Compared to standard dense retrieval models, DrBoost has several advantages: it produces representations that are 4x more compact while delivering comparable retrieval results, and it performs surprisingly well under approximate search with coarse quantization, reducing latency and bandwidth needs by another 4x. In practice, this can make the difference between serving indices from disk versus from memory, paving the way for much cheaper deployments.
We propose DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specialized by focusing only on retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all the component models, making it a drop-in replacement for standard dense retrievers at test time. DrBoost enjoys several advantages compared to standard dense retrieval models. It produces representations which are 4x more compact, while delivering comparable retrieval results. It also performs surprisingly well under approximate search with coarse quantization, reducing latency and bandwidth needs by another 4x. In practice, this can make the difference between serving indices from disk versus from memory, paving the way for much cheaper deployments.
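The staging-and-concatenation mechanics can be sketched in numpy. Real DrBoost components are trained encoders; here each "training" step is a crude stand-in that just picks, among a few random candidate projections, the one that best handles the queries the current ensemble gets wrong. All of that selection logic is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, dim, proj_dim, n_stages = 30, 8, 2, 3
docs = rng.normal(size=(n_docs, dim))
queries = docs + 0.2 * rng.normal(size=(n_docs, dim))  # query i matches doc i

def retrieve(q_emb, d_emb):
    return np.argmax(q_emb @ d_emb.T, axis=1)

q_parts, d_parts = [], []
for stage in range(n_stages):
    # Which queries does the current ensemble still get wrong?
    if q_parts:
        preds = retrieve(np.hstack(q_parts), np.hstack(d_parts))
        hard = np.flatnonzero(preds != np.arange(n_docs))
    else:
        hard = np.arange(n_docs)

    # Stand-in for training a weak component on those mistakes: choose the
    # candidate encoder that scores best on the hard queries alone.
    def hard_acc(W):
        if hard.size == 0:
            return 1.0
        preds = retrieve((queries @ W)[hard], docs @ W)
        return float(np.mean(preds == hard))

    W = max([rng.normal(size=(dim, proj_dim)) for _ in range(5)], key=hard_acc)
    q_parts.append(queries @ W)
    d_parts.append(docs @ W)

# Drop-in replacement: the ensemble embedding is simply the concatenation.
q_emb, d_emb = np.hstack(q_parts), np.hstack(d_parts)
acc = float(np.mean(retrieve(q_emb, d_emb) == np.arange(n_docs)))
```

Because the final embedding is one flat vector per query and per document, any existing dense-retrieval index and scoring code works unchanged, which is what "drop-in replacement at test time" means.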
A few more papers worth noting:
[LG] DeepSight: Mitigating Backdoor Attacks in Federated Learning Through Deep Model Inspection
P Rieger, T D Nguyen, M Miettinen, A Sadeghi
[Technical University of Darmstadt]
[CL] Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence
A Hoyle, P Goel, D Peskov, A Hian-Cheong, J Boyd-Graber, P Resnik
[University of Maryland]
[LG] Continuous vs. Discrete Optimization of Deep Neural Networks
O Elkabetz, N Cohen
[Tel Aviv University]
[LG] Classifier-Free Diffusion Guidance
J Ho, T Salimans
[Google Research]