LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics

1、[LG] Procedural Generalization by Planning with Self-Supervised World Models

A Anand, J Walker, Y Li, E Vértes, J Schrittwieser, S Ozair, T Weber, J B. Hamrick

[DeepMind]

Procedural generalization by planning with self-supervised world models. One of the key promises of model-based reinforcement learning is the ability to generalize using an internal model of the world to make predictions in novel environments and tasks. However, the generalization ability of model-based agents is not well understood, because existing work has focused on model-free agents when measuring generalization. This paper explicitly measures the generalization ability of model-based agents in comparison with their model-free counterparts. The analysis focuses on MuZero, a powerful model-based agent, and evaluates its performance on both procedural and task generalization. Three factors of procedural generalization are identified -- planning, self-supervised representation learning, and procedural data diversity -- and by combining these techniques, state-of-the-art generalization performance and data efficiency are achieved on Procgen. However, these factors do not always provide the same benefits on the task-generalization benchmarks in Meta-World, indicating that transfer remains a challenge and may require different approaches than procedural generalization. Overall, the paper suggests that building generalizable agents requires moving beyond the single-task, model-free paradigm and towards self-supervised, model-based agents trained in rich, procedural, multi-task environments.

One of the key promises of model-based reinforcement learning is the ability to generalize using an internal model of the world to make predictions in novel environments and tasks. However, the generalization ability of model-based agents is not well understood because existing work has focused on model-free agents when benchmarking generalization. Here, we explicitly measure the generalization ability of model-based agents in comparison to their model-free counterparts. We focus our analysis on MuZero (Schrittwieser et al., 2020), a powerful model-based agent, and evaluate its performance on both procedural and task generalization. We identify three factors of procedural generalization -- planning, self-supervised representation learning, and procedural data diversity -- and show that by combining these techniques, we achieve state-of-the-art generalization performance and data efficiency on Procgen (Cobbe et al., 2019). However, we find that these factors do not always provide the same benefits for the task generalization benchmarks in Meta-World (Yu et al., 2019), indicating that transfer remains a challenge and may require different approaches than procedural generalization. Overall, we suggest that building generalizable agents requires moving beyond the single-task, model-free paradigm and towards self-supervised model-based agents that are trained in rich, procedural, multi-task environments.
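One of the three factors above, self-supervised representation learning, is typically added to MuZero-style agents as a latent-consistency objective. The sketch below is a minimal, framework-free illustration of such a loss (the function name and the exact negative-cosine form are illustrative assumptions, not the paper's implementation): the dynamics model's predicted next latent state is pulled toward the encoder's embedding of the actually observed next state.

```python
import math

def spr_loss(pred, target):
    """Negative-cosine consistency loss between the world model's predicted
    next latent state and the encoder's embedding of the observed next
    state. Returns 0 when the two directions match exactly and 2 when
    they are opposite; a real agent backpropagates this through the
    dynamics model while stop-gradienting the target."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    return 1.0 - dot / norm
```

Because the loss compares directions rather than magnitudes, scaling a latent vector leaves the loss at zero, which is why such objectives are usually paired with normalized projection heads.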

https://weibo.com/1402400261/L0M8EdP4Q

2、[CL] NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework

X Yao, Y Zheng, X Yang, Z Yang

[Tsinghua University]

NLP from scratch without large-scale pretraining: a simple and efficient framework. Pretrained language models have become the standard approach for many NLP tasks thanks to their strong performance, but they are extremely expensive to train. This paper proposes TLM, a simple and efficient learning framework that does not rely on large-scale pretraining. Given some labeled task data and a large general corpus, TLM uses the task data as queries to retrieve a tiny subset of the general corpus, and jointly optimizes the task objective and a language modeling objective from scratch. On eight classification datasets across four domains, TLM achieves results better than or comparable to pretrained language models (e.g. RoBERTa-Large) while reducing training FLOPs by two orders of magnitude. With its high accuracy and efficiency, the authors hope TLM will contribute to democratizing NLP and accelerating its development.

Pretrained language models have become the standard approach for many NLP tasks due to strong performance, but they are very expensive to train. We propose a simple and efficient learning framework, TLM, that does not rely on large-scale pretraining. Given some labeled task data and a large general corpus, TLM uses task data as queries to retrieve a tiny subset of the general corpus and jointly optimizes the task objective and the language modeling objective from scratch. On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models (e.g., RoBERTa-Large) while reducing the training FLOPs by two orders of magnitude. With high accuracy and efficiency, we hope TLM will contribute to democratizing NLP and expediting its development.
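TLM's retrieval step can be illustrated with a toy lexical ranker. The sketch below uses simple TF-IDF-weighted token overlap as a stand-in for the retrieval used in TLM; `retrieve_subset` and the scoring function are hypothetical names for illustration, not the paper's API.

```python
import math
from collections import Counter

def retrieve_subset(task_texts, corpus, k=2):
    """Score each corpus document against the labeled task data by
    TF-IDF-weighted token overlap and keep the top-k documents --
    a toy stand-in for TLM's retrieval of a tiny task-relevant
    subset of a large general corpus."""
    corpus_tokens = [set(doc.lower().split()) for doc in corpus]
    df = Counter()                       # document frequency per token
    for toks in corpus_tokens:
        df.update(toks)
    n = len(corpus)
    query = {t for text in task_texts for t in text.lower().split()}

    def score(toks):
        # rarer shared tokens contribute more to the match score
        return sum(math.log(1 + n / df[t]) for t in query & toks)

    ranked = sorted(range(n), key=lambda i: score(corpus_tokens[i]),
                    reverse=True)
    return [corpus[i] for i in ranked[:k]]
```

The retrieved subset would then be used for joint task and language-modeling training from scratch, which is where the two-orders-of-magnitude FLOP savings come from: the model never sees the rest of the general corpus.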

https://weibo.com/1402400261/L0Mey5fBO

3、[LG] Varuna: Scalable, Low-cost Training of Massive Deep Learning Models

S Athlur, N Saran, M Sivathanu, R Ramjee, N Kwatra

[Microsoft Research India]

Varuna: scalable, low-cost training of massive deep learning models. Today's systems for training massive (billions of parameters) deep learning models assume and require specialized "hyperclusters": hundreds or thousands of GPUs wired with dedicated high-bandwidth interconnects such as NVLink and InfiniBand. Besides being expensive, this dependence on hyperclusters and custom high-speed interconnects limits cluster size, creating (a) scalability limits on job parallelism and (b) resource fragmentation across hyperclusters. This paper presents Varuna, a new system that trains massive deep learning models over commodity networking. Varuna makes thrifty use of network resources and automatically configures the user's training job to use any given set of resources efficiently. As a result, Varuna can exploit "low-priority" VMs that are about 5x cheaper than dedicated GPUs, significantly reducing the cost of training massive models. The efficacy of Varuna is demonstrated by training massive models, including a 200-billion-parameter model, on 5x cheaper spot VMs while maintaining high training throughput. Even when hypercluster resources are available, Varuna shortens end-to-end training time by 20-78% compared with alternative approaches.

Systems for training massive deep learning models (billions of parameters) today assume and require specialized "hyperclusters": hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NV-Link and Infiniband. Besides being expensive, such dependence on hyperclusters and custom high-speed interconnects limits the size of such clusters, creating (a) scalability limits on job parallelism; (b) resource fragmentation across hyperclusters. In this paper, we present Varuna, a new system that enables training massive deep learning models on commodity networking. Varuna makes thrifty use of networking resources and automatically configures the user's training job to efficiently use any given set of resources. Therefore, Varuna is able to leverage "low-priority" VMs that cost about 5x less than dedicated GPUs, thus significantly reducing the cost of training massive models. We demonstrate the efficacy of Varuna by training massive models, including a 200 billion parameter model, on 5x cheaper "spot VMs", while maintaining high training throughput. Even in scenarios where hypercluster resources are available, Varuna improves end-to-end training time by 20-78% compared to alternative approaches. The code for Varuna is available at https://github.com/microsoft/varuna.
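Part of automatically configuring a training job is deciding how to partition the model across whatever workers are currently available. The sketch below is a much-simplified, hypothetical illustration of that idea (not Varuna's actual algorithm): it splits a list of per-layer compute costs into contiguous pipeline stages so that the heaviest stage is as light as possible, via binary search on a per-stage budget plus a greedy scan.

```python
def partition_stages(layer_costs, n_stages):
    """Split contiguous integer layer costs into at most n_stages pipeline
    stages, minimizing the load of the heaviest stage. A toy stand-in for
    automatic job configuration over a changing set of workers."""
    def stages_for(budget):
        # greedy: open a new stage whenever adding a layer exceeds budget
        count, cur = 1, 0
        for c in layer_costs:
            if cur + c > budget:
                count, cur = count + 1, 0
            cur += c
        return count

    lo, hi = max(layer_costs), sum(layer_costs)
    while lo < hi:                      # smallest feasible per-stage budget
        mid = (lo + hi) // 2
        if stages_for(mid) <= n_stages:
            hi = mid
        else:
            lo = mid + 1

    stages, cur = [[]], 0               # rebuild the partition at that budget
    for c in layer_costs:
        if cur + c > lo:
            stages.append([])
            cur = 0
        stages[-1].append(c)
        cur += c
    return stages
```

Re-running such a planner whenever spot VMs are preempted or added is one way a system could keep throughput high on low-priority capacity.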

https://weibo.com/1402400261/L0MjkduOB

4、[CV] Natural Adversarial Objects

F Lau, N Subramani, S Harrison, A Kim, E Branson, R Liu

[Scale AI & Allen Institute for AI & ML Collective]

Natural Adversarial Objects: a dataset. Although state-of-the-art object detection methods show compelling performance, models are often not robust to adversarial attacks and out-of-distribution data. This paper introduces a new dataset, Natural Adversarial Objects (NAO), to evaluate the robustness of object detection models. NAO contains 7,934 images and 9,943 objects that are unmodified and representative of real-world scenarios, yet cause state-of-the-art detection models to misclassify with high confidence. Compared with the standard MSCOCO validation set, the mean average precision (mAP) of EfficientDet-D7 drops by 74.5% when evaluated on NAO. Moreover, a comparison of various object detection architectures shows that better performance on the MSCOCO validation set does not necessarily translate to better performance on NAO, suggesting that robustness cannot be achieved simply by training a more accurate model. The paper further investigates why examples in NAO are hard to detect and classify. Patch-shuffling experiments reveal that models are overly sensitive to local texture. In addition, using integrated gradients and background replacement, the detection model is found to rely on pixel information inside the bounding box while being insensitive to background context when predicting class labels.

Although state-of-the-art object detection methods have shown compelling performance, models often are not robust to adversarial attacks and out-of-distribution data. We introduce a new dataset, Natural Adversarial Objects (NAO), to evaluate the robustness of object detection models. NAO contains 7,934 images and 9,943 objects that are unmodified and representative of real-world scenarios, but cause state-of-the-art detection models to misclassify with high confidence. The mean average precision (mAP) of EfficientDet-D7 drops 74.5% when evaluated on NAO compared to the standard MSCOCO validation set. Moreover, by comparing a variety of object detection architectures, we find that better performance on MSCOCO validation set does not necessarily translate to better performance on NAO, suggesting that robustness cannot be simply achieved by training a more accurate model. We further investigate why examples in NAO are difficult to detect and classify. Patch-shuffling experiments reveal that models are overly sensitive to local texture. Additionally, using integrated gradients and background replacement, we find that the detection model is reliant on pixel information within the bounding box, and insensitive to the background context when predicting class labels.
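The patch-shuffling probe can be sketched as follows: the image is cut into tiles, and the tiles are permuted while each tile's interior stays intact, so global shape is destroyed but local texture survives. A detector whose predictions persist under this corruption is leaning on texture. This is a pure-Python toy operating on a nested-list "image", assuming dimensions divisible by the patch size; it is an illustration of the idea, not the paper's evaluation code.

```python
import random

def shuffle_patches(image, patch, rng=None):
    """Split a 2-D image (list of rows) into patch x patch tiles and
    randomly permute the tiles. Local texture inside each tile is
    preserved; global spatial structure is destroyed."""
    rng = rng or random.Random(0)
    h, w = len(image), len(image[0])
    tiles = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tiles.append([row[c:c + patch] for row in image[r:r + patch]])
    rng.shuffle(tiles)
    out = [[0] * w for _ in range(h)]
    i = 0
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            for dr in range(patch):
                for dc in range(patch):
                    out[r + dr][c + dc] = tiles[i][dr][dc]
            i += 1
    return out
```

Comparing a detector's confidence on the original and shuffled versions of the same image gives a rough texture-sensitivity score.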

https://weibo.com/1402400261/L0Mncpijq

5、[RO] LILA: Language-Informed Latent Actions

S Karamcheti, M Srivastava, P Liang, D Sadigh

[Stanford University]

LILA: Language-Informed Latent Actions. This paper proposes Language-Informed Latent Actions (LILA), a framework for learning natural-language interfaces in the context of human-robot collaboration. LILA falls under the shared-autonomy paradigm: in addition to providing discrete language input, the human is given a low-dimensional controller -- e.g. a 2-degree-of-freedom (DoF) joystick that can move left/right and up/down -- for operating the robot. LILA learns to use language to modulate this controller, giving users a language-informed control space: given an instruction like "place the cereal bowl on the tray", LILA can learn a 2-DoF space in which one dimension controls the distance from the robot's end-effector to the bowl and the other controls the end-effector's pose relative to the grasp point on the bowl. LILA is evaluated in real-world user studies in which users provide language instructions while operating a 7-DoF Franka Emika Panda Arm to complete a series of complex manipulation tasks. Experiments show that LILA models are not only more sample-efficient and performant than imitation-learning and end-effector control baselines, but are also qualitatively preferred by users.

We introduce Language-Informed Latent Actions (LILA), a framework for learning natural language interfaces in the context of human-robot collaboration. LILA falls under the shared autonomy paradigm: in addition to providing discrete language inputs, humans are given a low-dimensional controller – e.g., a 2 degree-of-freedom (DoF) joystick that can move left/right and up/down – for operating the robot. LILA learns to use language to modulate this controller, providing users with a language-informed control space: given an instruction like "place the cereal bowl on the tray," LILA may learn a 2-DoF space where one dimension controls the distance from the robot's end-effector to the bowl, and the other dimension controls the robot's end-effector pose relative to the grasp point on the bowl. We evaluate LILA with real-world user studies, where users can provide a language instruction while operating a 7-DoF Franka Emika Panda Arm to complete a series of complex manipulation tasks. We show that LILA models are not only more sample efficient and performant than imitation learning and end-effector control baselines, but that they are also qualitatively preferred by users.
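The language-informed control space can be pictured as a tiny decoder: the instruction embedding selects an instruction-specific basis that turns the 2-DoF joystick input into a full 7-DoF action. The sketch below is an illustrative linear version with hypothetical names (`lila_decode`, a fixed tensor `W`); LILA's actual decoder is a learned neural network, and real instruction embeddings come from a language model.

```python
def lila_decode(z, instr_embedding, W):
    """Map a 2-DoF joystick input z to a 7-DoF robot action. W is a
    7 x 2 x d nested list standing in for learned parameters: the
    d-dim instruction embedding selects an instruction-specific 7x2
    control basis, which is then applied to z."""
    basis = [[sum(w * e for w, e in zip(row, instr_embedding))
              for row in pair]           # dot each of the 2 rows with the embedding
             for pair in W]              # -> 7 x 2 basis
    return [b[0] * z[0] + b[1] * z[1] for b in basis]
```

Changing the instruction changes the basis, so the same joystick motion produces different end-effector behavior for "place the cereal bowl on the tray" than for another command, which is the core of the shared-autonomy interface.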

https://weibo.com/1402400261/L0MqXvCS0

A few more papers worth noting:


[CV] Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis


T Shen, J Gao, K Yin, M Liu, S Fidler

[NVIDIA]

https://weibo.com/1402400261/L0Mv62baf

[CV] GAN Inversion: A Survey


W Xia, Y Zhang, Y Yang, J Xue, B Zhou, M Yang

[Tsinghua University & Northeastern University & University College London & The Chinese University of Hong Kong & University of California at Merced]

https://weibo.com/1402400261/L0MyS534H

[LG] Mixed-Integer Optimization with Constraint Learning


D Maragno, H Wiberg, D Bertsimas, S. I Birbil, D d Hertog, A Fajemisin

[University of Amsterdam & MIT]

https://weibo.com/1402400261/L0MB1iKcJ


[LG] URLB: Unsupervised Reinforcement Learning Benchmark


M Laskin, D Yarats, H Liu, K Lee, A Zhan, K Lu, C Cang, L Pinto, P Abbeel

[UC Berkeley & NYU]

https://weibo.com/1402400261/L0MCN0XJz


If any images included in this content raise copyright concerns, please contact us promptly so they can be removed.