LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: discovering faster matrix multiplication algorithms with reinforcement learning; disentangling with biological constraints; bringing web-scale diffusion models to robotics; a simple strategy for prompting language models; turning unimodal models multimodal with coupled data and no training; a non-monotonic self-terminating language model; strong vision models from alternating mobile convolution and attention; capturing and animating body and clothing from monocular video; data budgeting for machine learning
1. [LG] Discovering faster matrix multiplication algorithms with reinforcement learning
A Fawzi, M Balog, A Huang...
[DeepMind]
Improving the efficiency of algorithms for fundamental computations can have a widespread impact, as it can affect the overall speed of a vast number of computations. Matrix multiplication is one such primitive task, occurring in many systems, from neural networks to scientific computing routines. The automatic discovery of algorithms using machine learning offers the prospect of reaching beyond human intuition and outperforming the current best human-designed algorithms. However, automating the algorithm discovery procedure is intricate, as the space of possible algorithms is enormous. Here we report a deep reinforcement learning approach based on AlphaZero for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. Our agent, AlphaTensor, is trained to play a single-player game where the objective is finding tensor decompositions within a finite factor space. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor's algorithm improves on Strassen's two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago. We further showcase the flexibility of AlphaTensor through different use cases: algorithms with state-of-the-art complexity for structured matrix multiplication, and improved practical efficiency obtained by optimizing matrix multiplication for runtime on specific hardware. Our results highlight AlphaTensor's ability to accelerate the process of algorithmic discovery on a range of problems, and to optimize for different criteria.
https://nature.com/articles/s41586-022-05172-4
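For readers new to the tensor-decomposition framing, the sketch below shows the classical instance of the game AlphaTensor plays: Strassen's rank-7 decomposition of the 2×2 matrix-multiplication tensor. The factor matrices U, V, W encode Strassen's own algorithm, not one of AlphaTensor's discoveries; the point is that any valid rank-R decomposition yields, in the same mechanical way, a correct algorithm using R scalar multiplications.

```python
import numpy as np

# Strassen's rank-7 decomposition of the 2x2 matmul tensor.
# Row r of U and V gives the linear combinations of vec(A) = [A11,A12,A21,A22]
# and vec(B) that form the r-th scalar product; row r of W scatters that
# product into vec(C). AlphaTensor searches for factor triples like these.
U = np.array([[ 1, 0, 0, 1],
              [ 0, 0, 1, 1],
              [ 1, 0, 0, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [-1, 0, 1, 0],
              [ 0, 1, 0, -1]])
V = np.array([[ 1, 0, 0, 1],
              [ 1, 0, 0, 0],
              [ 0, 1, 0, -1],
              [-1, 0, 1, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [ 0, 0, 1, 1]])
W = np.array([[ 1, 0, 0, 1],
              [ 0, 0, 1, -1],
              [ 0, 1, 0, 1],
              [ 1, 0, 1, 0],
              [-1, 1, 0, 0],
              [ 0, 0, 0, 1],
              [ 1, 0, 0, 0]])

def strassen_2x2(A, B):
    m = (U @ A.reshape(4)) * (V @ B.reshape(4))  # 7 multiplications instead of 8
    return (W.T @ m).reshape(2, 2)

A, B = np.random.randn(2, 2), np.random.randn(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)
```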
2. [LG] Disentangling with Biological Constraints: A Theory of Functional Cell Types
J C.R. Whittington, W Dorrell, S Ganguli, T E.J. Behrens
[Stanford University & UCL]
Neurons in the brain are often finely tuned for specific task variables. Moreover, such disentangled representations are highly sought after in machine learning. Here we mathematically prove that simple biological constraints on neurons, namely nonnegativity and energy efficiency in both activity and weights, promote such sought-after disentangled representations by enforcing neurons to become selective for single factors of task variation. We demonstrate that these constraints lead to disentangling in a variety of tasks and architectures, including variational autoencoders. We also use this theory to explain why the brain partitions its cells into distinct cell types such as grid and object-vector cells, and when the brain instead entangles representations in response to entangled task factors. Overall, this work provides a mathematical understanding of why, when, and how neurons represent factors in both brains and machines, and is a first step towards understanding how task demands structure neural representations.
https://arxiv.org/abs/2210.01768
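As a concrete (and deliberately minimal) illustration of the constraints the paper analyzes, here is a hypothetical autoencoder training loss combining nonnegative latent activity with energy costs on activity and weights. The architecture and penalty weights are illustrative choices, not the paper's setup.

```python
import torch
import torch.nn as nn

class NonnegAutoencoder(nn.Module):
    def __init__(self, dim_in=20, dim_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim_in, 64), nn.ReLU(),
            nn.Linear(64, dim_latent), nn.ReLU())  # nonnegative latent activity
        self.decoder = nn.Sequential(
            nn.Linear(dim_latent, 64), nn.ReLU(),
            nn.Linear(64, dim_in))

def constrained_loss(model, x, beta_act=1e-3, beta_w=1e-4):
    # beta_act and beta_w are illustrative hyperparameters, not from the paper.
    z = model.encoder(x)
    x_hat = model.decoder(z)
    recon = ((x_hat - x) ** 2).mean()
    energy_act = z.sum(dim=1).mean()  # activity energy (z is already >= 0)
    energy_w = sum(p.pow(2).sum() for p in model.parameters())  # weight energy
    return recon + beta_act * energy_act + beta_w * energy_w
```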
3. [RO] DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
I Kapelyukh, V Vosylius, E Johns
[Imperial College London]
We introduce the first work to explore web-scale diffusion models for robotics. DALL-E-Bot enables a robot to rearrange objects in a scene, by first inferring a text description of those objects, then generating an image representing a natural, human-like arrangement of those objects, and finally physically arranging the objects according to that image. The significance is that we achieve this zero-shot using DALL-E, without needing any further data collection or training. Encouraging real-world results with human studies show that this is an exciting direction for the future of web-scale robot learning algorithms. We also propose a list of recommendations to the text-to-image community, to align further developments of these models with applications to robotics. Videos are available at: this https URL
https://arxiv.org/abs/2210.02438
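The abstract describes a three-stage pipeline; the stub sketch below only fixes that structure in code. Every helper it calls (describe_objects, text_to_image, match_objects, pick_and_place) is a hypothetical placeholder, not an API from the paper.

```python
# Structural sketch of the three stages described in the abstract. All helper
# functions are hypothetical placeholders; the paper's actual components
# (object detection/captioning, DALL-E, pose matching, robot control) differ.

def describe_objects(rgb_image):
    raise NotImplementedError  # e.g. detector + captioner -> "a fork, a plate, ..."

def text_to_image(description):
    raise NotImplementedError  # web-scale diffusion model (DALL-E) makes a goal image

def match_objects(rgb_image, goal_image):
    raise NotImplementedError  # yields (object, target_pose) pairs

def pick_and_place(obj, target_pose):
    raise NotImplementedError  # robot moves the object

def rearrange_scene(rgb_image):
    description = describe_objects(rgb_image)    # 1. scene -> text description
    goal_image = text_to_image(description)      # 2. text -> human-like arrangement
    for obj, pose in match_objects(rgb_image, goal_image):
        pick_and_place(obj, pose)                # 3. arrangement -> physical moves
```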
4. [CL] Ask Me Anything: A simple strategy for prompting language models
S Arora, A Narayan, M F. Chen, L J. Orr, N Guha...
[Stanford University & Numbers Station & University of Wisconsin-Madison]
Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly "perfect prompt" for a task. To mitigate the high degree of effort involved in prompt-design, we instead ask whether producing multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?") tend to outperform those that restrict the model outputs ("John went to the park. Output True or False."). Our approach recursively uses the LLM itself to transform task inputs to the effective QA format. We apply the collected prompts to obtain several noisy votes for the input's true label. We find that the prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions for the inputs. We evaluate AMA across open-source model families (e.g., Neo, BLOOM, OPT, and T0) and model sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-Neo-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-Neo-6B model outperforms few-shot GPT3-175B. We release our code here: this https URL
https://arxiv.org/abs/2210.02441
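To make the aggregation idea concrete, here is a minimal sketch: several imperfect QA-format prompts each cast a noisy vote, and the votes are combined. The `generate` callable stands in for any LLM completion call, the prompt formats are made up, and a plain majority vote replaces the weak supervision the paper actually uses.

```python
from collections import Counter

def ama_predict(generate, passage, question):
    # Several effective-but-imperfect QA-style prompt formats (illustrative).
    prompts = [
        f"{passage}\nQuestion: {question}\nAnswer:",
        f"Context: {passage}\nQ: {question}\nA:",
        f"Answer based on the text.\nText: {passage}\nQuestion: {question}\nAnswer:",
    ]
    votes = [generate(p).strip() for p in prompts]  # one noisy vote per prompt
    # The paper combines these votes with weak supervision; a majority vote is
    # the simplest stand-in.
    return Counter(votes).most_common(1)[0][0]
```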
5. [LG] ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training
A Norelli, M Fumero, V Maiorca, L Moschella, E Rodolà, F Locatello
[Sapienza University & Amazon Web Services]
Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP trains both an image and a text encoder, while LiT manages to train just the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a modest (by comparison) number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here the model is defined by the embedded pairs of all the entries in the multimodal dataset, in addition to the parameters of the two encoders. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions about their data efficiency and the role of retrieval in machine learning.
https://arxiv.org/abs/2210.01738
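A minimal numpy sketch of the relative-representation idea as we read it: given N anchor image-text pairs embedded by frozen unimodal encoders, an image and a caption become comparable through their sparsified similarity vectors against the respective anchor sets. Function names and the top-k sparsification details are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def relative_rep(z, anchors, k=16):
    """Cosine similarity of z to each anchor row, kept only at the top-k entries."""
    z = z / np.linalg.norm(z)
    A = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sims = A @ z
    sparse = np.zeros_like(sims)
    top = np.argsort(sims)[-k:]
    sparse[top] = sims[top]
    return sparse / (np.linalg.norm(sparse) + 1e-8)

def zero_shot_classify(img_emb, class_text_embs, anchor_img_embs, anchor_txt_embs):
    # The image and the candidate captions live in different embedding spaces,
    # but their relative representations w.r.t. the *paired* anchors are comparable.
    r_img = relative_rep(img_emb, anchor_img_embs)
    scores = [relative_rep(t, anchor_txt_embs) @ r_img for t in class_text_embs]
    return int(np.argmax(scores))
```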
A few more papers worth noting:
[LG] A Non-monotonic Self-terminating Language Model
E Choi, C Lee, K Cho
[New York University]
https://arxiv.org/abs/2210.00660
[CV] MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models
C Yang, S Qiao, Q Yu, X Yuan, Y Zhu, A Yuille, H Adam, L Chen
[The Johns Hopkins University & Google Research]
https://arxiv.org/abs/2210.01820
[CV] Capturing and Animation of Body and Clothing from Monocular Video
Y Feng, J Yang, M Pollefeys, M J. Black, T Bolkart
[Max Planck Institute for Intelligent Systems & ETH Zürich]
https://arxiv.org/abs/2210.01868
[LG] Data Budgeting for Machine Learning
X Zhao, W Liang, J Zou
[Tsinghua University & Stanford University]
https://arxiv.org/abs/2210.00987