LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AI - Artificial Intelligence; AS - Audio and Speech; RO - Robotics
Reposted from 爱可可爱生活
1、[LG] FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours
S Cheng, R Wu, Z Yu, B Li, X Zhang, J Peng, Y You
[National University of Singapore & HPC-AI Technology Inc & Helixon & Shanghai Jiao Tong University]
FastFold: reducing AlphaFold training time from 11 days to 67 hours. Protein structure prediction is an important method in structural biology for understanding gene translation and protein function. AlphaFold brought Transformer models to protein structure prediction with atomic-level accuracy, but because of the model's unusual performance characteristics and huge memory consumption, both training and inference are time-consuming and expensive. This paper proposes FastFold, an efficient implementation of the protein structure prediction model for training and inference, comprising a series of GPU optimizations based on a thorough analysis of AlphaFold's performance. In addition, with Dynamic Axial Parallelism and the Duality Async Operation, FastFold achieves high model-parallel scaling efficiency, surpassing existing popular model parallelism techniques. Experiments show that FastFold reduces overall training time from 11 days to 67 hours and achieves a 7.5∼9.5× speedup for long-sequence inference; scaled to 512 GPUs, it reaches an aggregate 6.02 PetaFLOPs at 90.1% parallel efficiency. FastFold greatly lowers the time and economic cost of training and serving protein structure prediction models, improving the efficiency of designing and deploying them, and Dynamic Axial Parallelism makes it feasible to design and train larger models for better accuracy.
Protein structure prediction is an important method for understanding gene translation and protein function in the domain of structural biology. AlphaFold introduced the Transformer model to the field of protein structure prediction with atomic accuracy. However, training and inference of the AlphaFold model are time-consuming and expensive because of its special performance characteristics and huge memory consumption. In this paper, we propose FastFold, a highly efficient implementation of the protein structure prediction model for training and inference. FastFold includes a series of GPU optimizations based on a thorough analysis of AlphaFold’s performance. Meanwhile, with Dynamic Axial Parallelism and Duality Async Operation, FastFold achieves high model parallelism scaling efficiency, surpassing existing popular model parallelism techniques. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves a 7.5∼9.5× speedup for long-sequence inference. Furthermore, we scaled FastFold to 512 GPUs and achieved an aggregate 6.02 PetaFLOPs with 90.1% parallel efficiency. The implementation can be found at https://github.com/hpcaitech/FastFold.
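The computation FastFold parallelizes is AlphaFold's axial (row/column) attention over the pair representation; Dynamic Axial Parallelism shards one of the two sequence axes across GPUs so each device attends along the other axis locally. Below is a minimal single-process sketch of that axial-attention pattern in plain NumPy, assuming illustrative weight matrices (w_q, w_k, w_v) and a single head; it is not the FastFold API, and the multi-GPU sharding and Duality Async Operation are only indicated in comments.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(pair, w_q, w_k, w_v, axis):
    """Single-head self-attention restricted to one axis of the (L, L, d) pair tensor.
    axis=1 attends within each row; axis=0 attends within each column."""
    x = np.swapaxes(pair, 0, 1) if axis == 0 else pair
    q, k, v = x @ w_q, x @ w_k, x @ w_v                         # (L, L, d)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])  # (L, L, L)
    out = softmax(scores) @ v                                    # (L, L, d)
    return np.swapaxes(out, 0, 1) if axis == 0 else out

L, d = 8, 16
rng = np.random.default_rng(0)
pair = rng.normal(size=(L, L, d))                # AlphaFold-style pair representation
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
# Row attention and column attention touch different axes; under Dynamic Axial
# Parallelism the *other* axis is sharded across GPUs, so the attended axis can
# be processed locally and communication is only needed when the sharded axis changes.
out_rows = axial_attention(pair, w_q, w_k, w_v, axis=1)
out_cols = axial_attention(pair, w_q, w_k, w_v, axis=0)
print(out_rows.shape, out_cols.shape)            # (8, 8, 16) (8, 8, 16)
```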
2、[LG] Bayesian Structure Learning with Generative Flow Networks
T Deleu, A Góis, C Emezue, M Rankawat, S Lacoste-Julien, S Bauer, Y Bengio
[Mila & Technical University of Munich & KTH Stockholm]
Bayesian structure learning with generative flow networks. In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of a Bayesian network from data. Defining such a distribution is very challenging because of the combinatorially large sample space, and approximations based on MCMC are usually required. Recently, a new class of probabilistic models, Generative Flow Networks (GFlowNets), was proposed as a general framework for generative modeling of discrete, compositional objects such as graphs. This paper proposes using a GFlowNet as an alternative to MCMC for approximating the posterior distribution over Bayesian network structures given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem in which the graph is built one edge at a time according to learned transition probabilities. Evaluations on both simulated and real data show that the proposed method, DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs and compares favorably against other methods based on MCMC or variational inference.
In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.
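The abstract frames DAG generation as a sequential decision problem: starting from an empty graph, repeatedly add one edge (chosen by the learned policy, subject to acyclicity) or stop. The sketch below shows that construction loop with a uniform stand-in policy in place of a trained GFlowNet; `sample_dag` and `creates_cycle` are hypothetical helper names for illustration, not code from the paper.

```python
import numpy as np

def creates_cycle(adj, i, j):
    """True if adding edge i -> j would close a directed cycle (j already reaches i)."""
    n = adj.shape[0]
    reach = adj.copy()
    for _ in range(n):                        # propagate reachability; n steps suffice for n nodes
        reach = ((reach + reach @ adj) > 0).astype(int)
    return i == j or bool(reach[j, i])

def sample_dag(n_nodes, policy=None, rng=None):
    """Build a DAG one edge at a time, mirroring the paper's sequential decision view.
    `policy(adj, actions)` should return probabilities over the valid 'add edge (i, j)'
    actions plus a final 'stop' action; a uniform stand-in replaces the trained GFlowNet."""
    rng = rng or np.random.default_rng(0)
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    while True:
        valid = [(i, j) for i in range(n_nodes) for j in range(n_nodes)
                 if adj[i, j] == 0 and not creates_cycle(adj, i, j)]
        actions = valid + ["stop"]
        probs = np.full(len(actions), 1.0 / len(actions)) if policy is None else policy(adj, actions)
        choice = actions[rng.choice(len(actions), p=probs)]
        if choice == "stop":
            return adj                        # one complete sample from the (approximate) posterior
        adj[choice] = 1

print(sample_dag(4))
```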
3、[CL] Advancing an Interdisciplinary Science of Conversation: Insights from a Large Multimodal Corpus of Human Speech
A Reece, G Cooney, P Bull, C Chung, B Dawson, C Fitzpatrick, T Glazer, D Knox, A Liebscher, S Marin
[BetterUp Inc & University of Pennsylvania & DrivenData Inc]
Advancing an interdisciplinary science of conversation: insights from a large multimodal corpus of human speech. People spend a substantial portion of their lives in conversation, yet our scientific understanding of conversation is still in its infancy. This report advances an interdisciplinary science of conversation with findings from a large, novel, multimodal corpus of spoken English conversations. The 7+ million word, 850-hour corpus totals over 1 TB of audio, video, and transcripts, including moment-to-moment measures of vocal, facial, and semantic expression, together with an extensive survey of speakers' post-conversation reflections. Leveraging the considerable scope of the corpus, the authors (1) extend key findings from the literature, such as the cooperativeness of human turn-taking; (2) define new algorithmic procedures for segmenting speech into conversational turns; (3) apply machine learning insights across textual, auditory, and visual features to analyze what makes conversations succeed or fail; and (4) explore how conversations relate to people's well-being across the lifespan. They also report (5) a comprehensive mixed-method account, based on quantitative analysis and qualitative review of each recording, showing how individuals from diverse backgrounds alter their communication patterns and find ways to connect. The paper closes with a discussion of how this large-scale public dataset may open new directions for future research, especially across disciplinary boundaries, as scholars from many fields appear increasingly interested in the study of conversation.
People spend a substantial portion of their lives engaged in conversation—and yet our scientific understanding of conversation is still in its infancy. In this report we advance an interdisciplinary science of conversation, with findings from a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850-hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speakers’ post-conversation reflections. We leverage the considerable scope of the corpus to: (1) extend key findings from the literature, such as the cooperativeness of human turn-taking; (2) define novel algorithmic procedures for the segmentation of speech into conversational turns; (3) apply machine learning insights across various textual, auditory, and visual features to analyze what makes conversations succeed or fail; and (4) explore how conversations are related to people’s well-being across the lifespan. We also report (5) a comprehensive mixed-method report, based on quantitative analysis and qualitative review of each recording, that showcases how individuals from diverse backgrounds alter their communication patterns and find ways to connect. We conclude with a discussion of how this large-scale public dataset may offer new directions for future research, especially across disciplinary boundaries, as scholars from a variety of fields appear increasingly interested in the study of conversation.
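The abstract mentions new algorithmic procedures for segmenting speech into conversational turns but does not describe them. Purely as a point of reference for what such procedures improve on, here is a naive baseline that merges a time-aligned transcript into turns on a speaker change or a long pause; the `Word` record and the 1-second pause threshold are assumptions for illustration, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class Word:
    speaker: str
    text: str
    start: float   # seconds
    end: float

def naive_turns(words, max_pause=1.0):
    """Group a time-aligned transcript into turns: start a new turn when the
    speaker changes or after a silence longer than `max_pause` seconds."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w.speaker and w.start - turns[-1]["end"] <= max_pause:
            turns[-1]["text"] += " " + w.text
            turns[-1]["end"] = w.end
        else:
            turns.append({"speaker": w.speaker, "text": w.text, "start": w.start, "end": w.end})
    return turns

demo = [Word("A", "hi", 0.0, 0.3), Word("A", "there", 0.35, 0.6), Word("B", "hey", 0.7, 0.9)]
print(naive_turns(demo))   # two turns: A "hi there", B "hey"
```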
4、[AI] The Quest for a Common Model of the Intelligent Decision Maker
R S. Sutton
[University of Alberta]
The quest for a common model of the intelligent decision maker. The premise of the Multi-disciplinary Conference on Reinforcement Learning and Decision Making is that multiple disciplines share an interest in goal-directed decision making over time. This paper aims to sharpen and deepen that premise by proposing a substantive view of the decision maker that is widely held across psychology, artificial intelligence, economics, control theory, and neuroscience, called the common model of the intelligent agent. The common model includes nothing specific to any organism, world, or application domain. It does include the decision maker's interaction with its world (there must be input and output, and a goal) and the decision maker's internal components (for perception, decision-making, internal evaluation, and a world model). The paper identifies these aspects and components, notes that they are given different names in different disciplines but refer essentially to the same ideas, and discusses the challenges and benefits of devising a neutral terminology usable across disciplines. It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.
The premise of Multi-disciplinary Conference on Reinforcement Learning and Decision Making is that multiple disciplines share an interest in goal-directed decision making over time. The idea of this paper is to sharpen and deepen this premise by proposing a perspective on the decision maker that is substantive and widely held across psychology, artificial intelligence, economics, control theory, and neuroscience, which I call the common model of the intelligent agent. The common model does not include anything specific to any organism, world, or application domain. The common model does include aspects of the decision maker’s interaction with its world (there must be input and output, and a goal) and internal components of the decision maker (for perception, decision-making, internal evaluation, and a world model). I identify these aspects and components, note that they are given different names in different disciplines but refer essentially to the same ideas, and discuss the challenges and benefits of devising a neutral terminology that can be used across disciplines. It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.
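The four internal components named in the abstract (perception, decision-making, internal evaluation, and a world model) plus the agent-world interface (input, output, and a goal expressed as reward) can be read as an architectural contract. The sketch below expresses that contract as a minimal Python class with placeholder learning rules; every concrete choice (tabular values, a deterministic one-step model, the 0.1 step size) is an illustrative assumption, not something specified in the paper.

```python
import random

class CommonModelAgent:
    """Minimal sketch of the four components: perception, decision-making (policy),
    internal evaluation (value estimate), and a world model."""

    def __init__(self, actions):
        self.actions = actions
        self.value = {}   # internal evaluation: state -> estimated long-term reward
        self.model = {}   # world model: (state, action) -> predicted next state

    def perceive(self, observation):
        # Perception: map the raw observation to an internal state representation.
        return tuple(observation)

    def decide(self, state):
        # Decision-making: pick the action whose model-predicted next state is valued highest.
        return max(self.actions, key=lambda a: self.value.get(self.model.get((state, a)), 0.0))

    def update(self, state, action, reward, next_state):
        # Learn the world model and the internal evaluation from experience.
        self.model[(state, action)] = next_state
        v = self.value.get(state, 0.0)
        self.value[state] = v + 0.1 * (reward + self.value.get(next_state, 0.0) - v)

# Agent-world interaction loop: input (observation), output (action), goal (reward).
agent = CommonModelAgent(actions=["left", "right"])
state = agent.perceive([0])
for t in range(5):
    action = agent.decide(state)
    observation, reward = [t + 1], random.random()   # stand-in world response
    next_state = agent.perceive(observation)
    agent.update(state, action, reward, next_state)
    state = next_state
```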
5、[CV] TableFormer: Table Structure Understanding with Transformers
A Nassar, N Livathinos, M Lysak, P Staar
[IBM Research]
TableFormer: table structure understanding with Transformers. Tables organize valuable content in a concise, compact representation. This content is extremely valuable for systems such as search engines and knowledge graphs, since it enhances their predictive capabilities. Unfortunately, tables come in a wide variety of shapes and sizes and may have complex column/row-header configurations, multi-line rows, different kinds of separation lines, missing entries, and so on. Correctly recognizing the table structure from an image is therefore a nontrivial task. This paper presents a new table-structure recognition model: an end-to-end Transformer-based approach that predicts both the table structure and the cell bounding boxes from an image, improving the latest end-to-end deep learning model (the encoder-dual-decoder from PubTabNet) in two significant ways. First, a new object-detection decoder for table cells is introduced, so that table-cell content can be obtained directly from programmatic PDFs, avoiding the training of a custom OCR decoder; this architectural change yields more accurate table-content extraction and makes it possible to handle non-English tables. Second, replacing the LSTM decoders with Transformer-based decoders substantially improves the previous state-of-the-art tree-editing-distance score (TEDS), from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables. The paper also introduces SynthTabNet, a challenging synthetic dataset that reinforces features missing from other datasets.
Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines and knowledge graphs, since it enhances their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multi-line rows, different kinds of separation lines, missing entries, etc. As such, correct identification of the table structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. It improves the latest end-to-end deep learning model (i.e. the encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object-detection decoder for table cells. In this way, we can obtain the content of the table cells from programmatic PDFs directly from the PDF source and avoid training custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-English tables. Second, we replace the LSTM decoders with Transformer-based decoders. This upgrade significantly improves the previous state-of-the-art tree-editing-distance score (TEDS), from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.
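To make the encoder/dual-decoder description concrete, here is a rough PyTorch skeleton: a toy convolutional encoder over the table image, a Transformer decoder that consumes the structure-tag history, a head that predicts the next structure tag, and a head that regresses a bounding box per predicted cell. For brevity a single decoder with two heads stands in for the paper's separate object-detection decoder for cells, and all layer sizes and the tag vocabulary are made up for illustration.

```python
import torch
import torch.nn as nn

class TableFormerSketch(nn.Module):
    """Rough skeleton of the encoder / dual-decoder idea described in the abstract."""

    def __init__(self, d_model=128, n_tags=32):
        super().__init__()
        self.backbone = nn.Sequential(                 # toy CNN encoder over the table image
            nn.Conv2d(3, d_model, kernel_size=16, stride=16), nn.ReLU())
        self.tag_embed = nn.Embedding(n_tags, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.structure_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.tag_head = nn.Linear(d_model, n_tags)     # next structure tag (e.g. <td>, <tr>)
        self.bbox_head = nn.Linear(d_model, 4)         # cell box (x, y, w, h), normalized to [0, 1]

    def forward(self, image, tag_history):
        feats = self.backbone(image).flatten(2).transpose(1, 2)   # (B, HW, d_model) memory
        queries = self.tag_embed(tag_history)                     # (B, T, d_model)
        hidden = self.structure_decoder(queries, feats)
        return self.tag_head(hidden), self.bbox_head(hidden).sigmoid()

model = TableFormerSketch()
logits, boxes = model(torch.randn(1, 3, 256, 256), torch.zeros(1, 10, dtype=torch.long))
print(logits.shape, boxes.shape)   # (1, 10, 32), (1, 10, 4)
```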
A few more papers worth noting:
[CL] TableFormer: Robust Transformer Modeling for Table-Text Encoding
TableFormer: robust Transformer modeling for table-text encoding
J Yang, A Gupta, S Upadhyay, L He, R Goel, S Paul
[Georgia Institute of Technology & Google Assistant]
[CL] Learning English with Peppa Pig
Bimodal spoken-English learning from Peppa Pig
M Nikolaus, A Alishahi, G Chrupała
[Aix-Marseille University & Tilburg University]
[CL] Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale
Transformer Grammars: augmenting Transformer language models with syntactic inductive biases at scale
L Sartran, S Barrett, A Kuncoro, M Stanojević, P Blunsom, C Dyer
[DeepMind & University of Oxford]
[LG] Path sampling of recurrent neural networks by incorporating known physics
Path sampling of recurrent neural networks by incorporating known physics
S Tsai, E Fields, P Tiwary
[University of Maryland]