LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: benchmarking Graphormer on large-scale molecular modeling datasets; SELFIES and the future of molecular string representations; composing zero-shot multimodal reasoning with language; from statistical to causal learning; an interactive visualization tool for interpreting vision-language Transformers; a roadmap for big models; probing BERT's priors with serial reproduction chains; embodied adaptive object detection; poisoning machine learning models to reveal their secrets
1、[LG] Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets
Y Shi, S Zheng, G Ke, Y Shen, J You, J He, S Luo, C Liu, D He, T Liu
[Microsoft Research Asia & HKUST & Tsinghua University & USTC & Peking University]
Benchmarking Graphormer on large-scale molecular modeling datasets. This note presents recent updates to Graphormer, including modifications to the architecture design and its adaptation to 3D molecular dynamics simulation. With these simple modifications, Graphormer attains better results on large-scale molecular modeling datasets than the vanilla architecture, and the performance gains hold consistently across 2D and 3D molecular graph modeling tasks. With a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs. Empirically, Graphormer achieves a much smaller MAE than the originally reported results on the PCQM4M quantum chemistry dataset used in KDD Cup 2021, and it greatly outperformed the competition in the recent Open Catalyst Challenge, a competition track of a NeurIPS 2021 workshop aimed at modeling catalyst-adsorbate reaction systems with advanced AI models.
This technical note describes the recent updates of Graphormer, including architecture design modifications, and the adaptation to 3D molecular dynamics simulation. With these simple modifications, Graphormer could attain better results on large-scale molecular modeling datasets than the vanilla one, and the performance gain could be consistently obtained on 2D and 3D molecular graph modeling tasks. In addition, we show that with a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs. Empirically, Graphormer could achieve much less MAE than the originally reported results on the PCQM4M quantum chemistry dataset used in KDD Cup 2021. Meanwhile, it greatly outperforms the competitors in the recent Open Catalyst Challenge, which is a competition track on a NeurIPS 2021 workshop, and aims to model the catalyst-adsorbate reaction system with advanced AI models. All codes could be found at this https URL.
https://arxiv.org/abs/2203.04810
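Graphormer's "global receptive field" can be made concrete: every atom attends to every other atom, but a learnable bias indexed by shortest-path distance is added to the attention logits (Graphormer's spatial encoding). A minimal pure-Python sketch; the toy graph and the bias values are invented for illustration, and real models learn the bias and use dense linear algebra:

```python
import math

def bfs_spd(adj, src):
    """Shortest-path distances from src over an unweighted graph (adjacency lists)."""
    dist = {src: 0}
    frontier = [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist

def spatial_bias(adj, b):
    """Bias matrix B[i][j] = b[spd(i, j)], added to the attention logits.
    b plays the role of the learnable per-distance scalar (here a fixed list)."""
    n = len(adj)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dist = bfs_spd(adj, i)
        for j in range(n):
            B[i][j] = b[dist[j]]
    return B

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy 4-node path graph 0-1-2-3: every node attends to every other node
# (global receptive field), but logits are shifted by the distance bias.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
b = [1.0, 0.5, 0.0, -0.5]               # one scalar per distance 0..3
logits = [[0.0] * 4 for _ in range(4)]  # pretend q·k/sqrt(d) = 0 everywhere
B = spatial_bias(adj, b)
attn_row0 = softmax([logits[0][j] + B[0][j] for j in range(4)])
```

With uniform content logits, the attention of node 0 decays monotonically with graph distance, which is how the bias lets the model favor (or, with other learned values, ignore) structurally close atoms while still seeing the whole graph.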
2、[LG] SELFIES and the future of molecular string representations
M Krenn, Q Ai, S Barthel, N Carson...
[Max Planck Institute for the Science of Light (MPL) & Fordham University & Vrije Universiteit Amsterdam...]
SELFIES and the future of molecular string representations. Artificial intelligence (AI) and machine learning (ML) are increasingly applied to challenging tasks in chemistry and materials science; typical examples include property prediction, the discovery of new reaction pathways, and the design of new molecules. For these tasks, machines need to read and write fluently in a chemical language. Strings are a common tool for representing molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, for AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this problem, SELFIES (SELF-referencIng Embedded Strings), a new language for molecules that guarantees 100% robustness, was introduced in 2020. SELFIES has since simplified and enabled many new applications in chemistry. This paper looks to the future and discusses molecular string representations along with their respective opportunities and challenges, proposing 16 concrete future projects for robust molecular representations. These involve extensions toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. The authors hope these proposals will inspire follow-up work that exploits the full potential of molecular string representations for the future of AI in chemistry and materials science.
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
https://arxiv.org/abs/2204.00056
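The "100% robustness" claim means every syntactically possible string decodes to a valid molecule, because the decoder interprets each token relative to the remaining valence of what it has already built (the reference implementation is the `selfies` Python package with its `encoder`/`decoder` functions). A toy mini-language sketching that mechanism; the token format, valence table, and clipping rule here are invented simplifications, not the actual SELFIES grammar:

```python
import random

VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(tokens):
    """Toy SELFIES-style decoder: every token sequence yields a valence-valid
    molecule, because requested bond orders are clipped to remaining valence.
    tokens: (element, requested bond order); returns (atoms, bonds) with
    bonds given as (i, j, order)."""
    atoms, bonds = [], []
    free = []                      # remaining valence per atom
    for sym, want in tokens:
        if atoms and free[-1] == 0:
            continue               # previous atom saturated: token is a no-op
        order = min(want, VALENCE[sym], free[-1]) if atoms else 0
        atoms.append(sym)
        free.append(VALENCE[sym] - order)
        if order:
            free[-2] -= order      # previous atom spends valence on the bond
            bonds.append((len(atoms) - 2, len(atoms) - 1, order))
    return atoms, bonds

def is_valid(atoms, bonds):
    """Check no atom exceeds its valence -- the robustness property."""
    used = [0] * len(atoms)
    for i, j, o in bonds:
        used[i] += o
        used[j] += o
    return all(used[k] <= VALENCE[a] for k, a in enumerate(atoms))

# Any random token string decodes to a valid molecule; a random SMILES
# string, by contrast, is almost always syntactically or chemically invalid.
random.seed(0)
alphabet = [(s, o) for s in VALENCE for o in (1, 2, 3)]
for _ in range(100):
    toks = [random.choice(alphabet) for _ in range(12)]
    assert is_valid(*decode(toks))
```

The design point carried over from real SELFIES: validity is enforced by the decoder's semantics rather than by rejecting strings, so any generative model emitting these tokens cannot produce chemically broken output.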
3、[CV] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
A Zeng, A Wong, S Welker, K Choromanski, F Tombari, A Purohit, M Ryoo, V Sindhwani, J Lee, V Vanhoucke, P Florence
[Google]
Socratic Models: composing zero-shot multimodal reasoning with language. Large foundation models can exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may overlap only barely. For example, visual-language models (VLMs) are trained on Internet-scale image captions, while large language models (LMs) are trained on Internet-scale text with no images. As a result, these models store different forms of commonsense knowledge across different domains. This paper shows that this diversity of modalities is symbiotic and can be leveraged to build AI systems with structured Socratic dialogue, in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning. In the context of egocentric perception, the paper presents a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks, such as generating free-form answers to questions about egocentric video, by formulating video Q&A as short-story Q&A: summarizing the video into a short story and then answering questions about it. In addition, SMs can generate captions for Internet images and are competitive with the state of the art on zero-shot video-to-text retrieval, with 42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models zero-shot to capture new multimodal capabilities, without domain-specific data collection.
Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g. from spreadsheets, to SAT questions). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue -- in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning. In the context of egocentric perception, we present a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks such as generating free-form answers to contextual questions about egocentric video, by formulating video Q&A as short story Q&A, i.e. summarizing the video into a short story, then answering questions about it. Additionally, SMs can generate captions for Internet images, and are competitive with state-of-the-art on zero-shot video-to-text retrieval with 42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models zero-shot to capture new multimodal functionalities, without domain-specific data collection. Prototypes are available at this http URL.
https://arxiv.org/abs/2204.00598
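The "video Q&A as short-story Q&A" formulation is a plain language-based handoff between models. A heavily hedged sketch of that composition pattern: `vlm` and `lm` below are hypothetical stubs standing in for a real captioner and a real language model (the captions, frame names, and prompt template are all invented), so only the structure of the exchange is meaningful:

```python
# Hypothetical stand-ins for foundation models: a real vlm() would caption
# video frames and a real lm() would generate text from a prompt. The stubs
# let the composition itself run end to end.
def vlm(frame):
    captions = {
        "frame_0": "a person opens the fridge",
        "frame_1": "a person pours milk into a glass",
        "frame_2": "a person drinks from the glass",
    }
    return captions[frame]

def lm(prompt):
    # Stub "LM": answers only from the story text inside the prompt,
    # mimicking the fact that the LM never sees pixels, only language.
    story = prompt.split("Story: ")[1].split("\nQ:")[0]
    return "milk" if "milk" in story else "unknown"

def video_qa(frames, question):
    """Socratic-style composition, no finetuning:
    1) the VLM turns frames into text; 2) the LM answers over that text."""
    story = " ".join(vlm(f) for f in frames)
    prompt = f"Story: {story}\nQ: {question}\nA:"
    return lm(prompt)

answer = video_qa(["frame_0", "frame_1", "frame_2"], "What did the person pour?")
```

The interface between the two models is nothing but a prompt string, which is what makes the approach zero-shot: swapping in different pretrained models requires no gradient updates, only prompt plumbing.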
4、[LG] From Statistical to Causal Learning
B Schölkopf, J v Kügelgen
[Max Planck Institute for Intelligent Systems]
From statistical to causal learning. This paper describes the basic ideas underlying research on building and understanding artificially intelligent systems: from symbolic approaches, via statistical learning, to interventional models relying on concepts of causality. Some of the hard open problems of machine learning and AI are intrinsically related to causality, and progress may require advances in our understanding of how to model and infer causality from data.
We describe basic ideas underlying research to build and understand artificially intelligent systems: from symbolic approaches via statistical learning to interventional models relying on concepts of causality. Some of the hard open problems of machine learning and AI are intrinsically related to causality, and progress may require advances in our understanding of how to model and infer causality from data.
https://arxiv.org/abs/2204.00607
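The gap between statistical and interventional models can be shown in a few lines: in a structural causal model with a confounder, the observational quantity E[Y | X=x] differs from the interventional E[Y | do(X=x)], because intervening cuts the edge from the confounder into X. A simulation sketch; the linear SCM and its coefficients are invented for illustration:

```python
import random

random.seed(0)

def sample(do_x=None):
    """Linear SCM with a confounder: Z -> X, Z -> Y, X -> Y.
    Passing do_x fixes X by intervention, severing the Z -> X edge."""
    z = random.gauss(0, 1)
    x = z + 0.1 * random.gauss(0, 1) if do_x is None else do_x
    y = 2.0 * x + 3.0 * z + 0.1 * random.gauss(0, 1)
    return x, y

N = 100_000
# Observational: E[Y | X ~ 1] is inflated, since large X usually means large Z.
obs = [y for x, y in (sample() for _ in range(N)) if abs(x - 1.0) < 0.1]
cond_mean = sum(obs) / len(obs)          # ~ 2 + 3 * E[Z | X ~ 1] ~ 5
# Interventional: E[Y | do(X = 1)] isolates the causal effect of X on Y.
intv = [sample(do_x=1.0)[1] for _ in range(N)]
do_mean = sum(intv) / len(intv)          # ~ 2, the true causal coefficient
```

Here conditioning mixes the causal path X -> Y with the confounded path Z -> Y, while the do-operation recovers the causal coefficient 2.0 alone; this is the distinction the paper's interventional models are built around.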
5、[CV] VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
E Aflalo, M Du, S Tseng, Y Liu, C Wu, N Duan, V Lal
[Intel Labs & Microsoft Research]
VL-InterpreT: an interactive visualization tool for interpreting vision-language Transformers. Breakthroughs in Transformer-based models have revolutionized not only NLP but also vision and multimodal systems. However, while visualization and interpretability tools are available for NLP models, the internal mechanisms of vision and multimodal Transformers remain largely opaque. With the success of these Transformers, understanding their inner workings is increasingly critical, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, the paper proposes VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations of multimodal Transformers. VL-InterpreT is a task-agnostic, integrated tool that (1) tracks a variety of statistics over the attention heads in all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attention through easy-to-read heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the Transformer layers. The paper demonstrates the functionality of VL-InterpreT through an analysis of KD-VLP, an end-to-end pretrained vision-language multimodal Transformer, on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. It also presents several interesting findings about multimodal Transformer behavior learned through the proposed tool.
Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic and integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretraining vision-language multimodal transformer-based model, in the tasks of Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. Furthermore, we also present a few interesting findings about multimodal transformer behaviors that were learned through our tool.
https://arxiv.org/abs/2203.17247
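The cross-modal vs. intra-modal split in items (1) and (2) boils down to partitioning one head's attention matrix by token modality. A minimal sketch of that statistic; the 4-token attention matrix and modality labels are invented toy data, not output of any real model:

```python
def attention_stats(attn, modality):
    """Split one head's attention matrix into intra- and cross-modal mass.
    attn[i][j]: attention from token i to token j (each row sums to 1);
    modality[i]: "vision" or "language" label for token i.
    Returns average per-token (intra, cross) attention mass."""
    intra = cross = 0.0
    n = len(attn)
    for i in range(n):
        for j in range(n):
            if modality[i] == modality[j]:
                intra += attn[i][j]
            else:
                cross += attn[i][j]
    return intra / n, cross / n

# Toy example: two language tokens followed by two vision tokens.
attn = [
    [0.4, 0.4, 0.1, 0.1],
    [0.3, 0.3, 0.2, 0.2],
    [0.1, 0.1, 0.4, 0.4],
    [0.2, 0.2, 0.3, 0.3],
]
modality = ["language", "language", "vision", "vision"]
intra, cross = attention_stats(attn, modality)
```

Tracking this pair per head and per layer is the kind of summary a tool like VL-InterpreT can plot to show, for instance, which heads specialize in binding image regions to words.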
A few more papers worth noting:
[LG] A Roadmap for Big Model
S Yuan, H Zhao, S Zhao, J Leng, Y Liang...
[Beijing Academy of Artificial Intelligence & Tsinghua University & Wechat...]
https://arxiv.org/abs/2203.14101
[CL] Probing BERT's priors with serial reproduction chains
T Yamakoshi, T L. Griffiths, R D. Hawkins
[Princeton University]
https://arxiv.org/abs/2202.12226
[CV] Interactron: Embodied Adaptive Object Detection
K Kotar, R Mottaghi
[Allen Institute for AI]
https://arxiv.org/abs/2202.00660
[LG] Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets
F Tramèr, R Shokri, A S Joaquin, H Le...
[Google & National University of Singapore & Yale-NUS College & Oregon State University]
https://arxiv.org/abs/2204.00032