LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
Reposted from 爱可可爱生活
Summary: self-consistency improves chain-of-thought reasoning in language models; NeRF-free neural rendering from few images using Transformers; towards bug identification in gameplay videos via zero-shot transfer learning; operational challenges for widespread edge AI adoption; a morphology-aware Kinyarwanda language model; accelerated Bayesian spectral energy distribution (SED) modeling using amortized neural posterior estimation; electronic excited states in deep variational Monte Carlo; rethinking the context-oriented generalization of Vision Transformers; attention bottlenecks for multimodal fusion.
1、[CL] Self-Consistency Improves Chain of Thought Reasoning in Language Models
X Wang, J Wei, D Schuurmans, Q Le, E Chi, D Zhou
[Google Research]
Self-consistency improves chain-of-thought reasoning in language models. This paper explores a simple ensemble strategy, self-consistency, that significantly improves the reasoning accuracy of large language models. The idea is to sample a diverse set of outputs from the language model and return the most consistent answer in the set. Combined with chain-of-thought prompting, this ensembling improves reasoning accuracy. On arithmetic and commonsense reasoning benchmarks, self-consistency yields significant improvements across a variety of datasets, such as GSM8K (+10%), SVAMP (+14%), MultiArith (+24%), CommonsenseQA (+5%), and ARC (easy +4%, challenge +5%). Beyond the clear performance gains, this work also opens the possibility of collecting rationales when language models perform reasoning tasks, and of providing uncertainty estimates and calibration for language models.
We explore a simple ensemble strategy, self-consistency, that significantly improves the reasoning accuracy of large language models. The idea is to sample a diverse set of outputs from a language model and return the most consistent answer in the set. Such an ensembling method improves reasoning accuracy when combined with chain-of-thought prompting. For arithmetic and commonsense reasoning benchmarks, we find that self-consistency yields significant accuracy improvements in a variety of datasets, such as GSM8K (+10%), SVAMP (+14%), MultiArith (+24%), CommonsenseQA (+5%) and ARC (easy +4%, challenge +5%).
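The core procedure is simple enough to show as a sketch. Below is a minimal, hypothetical illustration of self-consistency decoding (not the authors' code): `generate_cot` is an assumed helper that samples one chain-of-thought completion from a language model and returns only its final answer, and the most frequent answer over several samples is returned.

```python
from collections import Counter

def self_consistent_answer(prompt, generate_cot, num_samples=40, temperature=0.7):
    """Sample several reasoning paths and return the majority-vote final answer.

    `generate_cot(prompt, temperature)` is a hypothetical helper that draws one
    chain-of-thought completion from a language model and returns its final answer.
    """
    answers = [generate_cot(prompt, temperature=temperature) for _ in range(num_samples)]
    # The "most consistent" answer is the one that occurs most often across samples.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```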
2、[CV] ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers
J Kulhánek, E Derner, T Sattler, R Babuška
[Czech Technical University in Prague]
ViewFormer: NeRF-free neural rendering from few images using Transformers. Novel view synthesis is a long-standing problem. This paper considers a variant in which only a few context views sparsely covering a scene or object are given; the goal is to predict novel viewpoints in the scene and to generalize to a new scene from only a handful of images, which requires learning priors. The current state of the art is based on Neural Radiance Fields (NeRF); while it achieves impressive results, training is slow because thousands of 3D point samples must be evaluated through a deep neural network for each image. This paper proposes a purely 2D method that maps multiple context views and a query pose to a new image in a single pass of a neural network. The model uses a two-stage architecture consisting of a codebook and a Transformer: the codebook embeds individual images into a smaller latent space, and the Transformer solves the view-synthesis task in this more compact space. To train the model efficiently, a novel branching attention mechanism is introduced that allows the same model to be used not only for neural rendering but also for camera pose estimation. The proposed method, ViewFormer, can render a previously unseen scene view in 93 ms without any 3D reasoning. Experiments on real-world scenes show that the approach is competitive with NeRF-based methods while not reasoning in 3D, and it is faster to train.
Novel view synthesis is a long-standing problem. In this work, we consider a variant of the problem where we are given only a few context views sparsely covering a scene or an object. The goal is to predict novel viewpoints in the scene, which requires learning priors. The current state of the art is based on Neural Radiance Fields (NeRFs), and while achieving impressive results, the methods suffer from long training times as they require evaluating thousands of 3D point samples via a deep neural network for each image. We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network. Our model uses a two-stage architecture consisting of a codebook and a transformer model. The codebook is used to embed individual images into a smaller latent space, and the transformer solves the view synthesis task in this more compact space. To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation. Experimental results on real-world scenes show that our approach is competitive compared to NeRF-based methods while not reasoning in 3D, and it is faster to train.
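The two-stage pipeline can be sketched as follows. This is a hedged illustration, not the authors' implementation: `codebook` and `transformer` are hypothetical modules standing in for the paper's two stages, and all shapes and interfaces are assumptions.

```python
import torch

def render_novel_view(codebook, transformer, context_images, context_poses, query_pose):
    """Render an unseen view in a single forward pass (assumed interfaces).

    context_images: (N, 3, H, W) tensor of context views; poses are per-view
    camera parameters. `codebook` and `transformer` are hypothetical modules.
    """
    with torch.no_grad():
        # Stage 1: embed each context view into a compact latent code.
        context_codes = torch.stack([codebook.encode(img) for img in context_images])
        # Stage 2: attend over the context codes and poses plus the query pose,
        # predicting the latent code of the query view without any 3D reasoning.
        query_code = transformer(context_codes, context_poses, query_pose)
        # Decode the predicted code back into an image.
        return codebook.decode(query_code)
```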
3、[CV] CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
M R Taesiri, F Macklon, C Bezemer
[University of Alberta]
CLIP meets GamePhysics: towards bug identification in gameplay videos using zero-shot transfer learning. Gameplay videos contain rich information about how players interact with a game and how the game responds. Sharing gameplay videos on social media platforms such as Reddit has become a common practice for many players, who often share videos showcasing video game bugs. Such gameplay videos can be used for game testing, as they provide insight for bug analysis. Although large repositories of gameplay videos exist, parsing and mining them in an effective and structured way remains a major challenge. This paper proposes a search method that takes any English text query as input and retrieves relevant videos from a large gameplay-video repository. The method does not rely on any external information (such as video metadata); it works solely on the content of the videos. By leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-training (CLIP) model, it requires no data labeling or training. To evaluate the method, the GamePhysics dataset is introduced, consisting of 26,954 videos from 1,873 games collected from the GamePhysics section of Reddit. In an extensive analysis of simple, compound, and bug queries, the method shows promising results, indicating that it is useful for object and event detection in gameplay videos. One example application is a gameplay-video search engine that helps reproduce video game bugs. The method lays the groundwork for zero-shot bug identification in video games with contrastive learning models.
Gameplay videos contain rich information about how players interact with the game and how the game responds. Sharing gameplay videos on social media platforms, such as Reddit, has become a common practice for many players. Often, players will share gameplay videos that showcase video game bugs. Such gameplay videos are software artifacts that can be utilized for game testing, as they provide insight for bug analysis. Although large repositories of gameplay videos exist, parsing and mining them in an effective and structured fashion has still remained a big challenge. In this paper, we propose a search method that accepts any English text query as input to retrieve relevant videos from large repositories of gameplay videos. Our approach does not rely on any external information (such as video metadata); it works solely based on the content of the video. By leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-Training (CLIP) model, our approach does not require any data labeling or training. To evaluate our approach, we present the GamePhysics dataset consisting of 26,954 videos from 1,873 games, that were collected from the GamePhysics section on the Reddit website. Our approach shows promising results in our extensive analysis of simple queries, compound queries, and bug queries, indicating that our approach is useful for object and event detection in gameplay videos. An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs. Please visit the following link for the code and the data: https://asgaardlab.github.io/CLIPxGamePhysics/
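As a hedged illustration of the retrieval idea (not the authors' released pipeline), the sketch below scores one gameplay video against a free-text query with the off-the-shelf CLIP model: frames are embedded with the image encoder, the query with the text encoder, and the video is ranked by its best-matching frame. Frame extraction and the specific model variant are assumptions.

```python
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_score(frames, query):
    """Score one gameplay video (a list of PIL frames) against a text query."""
    with torch.no_grad():
        text = model.encode_text(clip.tokenize([query]).to(device))
        images = torch.stack([preprocess(f) for f in frames]).to(device)
        feats = model.encode_image(images)
        # Cosine similarity between the query and every frame embedding.
        text = text / text.norm(dim=-1, keepdim=True)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        sims = (feats @ text.T).squeeze(-1)
    # Rank videos by their best-matching frame: return the top frame score.
    return sims.max().item()
```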
4、[LG] TinyMLOps: Operational Challenges for Widespread Edge AI Adoption
S Leroux, P Simoens, M Lootus, K Thakore, A Sharma
[hotg.ai & Ghent University]
TinyMLOps: operational challenges for widespread edge AI adoption. Deploying machine learning applications on edge devices can bring clear benefits such as improved reliability, latency, and privacy, but it also introduces its own set of challenges. Most work focuses on the limited computational resources of edge platforms, yet this is not the only bottleneck standing in the way of widespread adoption. This paper lists several other challenges a TinyML practitioner may need to consider when operationalizing an application on edge devices. It focuses on tasks such as monitoring and managing the application, which are also common functionality of MLOps platforms, and shows how they are complicated by the distributed nature of edge deployments. It also discusses issues unique to edge applications, such as protecting a model's intellectual property and verifying its integrity. As a field, TinyML is still very young, and most tools and frameworks are at an early stage; the authors hope this paper can inspire and guide the development of TinyMLOps platforms, making TinyML accessible to developers and scalable to billions of edge devices.
Deploying machine learning applications on edge devices can bring clear benefits such as improved reliability, latency and privacy but it also introduces its own set of challenges. Most works focus on the limited computational resources of edge platforms but this is not the only bottleneck standing in the way of widespread adoption. In this paper we list several other challenges that a TinyML practitioner might need to consider when operationalizing an application on edge devices. We focus on tasks such as monitoring and managing the application, common functionality for an MLOps platform, and show how they are complicated by the distributed nature of edge deployment. We also discuss issues that are unique to edge applications such as protecting a model's intellectual property and verifying its integrity.
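One of the listed challenges, verifying a deployed model's integrity, can be made concrete with a generic checksum test. The sketch below is only an illustration of that idea under assumed file names and digests; it is not a technique proposed in the paper.

```python
import hashlib

EXPECTED_SHA256 = "0" * 64  # placeholder for the digest published alongside the model

def model_is_intact(path="model.tflite", expected=EXPECTED_SHA256):
    """Return True if the on-device model file matches the expected SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```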
5、[CL] KinyaBERT: a Morphology-aware Kinyarwanda Language Model
A Nzeyimana, A N Rubungo
[University of Massachusetts Amherst & Polytechnic University of Catalonia]
KinyaBERT: a morphology-aware Kinyarwanda language model. Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding, BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naively sequencing morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and at expressing word-relative syntactic regularities. This paper proposes a simple yet effective two-tier BERT architecture to address these challenges, leveraging a morphological analyzer and explicitly representing morphological compositionality. Despite BERT's success, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. The proposed method is evaluated on the low-resource, morphologically rich Kinyarwanda language, and the resulting model architecture is named KinyaBERT. A robust set of experimental results shows that KinyaBERT outperforms the baselines by 2% in F1 score on a named entity recognition task and by 4.3% in the average score of a machine-translated GLUE benchmark. KinyaBERT fine-tuning also converges better and achieves more robust results across multiple tasks, even in the presence of translation noise. This work demonstrates the effectiveness of explicitly incorporating morphological information in language model pre-training.
Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding – BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability on low-resource languages. We evaluate our proposed method on the low-resource morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveal that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score of a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise.
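To make the two-tier idea concrete, here is a minimal PyTorch sketch with assumed shapes and hyperparameters. It is not the KinyaBERT implementation (which uses richer morphological features); it only illustrates encoding the morphemes of each word into a word vector and then encoding the word sequence.

```python
import torch
import torch.nn as nn

class TwoTierEncoder(nn.Module):
    """Morpheme-level encoder feeding a sentence-level encoder (illustrative only)."""

    def __init__(self, morpheme_vocab=20000, dim=256, n_heads=4):
        super().__init__()
        self.morpheme_emb = nn.Embedding(morpheme_vocab, dim)
        # Tier 1: a small transformer over the morphemes of each word.
        self.morpheme_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=2)
        # Tier 2: a larger transformer over the resulting word vectors.
        self.sentence_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=6)

    def forward(self, morpheme_ids):
        # morpheme_ids: (num_words, max_morphemes_per_word) for one sentence,
        # produced by a morphological analyzer (an assumed preprocessing step).
        morphemes = self.morpheme_emb(morpheme_ids)               # (W, M, dim)
        word_vecs = self.morpheme_encoder(morphemes).mean(dim=1)  # pool per word
        return self.sentence_encoder(word_vecs.unsqueeze(0))      # (1, W, dim)
```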
Several other papers worth noting:
[LG] Accelerated Bayesian SED Modeling using Amortized Neural Posterior Estimation
C Hahn, P Melchior
[Princeton University]
[LG] Electronic excited states in deep variational Monte Carlo
M Entwistle, Z Schätzle, P A. Erdman, J Hermann, F Noé
[FU Berlin]
[CV] ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer
R Yang, H Ma, J Wu, Y Tang, X Xiao, M Zheng, X Li
[Tsinghua University & ByteDance Inc]
[CV] Attention Bottlenecks for Multimodal Fusion
A Nagrani, S Yang, A Arnab, A Jansen, C Schmid, C Sun
[Google Research]