LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: a language modeling approach to audio generation; differentiable volume rendering (for NeRF); multi-skill mobile manipulation for object rearrangement; measuring the interpretability of unsupervised representations via quantized reverse probing; morphology-preserving autoregressive 3D generative modeling of the brain; a survey of Transformers in remote sensing; an open-source microscopy machine-vision toolbox for image-to-image transformation; free-energy barriers in Gaussian priors and failure of MCMC for high-dimensional unimodal distributions; the effectiveness of compact biomedical Transformers

1. [AS] AudioLM: a Language Modeling Approach to Audio Generation

Z Borsos, R Marinier, D Vincent, E Kharitonov, O Pietquin, M Sharifi, O Teboul, D Grangier, M Tagliasacchi, N Zeghidour
[Google Research]
AudioLM: a language modeling approach to audio generation. This paper proposes AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. The paper shows how existing audio tokenizers trade off reconstruction quality against long-term structure, and proposes a hybrid tokenization scheme that achieves both objectives: the discretized activations of a masked language model pretrained on audio capture long-term structure, while the discrete codes produced by a neural audio codec enable high-quality synthesis. Trained on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations from short prompts. When trained on speech, without any transcripts or annotations, AudioLM generates syntactically and semantically plausible speech continuations while preserving the identity and prosody of unseen speakers. The paper also shows that the approach extends beyond speech, generating coherent piano-music continuations despite being trained without any symbolic representation of music.

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
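The hybrid tokenization idea can be sketched with toy stand-ins. This is a minimal illustration only: the per-frame variance and mean quantizers below are hypothetical placeholders for the masked-LM "semantic" tokenizer and the neural-codec "acoustic" tokenizer the paper actually uses.

```python
import numpy as np

def semantic_tokens(wave, n_clusters=8, frame=160):
    """Coarse tokens for long-term structure.
    Toy stand-in: quantize per-frame energy into a small codebook
    (in AudioLM this role is played by discretized masked-LM activations)."""
    frames = wave[: len(wave) // frame * frame].reshape(-1, frame)
    energy = frames.var(axis=1)
    bins = np.linspace(energy.min(), energy.max() + 1e-9, n_clusters + 1)
    return np.clip(np.digitize(energy, bins) - 1, 0, n_clusters - 1)

def acoustic_tokens(wave, levels=256, frame=40):
    """Fine tokens for synthesis quality.
    Toy stand-in: scalar-quantize the per-frame mean of a [-1, 1] waveform
    (in AudioLM this role is played by neural-codec codes)."""
    frames = wave[: len(wave) // frame * frame].reshape(-1, frame)
    mean = frames.mean(axis=1)
    return np.clip(((mean + 1) / 2 * (levels - 1)).astype(int), 0, levels - 1)
```

A language model would then be trained over such discrete sequences, with the coarse stream providing long-range coherence and the fine stream providing audio fidelity.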

https://arxiv.org/abs/2209.03143

 

2. [CV] Volume Rendering Digest (for NeRF)

A Tagliasacchi, B Mildenhall
[Google Research]
Differentiable volume rendering (for NeRF). Neural Radiance Fields employ simple volume rendering to overcome the difficulty of differentiating through ray-triangle intersections, leveraging a probabilistic notion of visibility. This is achieved by assuming the scene is composed of a cloud of light-emitting particles whose density varies in space. This technical report summarizes the derivations for differentiable volume rendering. It is a condensed version of previous reports, rewritten in the context of NeRF and adopting its commonly used notation.

Neural Radiance Fields employ simple volume rendering as a way to overcome the challenges of differentiating through ray-triangle intersections by leveraging a probabilistic notion of visibility. This is achieved by assuming the scene is composed by a cloud of light-emitting particles whose density changes in space. This technical report summarizes the derivations for differentiable volume rendering. It is a condensed version of previous reports, but rewritten in the context of NeRF, and adopting its commonly used notation.
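The derivation bottoms out in the standard NeRF quadrature: along a ray with sample densities σ_i, colors c_i, and segment lengths δ_i, the rendered color is C = Σ_i T_i (1 − e^{−σ_i δ_i}) c_i with transmittance T_i = exp(−Σ_{j<i} σ_j δ_j). A minimal sketch (variable names are illustrative, not from the report):

```python
import numpy as np

def render_ray(sigma, color, t):
    """Quadrature of the volume rendering integral along one ray.
    sigma: (N,) densities at the samples
    color: (N, 3) emitted colors at the samples
    t:     (N+1,) sample positions along the ray"""
    delta = np.diff(t)                    # segment lengths delta_i
    alpha = 1.0 - np.exp(-sigma * delta)  # per-segment opacity
    # transmittance to each sample: product of (1 - alpha) over earlier segments
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = T * alpha
    return (weights[:, None] * color).sum(axis=0), weights
```

An opaque first segment absorbs all the weight, while zero density everywhere yields a zero-weight (empty) ray, matching the probabilistic-visibility interpretation.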

https://arxiv.org/abs/2209.02417

 

3. [RO] Multi-skill Mobile Manipulation for Object Rearrangement

J Gu, D S Chaplot, H Su, J Malik
[UC San Diego & Meta AI Research]
Multi-skill mobile manipulation for object rearrangement. This paper studies a modular approach to long-horizon mobile manipulation tasks for object rearrangement, decomposing a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, each learned individually on its subtask. Although more effective than monolithic end-to-end reinforcement learning policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location from which a stationary manipulation skill cannot reach its target. The paper therefore proposes that manipulation skills should include mobility, allowing flexible interaction with the target object from multiple locations, and that the navigation skill should admit multiple end points that all lead to successful manipulation. These ideas are operationalized by implementing mobile rather than stationary manipulation skills, and by training a navigation skill with region goals instead of point goals. The proposed multi-skill mobile manipulation method M3 is evaluated on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), showing superior performance compared to the baselines.

We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location where a stationary manipulation skill can not reach its target to manipulate. To this end, we propose that the manipulation skills should include mobility to have flexibility in interacting with the target object from multiple locations and at the same time the navigation skill could have multiple end points which lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and training a navigation skill trained with region goal instead of point goal. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance as compared to the baselines.
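The difference between point goals and region goals can be illustrated with a toy success check. This is purely illustrative (the functions, thresholds, and 2D geometry below are hypothetical, not the paper's implementation): a point goal accepts only a small neighborhood of one point, while a region goal accepts any pose from which a downstream manipulation skill could plausibly succeed.

```python
import numpy as np

def point_goal_success(pos, goal, eps=0.1):
    """Point-goal navigation: succeed only within eps of a single point."""
    return bool(np.linalg.norm(np.asarray(pos) - np.asarray(goal)) < eps)

def region_goal_success(pos, region_centers, radius=0.5):
    """Region-goal navigation: succeed anywhere in a region around the
    object from which manipulation can start, reducing chaining errors."""
    d = np.linalg.norm(np.asarray(region_centers) - np.asarray(pos), axis=1)
    return bool((d < radius).any())
```

Under this view, a mobile manipulation skill further relaxes the requirement: even an imperfect navigation end point can be corrected by moving the base during manipulation.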

https://arxiv.org/abs/2209.02778

 

4. [CV] Measuring the Interpretability of Unsupervised Representations via Quantized Reverse Probing

I Laina, Y M. Asano, A Vedaldi
[University of Oxford & University of Amsterdam]
Measuring the interpretability of unsupervised representations via quantized reverse probing. Self-supervised visual representation learning has recently attracted significant research interest. While self-supervised representations are commonly evaluated by transfer to various downstream tasks, this paper instead studies the problem of measuring their interpretability, i.e., understanding the semantics encoded in raw representations. The latter is formulated as estimating the mutual information between the representation and a space of manually labeled concepts. To quantify this, a decoding bottleneck is introduced: information must be captured by simple predictors that map concepts to clusters in representation space. This approach, called reverse linear probing, yields a single number sensitive to the semanticity of the representation. The measure can also detect when a representation contains combinations of concepts (e.g., "red apple") rather than just individual attributes ("red" and "apple" separately). The paper further proposes using supervised classifiers to automatically label large datasets, enriching the space of concepts used for probing. The method is used to evaluate a large number of self-supervised representations, ranking them by interpretability, highlighting differences that emerge compared with the standard linear-probe evaluation, and discussing several qualitative insights.

Self-supervised visual representation learning has recently attracted significant research interest. While a common way to evaluate self-supervised representations is through transfer to various downstream tasks, we instead investigate the problem of measuring their interpretability, i.e. understanding the semantics encoded in raw representations. We formulate the latter as estimating the mutual information between the representation and a space of manually labelled concepts. To quantify this we introduce a decoding bottleneck: information must be captured by simple predictors, mapping concepts to clusters in representation space. This approach, which we call reverse linear probing, provides a single number sensitive to the semanticity of the representation. This measure is also able to detect when the representation contains combinations of concepts (e.g., “red apple”) instead of just individual attributes (“red” and “apple” independently). Finally, we propose to use supervised classifiers to automatically label large datasets in order to enrich the space of concepts used for probing. We use our method to evaluate a large number of self-supervised representations, ranking them by interpretability, highlight the differences that emerge compared to the standard evaluation with linear probes and discuss several qualitative insights.
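A minimal sketch of the idea, assuming plain k-means as the quantizer and a per-concept majority vote as the "simple predictor"; the paper's actual probing setup differs in detail, and all names below are illustrative:

```python
import numpy as np

def quantize(reps, n_clusters=4, iters=20, seed=0):
    """Vector-quantize float representations with plain k-means
    (a stand-in for the clustering step of quantized reverse probing)."""
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((reps[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = reps[assign == k].mean(axis=0)
    return assign

def reverse_probe_score(concepts, clusters):
    """Map each concept label to its majority cluster and report how often
    that concept->cluster predictor is right: a crude proxy for the mutual
    information between representation clusters and concepts."""
    concepts = np.asarray(concepts)
    clusters = np.asarray(clusters)
    hits = 0
    for c in np.unique(concepts):
        hits += np.bincount(clusters[concepts == c]).max()
    return hits / len(concepts)
```

A score of 1.0 means every concept lands in its own cluster of the representation space; chance level depends on the cluster sizes.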

https://arxiv.org/abs/2209.03268

 

5. [CV] Morphology-preserving Autoregressive 3D Generative Modelling of the Brain

P Tudosiu, W H L Pinaya, M S. Graham, P Borges...
[King’s College London & NVIDIA & DeepMind]
Morphology-preserving autoregressive 3D generative modeling of the brain. Human anatomy, morphology, and associated diseases can be studied using medical imaging data. However, access to medical imaging data is restricted by governance and privacy concerns, data ownership, and acquisition cost, limiting our ability to understand the human body. A possible solution is a model that learns to generate synthetic images of the human body conditioned on specific characteristics of relevance (e.g., age, sex, and disease status). Deep generative models, in the form of neural networks, have recently been used to create synthetic 2D images of natural scenes, but the ability to produce high-resolution 3D volumetric imaging data with correct anatomical morphology has been hampered by data scarcity and by algorithmic and computational limitations. This work proposes a generative model that can be scaled to produce anatomically correct, high-resolution, realistic images of the human brain, with the quality needed for further downstream analyses. The ability to generate a potentially unlimited amount of data enables large-scale studies of human anatomy and pathology without jeopardizing patient privacy, and significantly advances research in anomaly detection, modality synthesis, learning under limited data, and fair and ethical AI.

Human anatomy, morphology, and associated diseases can be studied using medical imaging data. However, access to medical imaging data is restricted by governance and privacy concerns, data ownership, and the cost of acquisition, thus limiting our ability to understand the human body. A possible solution to this issue is the creation of a model able to learn and then generate synthetic images of the human body conditioned on specific characteristics of relevance (e.g., age, sex, and disease status). Deep generative models, in the form of neural networks, have been recently used to create synthetic 2D images of natural scenes. Still, the ability to produce high-resolution 3D volumetric imaging data with correct anatomical morphology has been hampered by data scarcity and algorithmic and computational limitations. This work proposes a generative model that can be scaled to produce anatomically correct, high-resolution, and realistic images of the human brain, with the necessary quality to allow further downstream analyses. The ability to generate a potentially unlimited amount of data not only enables large-scale studies of human anatomy and pathology without jeopardizing patient privacy, but also significantly advances research in the field of anomaly detection, modality synthesis, learning under limited data, and fair and ethical AI. Code and trained models are available at: https://github.com/AmigoLab/SynthAnatomy.

https://arxiv.org/abs/2209.03177

Several other papers worth noting:

[CV] Transformers in Remote Sensing: A Survey

A survey of Transformers in remote sensing
A A Aleissaee, A Kumar, R M Anwer, S Khan, H Cholakkal, G Xia, F S Khan
[MBZ University of Artificial Intelligence & Wuhan University]
https://arxiv.org/abs/2209.01206

 

[CV] MMV_Im2Im: An Open Source Microscopy Machine Vision Toolbox for Image-to-Image Transformation

MMV_Im2Im: an open-source microscopy machine-vision toolbox for image-to-image transformation
J Sonneck, J Chen
[Leibniz-Institut für Analytische Wissenschaften]
https://arxiv.org/abs/2209.02498

 

[LG] On free energy barriers in Gaussian priors and failure of MCMC for high-dimensional unimodal distributions

On free-energy barriers in Gaussian priors and failure of MCMC for high-dimensional unimodal distributions
A S. Bandeira, A Maillard, R Nickl, S Wang
[ETH Zürich & University of Cambridge & MIT]
https://arxiv.org/abs/2209.02001

 

[CL] On the Effectiveness of Compact Biomedical Transformers

On the effectiveness of compact biomedical Transformers
O Rohanian, M Nouriborji, S Kouchaki, D A. Clifton
[University of Oxford & NLPie Research]
https://arxiv.org/abs/2209.03182

 

 

If any images included in this content raise copyright concerns, please contact us promptly for removal.