LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics



1. [AS] AudioLM: a Language Modeling Approach to Audio Generation

Z Borsos, R Marinier, D Vincent, E Kharitonov, O Pietquin, M Sharifi, O Teboul, D Grangier, M Tagliasacchi, N Zeghidour
[Google Research]

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
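
The idea of mapping audio to discrete tokens can be illustrated with a toy nearest-centroid quantizer. This is only a minimal sketch: the abstract says the activations are discretized but does not state how, so the nearest-centroid rule, the function name `quantize`, and the tiny 2-D "frames" are all assumptions made for illustration.

```python
import numpy as np

def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    frames:    (T, D) array of per-frame features (hypothetical stand-in
               for model activations).
    centroids: (K, D) array forming a hypothetical codebook.
    Returns a length-T array of integer token ids in [0, K).
    """
    frames = np.asarray(frames, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

tokens = quantize([[0.0, 0.0], [1.0, 1.0], [0.9, 1.2]],
                  [[0.0, 0.0], [1.0, 1.0]])
```

Once audio is a token sequence like this, generating a continuation becomes next-token prediction, which is the language modeling framing the abstract describes.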



2. [CV] Volume Rendering Digest (for NeRF)

A Tagliasacchi, B Mildenhall
[Google Research]

Neural Radiance Fields employ simple volume rendering as a way to overcome the challenges of differentiating through ray-triangle intersections by leveraging a probabilistic notion of visibility. This is achieved by assuming the scene is composed of a cloud of light-emitting particles whose density changes in space. This technical report summarizes the derivations for differentiable volume rendering. It is a condensed version of previous reports, but rewritten in the context of NeRF, and adopting its commonly used notation.
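
As a rough illustration of the probabilistic visibility idea, the sketch below composites samples along one ray using the discrete quadrature commonly used with NeRF: each segment has opacity 1 - exp(-sigma * delta), and is weighted by the transmittance accumulated before it. The function name `render_ray` and its argument layout are assumptions for this example, and the report's own notation may differ.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite N samples along a ray into one color.

    sigmas: (N,) densities at the samples.
    colors: (N, 3) RGB emitted at the samples.
    deltas: (N,) lengths of the ray segments.
    Returns (rgb, weights), where weights[i] is the probability that
    the ray terminates in segment i.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    optical_depth = sigmas * deltas
    # Opacity of each segment taken on its own.
    alphas = 1.0 - np.exp(-optical_depth)
    # Transmittance: probability of reaching segment i unoccluded
    # (exclusive cumulative sum of the optical depth).
    accumulated = np.concatenate([[0.0], np.cumsum(optical_depth)[:-1]])
    transmittance = np.exp(-accumulated)
    weights = transmittance * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

Every step is a smooth function of the densities and colors, which is what makes the rendering differentiable. The weights sum to 1 - exp(-total optical depth), so they never exceed 1.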



3. [RO] Multi-skill Mobile Manipulation for Object Rearrangement

J Gu, D S Chaplot, H Su, J Malik
[UC San Diego & Meta AI Research]

We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location where a stationary manipulation skill cannot reach its target to manipulate. To this end, we propose that the manipulation skills should include mobility to have flexibility in interacting with the target object from multiple locations and at the same time the navigation skill could have multiple end points which lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and training a navigation skill with region goals instead of point goals. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance as compared to the baselines.
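
The difference between a point goal and a region goal can be sketched as two success checks. This is a toy interpretation rather than the paper's implementation: the function names, the 2-D positions, and the 0.5 tolerance are all assumptions for illustration.

```python
import math

def reached_point_goal(position, goal, tol=0.5):
    """Success only if the robot stops near one specific point."""
    return math.dist(position, goal) <= tol

def reached_region_goal(position, region, tol=0.5):
    """Success if the robot stops near ANY acceptable end point.

    `region` is a hypothetical list of positions from which the
    follow-up manipulation skill is expected to succeed.
    """
    return any(math.dist(position, g) <= tol for g in region)

robot = (2.0, 0.0)
single_goal = (0.0, 0.0)
goal_region = [(0.0, 0.0), (2.0, 0.2), (1.0, 1.0)]
point_ok = reached_point_goal(robot, single_goal)
region_ok = reached_region_goal(robot, goal_region)
```

Here the robot misses the single point goal but lands next to another acceptable end point, so only the region goal counts the navigation as successful.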



4. [CV] Measuring the Interpretability of Unsupervised Representations via Quantized Reverse Probing

I Laina, Y M. Asano, A Vedaldi
[University of Oxford & University of Amsterdam]

Self-supervised visual representation learning has recently attracted significant research interest. While a common way to evaluate self-supervised representations is through transfer to various downstream tasks, we instead investigate the problem of measuring their interpretability, i.e. understanding the semantics encoded in raw representations. We formulate the latter as estimating the mutual information between the representation and a space of manually labelled concepts. To quantify this we introduce a decoding bottleneck: information must be captured by simple predictors, mapping concepts to clusters in representation space. This approach, which we call reverse linear probing, provides a single number sensitive to the semanticity of the representation. This measure is also able to detect when the representation contains combinations of concepts (e.g., “red apple”) instead of just individual attributes (“red” and “apple” independently). Finally, we propose to use supervised classifiers to automatically label large datasets in order to enrich the space of concepts used for probing. We use our method to evaluate a large number of self-supervised representations, ranking them by interpretability, highlight the differences that emerge compared to the standard evaluation with linear probes and discuss several qualitative insights.
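
To make the mutual-information framing concrete, the sketch below computes the plain empirical mutual information (in nats) between discrete cluster assignments and concept labels. It is a simplified stand-in, not the paper's estimator: it assumes the representation has already been quantized into cluster ids, and the function name `mutual_information` is hypothetical.

```python
import numpy as np

def mutual_information(cluster_ids, concept_labels):
    """Empirical mutual information between two discrete labelings."""
    cluster_ids = np.asarray(cluster_ids).ravel()
    concept_labels = np.asarray(concept_labels).ravel()
    # Re-index both labelings to 0..K-1 so they can index a table.
    _, c = np.unique(cluster_ids, return_inverse=True)
    _, y = np.unique(concept_labels, return_inverse=True)
    # Joint distribution from the contingency table of co-occurrences.
    joint = np.zeros((c.max() + 1, y.max() + 1))
    np.add.at(joint, (c, y), 1.0)
    joint /= joint.sum()
    # Product of the marginals, i.e. the joint under independence.
    independent = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float((joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])).sum())

aligned = mutual_information([0, 0, 1, 1], ["cat", "cat", "dog", "dog"])
unrelated = mutual_information([0, 0, 1, 1], ["cat", "dog", "cat", "dog"])
```

When clusters line up with concepts, the score reaches the label entropy (log 2 for two balanced classes). When they are unrelated, it drops to zero.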



5. [CV] Morphology-preserving Autoregressive 3D Generative Modelling of the Brain

P Tudosiu, W H L Pinaya, M S. Graham, P Borges...
[King’s College London & NVIDIA & DeepMind]

Human anatomy, morphology, and associated diseases can be studied using medical imaging data. However, access to medical imaging data is restricted by governance and privacy concerns, data ownership, and the cost of acquisition, thus limiting our ability to understand the human body. A possible solution to this issue is the creation of a model able to learn and then generate synthetic images of the human body conditioned on specific characteristics of relevance (e.g., age, sex, and disease status). Deep generative models, in the form of neural networks, have been recently used to create synthetic 2D images of natural scenes. Still, the ability to produce high-resolution 3D volumetric imaging data with correct anatomical morphology has been hampered by data scarcity and algorithmic and computational limitations. This work proposes a generative model that can be scaled to produce anatomically correct, high-resolution, and realistic images of the human brain, with the necessary quality to allow further downstream analyses. The ability to generate a potentially unlimited amount of data not only enables large-scale studies of human anatomy and pathology without jeopardizing patient privacy, but also significantly advances research in the field of anomaly detection, modality synthesis, learning under limited data, and fair and ethical AI. Code and trained models are available at: https://github.com/AmigoLab/SynthAnatomy.



[CV] Transformers in Remote Sensing: A Survey

A A Aleissaee, A Kumar, R M Anwer, S Khan, H Cholakkal, G Xia, F S Khan
[MBZ University of Artificial Intelligence & Wuhan University]


[CV] MMV_Im2Im: An Open Source Microscopy Machine Vision Toolbox for Image-to-Image Transformation

J Sonneck, J Chen
[Leibniz-Institut für Analytische Wissenschaften]


[LG] On free energy barriers in Gaussian priors and failure of MCMC for high-dimensional unimodal distributions

A S. Bandeira, A Maillard, R Nickl, S Wang
[ETH Zürich & University of Cambridge & MIT]


[CL] On the Effectiveness of Compact Biomedical Transformers

O Rohanian, M Nouriborji, S Kouchaki, D A. Clifton
[University of Oxford & NLPie Research]


