爱可可 AI Frontier Picks (11.14)

LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics

Reposted from 爱可可爱生活

1. [CV] A Survey of Visual Transformers

Y Liu, Y Zhang, Y Wang, F Hou, J Yuan, J Tian, Y Zhang, Z Shi, J Fan, Z He

[Chinese Academy of Sciences & Southeast University & Lenovo Research]

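A survey of visual Transformers. The Transformer, an attention-based encoder-decoder architecture, has reshaped natural language processing, and recent pioneering work adapting it to computer vision (CV) has proven effective across CV tasks, with strong results on ImageNet, COCO, and ADE20k relative to modern convolutional neural networks (CNNs). The survey reviews more than one hundred visual Transformers for three fundamental tasks (classification, detection, and segmentation) and organizes them in a taxonomy by motivation, structure, and usage scenario. Because training settings and target tasks differ, the methods are also compared under several configurations rather than on benchmark numbers alone. The authors highlight important but unexploited aspects, such as slack high-level semantic embeddings that could bridge the gap between visual and sequential Transformers, and propose three promising directions for future research.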

Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently been done on adapting Transformer-like architectures to Computer Vision (CV) fields, which have demonstrated their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance on multiple benchmarks such as ImageNet, COCO and ADE20k as compared with modern Convolutional Neural Networks (CNN). In this paper, we have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), where a taxonomy is proposed to organize these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and oriented tasks, we have also evaluated these methods on different configurations for easy and intuitive comparison instead of only various benchmarks. Furthermore, we have revealed a series of essential but unexploited aspects that may empower Transformer to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.

https://weibo.com/1402400261/L1nSc24Qy

2. [LG] Properties from Mechanisms: An Equivariance Perspective on Identifiable Representation Learning

K Ahuja, J Hartford, Y Bengio

[Mila]

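Identifying latent properties from mechanisms: an equivariance perspective on identifiable representation learning. A key goal of unsupervised representation learning is to "invert" the data-generating process and recover its latent properties. Existing work that provably does so relies on strong assumptions about relationships between the latent variables (for example, independence conditional on auxiliary information). This paper asks instead whether latent properties can be identified from knowledge of the mechanisms that govern their evolution, and it fully characterizes the sources of non-identifiability as that knowledge varies. In particular, if the exact mechanisms are known, identification is achievable up to any equivariances shared by those mechanisms. The characterization extends to settings where only a hypothesis class over possible mechanisms is known and to settings with stochastic mechanisms. The results generalize existing identifiable representation learning results and suggest that inductive biases on mechanisms can yield a range of new identifiable methods.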

A key goal of unsupervised representation learning is “inverting” a data generating process to recover its latent properties. Existing work that provably achieves this goal relies on strong assumptions on relationships between the latent variables (e.g., independence conditional on auxiliary information). In this paper, we take a very different perspective on the problem and ask, “Can we instead identify latent properties by leveraging knowledge of the mechanisms that govern their evolution?” We provide a complete characterization of the sources of non-identifiability as we vary knowledge about a set of possible mechanisms. In particular, we prove that if we know the exact mechanisms under which the latent properties evolve, then identification can be achieved up to any equivariances that are shared by the underlying mechanisms. We generalize this characterization to settings where we only know some hypothesis class over possible mechanisms, as well as settings where the mechanisms are stochastic. We demonstrate the power of this mechanism-based perspective by showing that we can leverage our results to generalize existing identifiable representation learning results. These results suggest that by exploiting inductive biases on mechanisms, it is possible to design a range of new identifiable representation learning approaches.
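
As a toy illustration of what "up to equivariances" means here, the sketch below assumes a deterministic mechanism that rotates a 2-D latent by a fixed angle. This setup, and every name in it, is an assumption made for illustration and is not taken from the paper. A relabeling of the latent that commutes with the mechanism evolves under exactly the same rule, so knowledge of the mechanism alone cannot rule it out; a relabeling that does not commute can be ruled out.

```python
import math

def rotation(theta):
    """2x2 rotation matrix as nested tuples."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def apply(matrix, z):
    """Apply a 2x2 matrix to a 2-D point."""
    (a, b), (c, d) = matrix
    return (a * z[0] + b * z[1], c * z[0] + d * z[1])

def commutes_at(candidate, mechanism, z, tol=1e-9):
    """Check candidate(mechanism(z)) == mechanism(candidate(z)) at one sample point."""
    lhs = apply(candidate, apply(mechanism, z))
    rhs = apply(mechanism, apply(candidate, z))
    return all(abs(l - r) <= tol for l, r in zip(lhs, rhs))

# Hypothetical mechanism: the latent rotates by 0.3 rad per time step.
mechanism = rotation(0.3)

# Another rotation commutes with the mechanism, so it is an equivariance:
# the relabeled latent follows the same dynamics as the true one.
relabeling = rotation(1.1)

# A reflection does not commute with this rotation, so it is not an equivariance.
reflection = ((1.0, 0.0), (0.0, -1.0))

z = (0.7, -0.2)
rotation_is_equivariance = commutes_at(relabeling, mechanism, z)
reflection_is_equivariance = commutes_at(reflection, mechanism, z)
```

In this toy setting the latent would be recoverable only up to a rotation, which is the flavor of ambiguity the abstract describes; the paper's actual statements are more general.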

https://weibo.com/1402400261/L1nWoEl9h

3. [AS] Hybrid Spectrogram and Waveform Source Separation

A Défossez

[Facebook AI Research]

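Hybrid spectrogram and waveform source separation. Source separation models work in either the spectrogram domain or the waveform domain. This paper shows how to perform end-to-end hybrid separation, letting the model decide which domain best suits each source, or combine both. The proposed hybrid version of Demucs won the Music Demixing Challenge 2021 organized by Sony, and it also adds compressed residual branches, local attention, and singular value regularization. On the MusDB HQ dataset, the signal-to-distortion ratio (SDR) improves by 1.4 dB across all sources, a gain confirmed by human subjective evaluation: overall quality is rated 2.83 (against 2.36 for the non-hybrid Demucs) and absence of contamination 3.04.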

Source separation models either work on the spectrogram or waveform domain. In this work, we show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both. The proposed hybrid version of the Demucs architecture (Défossez et al., 2019) won the Music Demixing Challenge 2021 organized by Sony. This architecture also comes with additional improvements, such as compressed residual branches, local attention or singular value regularization. Overall, a 1.4 dB improvement of the Signal-to-Distortion Ratio (SDR) was observed across all sources as measured on the MusDB HQ dataset (Rafii et al., 2019), an improvement confirmed by human subjective evaluation, with an overall quality rated at 2.83 out of 5 (2.36 for the non-hybrid Demucs), and absence of contamination at 3.04 (against 2.37 for the non-hybrid Demucs and 2.44 for the second ranking model submitted at the competition).
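
To make the 1.4 dB figure concrete, here is a minimal sketch of a signal-to-distortion ratio using the simplest textbook definition, 10·log10(‖s‖² / ‖s − ŝ‖²). The exact evaluation protocol behind the reported number may differ (for example, BSS Eval style variants), and the signals below are synthetic assumptions, not data from the paper.

```python
import math
import random

def sdr_db(reference, estimate, eps=1e-10):
    """SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2), with eps for numerical safety."""
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))

random.seed(0)
n = 8000
# Synthetic "true source": a sine tone (illustrative only).
source = [math.sin(2 * math.pi * 440 * i / n) for i in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]

# Two hypothetical estimates: the second has its error amplitude scaled
# by 10**(-1.4/20), i.e. its error power reduced by 1.4 dB.
scale = 10 ** (-1.4 / 20)
coarse = [s + 0.1 * z for s, z in zip(source, noise)]
better = [s + 0.1 * scale * z for s, z in zip(source, noise)]

gain_db = sdr_db(source, better) - sdr_db(source, coarse)
```

Under this definition a 1.4 dB gain corresponds to cutting the residual error power by a factor of about 10^0.14 ≈ 1.38.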

https://weibo.com/1402400261/L1o1usQZm

4. [CL] Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

L He, S Zhang, L Wu, H Xia, F Ju, H Zhang, S Liu, Y Xia, J Zhu, P Deng, B Shao, T Qin, T Liu

[Microsoft Research Asia & Nanyang Technological University & Xi’an Jiaotong University & Sun Yat-sen University]

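Pre-training co-evolutionary protein representations with a pairwise masked language model. Understanding protein sequences matters for biology, healthcare, and medicine, but labeling is expensive and slow, while low-cost, high-throughput sequencing makes unlabeled data grow much faster than labeled data. The key problem in protein sequence representation learning is capturing the co-evolutionary information reflected in inter-residue co-variation. Rather than relying on multiple sequence alignment (MSA), as is usual, the paper captures this information directly by pre-training a dedicated Pairwise Masked Language Model (PMLM). A conventional masked language model predicts each masked token (amino acid residue) from the unmasked tokens only and treats masked tokens as independent of one another; PMLM models the dependency among masked tokens, so the probability of a token pair is not the product of the two token probabilities. Experiments show that the method captures inter-residue correlations and improves contact prediction by up to 9% over the MLM baseline under the same setting. When pre-trained on a subset of the sequence database from which the MSA is generated, it also outperforms the MSA baseline by more than 7% on the TAPE contact prediction benchmark, suggesting that sequence pre-training can surpass MSA-based methods in general.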

Understanding protein sequences is vital and urgent for biology, healthcare, and medicine. Labeling approaches are expensive and time-consuming, while the amount of unlabeled data is increasing much faster than that of the labeled data due to low-cost, high-throughput sequencing methods. In order to extract knowledge from these unlabeled data, representation learning is of significant value for protein-related tasks and has great potential for helping us learn more about protein functions and structures. The key problem in the protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences. Instead of leveraging multiple sequence alignment as is usually done, we propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., Pairwise Masked Language Model (PMLM). In a conventional masked language model, the masked tokens (i.e. amino acid residues) are modeled by conditioning on the unmasked tokens only, but processed independently of each other. However, our proposed PMLM takes the dependency among masked tokens into consideration, i.e., the probability of a token pair is not equal to the product of the probability of the two tokens. By applying this model, the pre-trained encoder is able to generate a better representation for protein sequences. Our result shows that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the MLM baseline under the same setting. The proposed model also significantly outperforms the MSA baseline by more than 7% on the TAPE contact prediction benchmark when pretrained on a subset of the sequence database which the MSA is generated from, revealing the potential of the sequence pre-training method to surpass MSA based methods in general.
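
The claim that a pair probability need not equal the product of the two token probabilities can be checked on a toy example. The sketch below uses a hypothetical joint distribution over two masked positions and a three-letter alphabet (both are assumptions for illustration; real proteins use 20 amino acids, and this is not the paper's model). It compares the joint with the product of its marginals, which is all an independent per-token prediction could represent.

```python
import math
from itertools import product

ALPHABET = ("A", "C", "D")  # toy alphabet, illustrative only

# Hypothetical joint distribution for two co-varying masked residues:
# the positions strongly prefer complementary pairs.
joint = {pair: 0.0 for pair in product(ALPHABET, repeat=2)}
joint[("A", "C")] = 0.4
joint[("C", "A")] = 0.4
joint[("D", "D")] = 0.2

def marginals(joint_probs):
    """Per-position distributions, i.e. what independent per-token heads would model."""
    first = {a: 0.0 for a in ALPHABET}
    second = {a: 0.0 for a in ALPHABET}
    for (a, b), p in joint_probs.items():
        first[a] += p
        second[b] += p
    return first, second

def mutual_information(joint_probs):
    """Mutual information in nats; it is zero exactly when the joint factorizes."""
    first, second = marginals(joint_probs)
    total = 0.0
    for (a, b), p in joint_probs.items():
        if p > 0.0:
            total += p * math.log(p / (first[a] * second[b]))
    return total

first, second = marginals(joint)
# Product-of-marginals approximation of each pair probability.
independent = {(a, b): first[a] * second[b] for a, b in joint}

pair = ("A", "C")
gap = joint[pair] - independent[pair]
mi = mutual_information(joint)
```

Here the independent approximation assigns the pair ("A", "C") a much lower probability than the joint does, and the positive mutual information is the co-variation signal that a pairwise objective can target directly.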

https://weibo.com/1402400261/L1o5DANU1

5. [CV] Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

S Guan, J Xu, M Z. He, Y Wang, B Ni, X Yang

[Shanghai Jiao Tong University]

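Out-of-domain human mesh reconstruction via dynamic bilevel online adaptation. The paper considers a new problem: adapting a human mesh reconstruction model to out-of-domain streaming video, where existing SMPL-based models suffer markedly from distribution shift in camera parameters, bone lengths, backgrounds, and occlusions. It addresses the problem through online adaptation, gradually correcting model bias at test time. Two challenges stand out. First, the lack of 3D annotations makes training harder and causes 3D ambiguity. Second, the non-stationary data distribution makes it hard to balance fitting regular frames against fitting hard samples with severe occlusion or dramatic changes. The proposed Dynamic Bilevel Online Adaptation algorithm (DynaBOA) introduces temporal constraints to compensate for the missing 3D annotations and uses bilevel optimization to resolve conflicts between objectives. It gains extra 3D guidance by co-training with similar source examples retrieved despite the distribution shift, and it adaptively adjusts the number of optimization steps per frame so that hard samples are fully fitted without overfitting regular frames. DynaBOA achieves state-of-the-art results on three out-of-domain human mesh reconstruction benchmarks.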

We consider a new problem of adapting a human mesh reconstruction model to out-of-domain streaming videos, where the performance of existing SMPL-based models is significantly affected by the distribution shift represented by different camera parameters, bone lengths, backgrounds, and occlusions. We tackle this problem through online adaptation, gradually correcting the model bias during testing. There are two main challenges: First, the lack of 3D annotations increases the training difficulty and results in 3D ambiguities. Second, non-stationary data distribution makes it difficult to strike a balance between fitting regular frames and hard samples with severe occlusions or dramatic changes. To this end, we propose the Dynamic Bilevel Online Adaptation algorithm (DynaBOA). It first introduces the temporal constraints to compensate for the unavailable 3D annotations, and leverages a bilevel optimization procedure to address the conflicts between multi-objectives. DynaBOA provides additional 3D guidance by co-training with similar source examples retrieved efficiently despite the distribution shift. Furthermore, it can adaptively adjust the number of optimization steps on individual frames to fully fit hard samples and avoid overfitting regular frames. DynaBOA achieves state-of-the-art results on three out-of-domain human mesh reconstruction benchmarks.
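
The idea of spending more optimization steps on hard frames than on regular ones can be sketched with a toy online-adaptation loop. This is not the DynaBOA algorithm: the scalar "model", the squared-error loss, and the loss-threshold stopping rule are all assumptions chosen to keep the example small, and the abstract does not say how the step count is actually decided.

```python
def adapt_on_frame(param, target, lr=0.25, loss_threshold=0.01, max_steps=20):
    """Run gradient steps on one frame until the loss is small enough or a cap is hit.

    Hypothetical stopping rule: a fixed loss threshold. Returns the updated
    parameter and the number of steps actually taken.
    """
    steps = 0
    loss = (param - target) ** 2
    while loss > loss_threshold and steps < max_steps:
        grad = 2.0 * (param - target)  # gradient of the squared error
        param -= lr * grad
        steps += 1
        loss = (param - target) ** 2
    return param, steps

# A toy stream: three "regular" frames near the current model, then an abrupt
# change (a "hard" frame), then a frame close to the new regime.
frame_targets = [1.0, 1.05, 1.08, 3.0, 3.05]

param = 1.0
steps_per_frame = []
for target in frame_targets:
    param, steps = adapt_on_frame(param, target)
    steps_per_frame.append(steps)
```

With these toy values the regular frames trigger no updates while the abrupt frame receives several, which mirrors the stated goal of fitting hard samples without overfitting regular frames.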

https://weibo.com/1402400261/L1odwly4W

Other papers worth noting:

[LG] MaGNET: Uniform Sampling from Deep Generative Network Manifolds Without Retraining

A I Humayun, R Balestriero, R Baraniuk

[Rice University]

https://weibo.com/1402400261/L1oiv7TuR

[CL] Reason first, then respond: Modular Generation for Knowledge-infused Dialogue

L Adolphs, K Shuster, J Urbanek, A Szlam, J Weston

[Facebook AI Research]

https://weibo.com/1402400261/L1oke8c5M

[CL] Machine-in-the-Loop Rewriting for Creative Image Captioning

V Padmakumar, H He

[New York University]

https://weibo.com/1402400261/L1oni7b6m

[CV] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

P Sarkar, A Etemad

[Queen’s University]

https://weibo.com/1402400261/L1otZcGD5
