LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: disentangled articulated neural body representations via graph neural networks; a lifelong benchmark for training and evaluating ever-evolving language models; a probabilistic interpretation of Transformers; the limitations of dataset balancing; foundation models for continual learning; modeling creative processes for algorithmic painting; finding scientific-evidence papers for science news; depth estimation with a simplified Transformer; controllable human-chair interaction

 

1、[CV] DANBO: Disentangled Articulated Neural Body Representations via Graph Neural Networks

S Su, T Bagautdinov, H Rhodin

[University of British Columbia & Reality Labs Research]

DANBO: disentangled articulated neural body representations via graph neural networks. Deep learning has greatly improved the realism of animatable human models by learning geometry and appearance from 3D scans, template meshes, and multi-view imagery, but high-resolution, photo-realistic avatars come at the cost of studio setups unavailable to end users. The goal here is to create avatars directly from raw images, without expensive studio capture or surface tracking. The few existing approaches of this kind generalize poorly and tend to learn spurious (chance) correlations between unrelated body parts, producing implausible deformations and missing body parts on unseen poses. This paper proposes a surface-free method for learning animatable human models from video that works on monocular recordings, avoids the limitations of templates and parametric models, and handles both indoor and outdoor footage. The three-stage approach introduces two inductive biases to better disentangle pose-dependent deformation: correlations between body parts are modeled explicitly with a graph neural network, and, to further reduce the effect of chance correlations, localized per-bone features are combined through a factorized volumetric representation and a new aggregation function. The model produces realistic body shapes under challenging unseen poses and yields high-quality image synthesis, striking a better trade-off between model capacity, expressiveness, and robustness than competing methods.

Deep learning greatly improved the realism of animatable human models by learning geometry and appearance from collections of 3D scans, template meshes, and multi-view imagery. High-resolution models enable photo-realistic avatars but at the cost of requiring studio settings not available to end users. Our goal is to create avatars directly from raw images without relying on expensive studio setups and surface tracking. While a few such approaches exist, those have limited generalization capabilities and are prone to learning spurious (chance) correlations between irrelevant body parts, resulting in implausible deformations and missing body parts on unseen poses. We introduce a three-stage method that induces two inductive biases to better disentangle pose-dependent deformation. First, we model correlations of body parts explicitly with a graph neural network. Second, to further reduce the effect of chance correlations, we introduce localized per-bone features that use a factorized volumetric representation and a new aggregation function. We demonstrate that our model produces realistic body shapes under challenging unseen poses and shows high-quality image synthesis. Our proposed representation strikes a better trade-off between model capacity, expressiveness, and robustness than competing methods. Project website: https://lemonatsu.github.io/danbo.
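Not the authors' implementation: as a toy sketch of the core idea of propagating information only along the skeleton with a graph neural network (the 4-bone chain, feature sizes, and random weights below are all made up), a single graph-convolution step looks like:

```python
import numpy as np

# Hypothetical 4-bone chain (e.g. root-spine-arm-hand); 1 = bones are neighbors.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                      # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # degree normalization

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6))                # stand-in for a learned weight matrix
pose = rng.normal(size=(4, 6))             # per-bone pose features (4 bones x 6 dims)

# One GCN layer: each bone's feature is updated from its skeletal neighbors only,
# so distant, unrelated bones cannot directly influence each other.
h = np.maximum(D_inv @ A_hat @ pose @ W, 0.0)
print(h.shape)  # (4, 6)
```

The point of the structure is visible even in this toy: the adjacency matrix hard-codes which body parts may exchange information, which is exactly the inductive bias that suppresses chance correlations between unrelated parts.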

https://arxiv.org/abs/2205.01666

 

2、[CL] TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

J Jang, S Ye, C Lee, S Yang, J Shin, J Han, G Kim, M Seo

[KAIST & LG AI Research & Korea University]

TemporalWiki: a lifelong benchmark for training and evaluating ever-evolving language models. Language models (LMs) become outdated as the world changes and often fail at tasks that require recent factual information that was absent or different during training, a phenomenon known as temporal misalignment. The problem is especially hard to study because the research community still lacks a coherent dataset for assessing how well LMs adapt to frequently updated knowledge corpora such as Wikipedia. This paper introduces TEMPORALWIKI, a lifelong benchmark for ever-evolving LMs that uses the differences between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark lets researchers periodically track an LM's ability to retain previous knowledge and acquire updated or new knowledge at each point in time. On this benchmark, training an LM on the diff data with continual learning methods reaches similar or better perplexity than training on the entire snapshot, at 12x lower computational cost, confirming that factual knowledge in LMs can be safely updated with minimal training data via continual learning.

Language Models (LMs) become outdated as the world changes; they often fail to perform tasks requiring recent factual information which was absent or different during training, a phenomenon called temporal misalignment. This is especially a challenging problem because the research community still lacks a coherent dataset for assessing the adaptability of LMs to frequently-updated knowledge corpus such as Wikipedia. To this end, we introduce TEMPORALWIKI, a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark hence allows researchers to periodically track an LM’s ability to retain previous knowledge and acquire updated/new knowledge at each point in time. We also find that training an LM on the diff data through continual learning methods achieves similar or better perplexity than on the entire snapshot in our benchmark with 12 times less computational cost, which verifies that factual knowledge in LMs can be safely updated with minimal training data via continual learning. The dataset and the code are publicly available.
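A minimal sketch of the diff idea (toy article texts, not the benchmark's actual pipeline): keep only the entries that changed or appeared between two snapshots, and train on those instead of the whole corpus:

```python
def snapshot_diff(old, new):
    """Return texts that are new or changed relative to the previous snapshot."""
    return [text for title, text in new.items() if old.get(title) != text]

# Toy snapshots: only the changed/new articles end up in the training diff.
old = {"Paris": "Capital of France.", "GPT": "A language model."}
new = {"Paris": "Capital of France.",
       "GPT": "A family of language models.",
       "DANBO": "A neural body model."}

diff = snapshot_diff(old, new)
print(diff)  # ['A family of language models.', 'A neural body model.']
```

Because the unchanged "Paris" entry is skipped, the diff is a small fraction of the snapshot, which is where the reported 12x compute saving comes from.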

https://arxiv.org/abs/2204.14211

 

3、[LG] A Probabilistic Interpretation of Transformers

A Shim

[ML Collective]

A probabilistic interpretation of Transformers. The paper proposes a probabilistic interpretation of Transformers' exponential dot-product attention, and of contrastive learning, based on exponential families. A Transformer's attention sublayer is equivalent to a gradient-ascent step on the log normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by the contraction from layer normalization. The paper also states the theoretical limitations of this theory and of the Hopfield theory, and suggests directions for resolving them.

We propose a probabilistic interpretation of exponential dot product attention of transformers and contrastive learning based off of exponential families. The attention sublayer of transformers is equivalent to a gradient ascent step of the log normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by a contraction from layer normalization. We also state theoretical limitations of our theory and the Hopfield theory and suggest directions for resolution.
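The stated equivalence is easy to check numerically: the softmax-attention readout equals the gradient of the log-sum-exp normalizer with respect to the query. A small sketch (toy dimensions; keys double as values, as in the Hopfield view):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=3)         # query
K = rng.normal(size=(5, 3))    # keys, also used as values

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Exponential dot-product attention: softmax(Kq)^T K
scores = K @ q
attn = np.exp(scores - logsumexp(scores)) @ K

# Central-difference gradient of q -> logsumexp(Kq) matches the attention output.
eps = 1e-6
grad = np.array([(logsumexp(K @ (q + eps * e)) - logsumexp(K @ (q - eps * e))) / (2 * eps)
                 for e in np.eye(3)])
print(np.allclose(grad, attn, atol=1e-4))  # True
```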

https://arxiv.org/abs/2205.01080

 

4、[CL] On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

R Schwartz, G Stanovsky

[The Hebrew University of Jerusalem]

On the limitations of dataset balancing: the lost battle against spurious correlations. Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and poor generalization. A common mitigation is to balance datasets by adding new instances or filtering out "easy" ones, culminating in a recent proposal to eliminate single-word correlations altogether. This opinion paper observes that, despite these efforts, increasingly powerful models keep exploiting ever-smaller spurious correlations, so even balancing all single-word features is insufficient to mitigate them all. At the same time, a truly balanced dataset may be bound to throw the baby out with the bathwater, discarding important signal that encodes common sense and world knowledge. The paper highlights several alternatives to dataset balancing, focusing on enriching datasets with more context, allowing models to abstain and to interact with users, and moving from large-scale fine-tuning to zero- or few-shot setups.

Recent work has shown that deep learning models in NLP are highly sensitive to low level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out “easy” instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to “throw the baby out with the bathwater” and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.

https://arxiv.org/abs/2204.12708

 

5、[LG] Foundational Models for Continual Learning: An Empirical Study of Latent Replay

O Ostapenko, T Lesort, P Rodríguez, M R Arefin, A Douillard, I Rish, L Charlin

[Mila & ServiceNow & Heuritech]

Foundation models for continual learning: an empirical study of latent replay. Rapid progress in large-scale pre-training has produced foundation models that act as effective feature extractors across many downstream tasks and domains. Motivated by this, the paper studies the usefulness of pre-trained vision models as a foundation for downstream continual learning (CL) scenarios. The goals are twofold: to understand the compute-accuracy trade-off between CL in the raw-data space versus in the latent space of pre-trained encoders, and to investigate how the encoder, the pre-training algorithm and data, and the resulting latent space affect CL performance. Various pre-trained models are compared in large-scale benchmark scenarios under a vanilla replay setting applied in both the latent and the raw-data space. Notably, the study shows that transfer, forgetting, task similarity, and learning depend on the characteristics of the input data, not necessarily on the CL algorithm. In some circumstances, reasonable CL performance can readily be achieved with a non-parametric classifier at negligible compute, and models pre-trained on broader data yield better performance across replay sizes, which is explained via the representational similarity and transfer properties of these representations. The study also shows the effectiveness of self-supervised (SSL) pre-training for downstream domains that are out-of-distribution relative to the pre-training domain, and points out and validates several research directions, including representation ensembling, that can further improve latent CL. The diverse set of datasets used in this study can serve as a compute-efficient playground for further CL research.

Rapid development of large-scale pre-training has resulted in foundation models that can act as effective feature extractors on a variety of downstream tasks and domains. Motivated by this, we study the efficacy of pre-trained vision models as a foundation for downstream continual learning (CL) scenarios. Our goal is twofold. First, we want to understand the compute-accuracy trade-off between CL in the raw-data space and in the latent space of pre-trained encoders. Second, we investigate how the characteristics of the encoder, the pre-training algorithm and data, as well as of the resulting latent space affect CL performance. For this, we compare the efficacy of various pre-trained models in large-scale benchmarking scenarios with a vanilla replay setting applied in the latent and in the raw-data space. Notably, this study shows how transfer, forgetting, task similarity and learning are dependent on the input data characteristics and not necessarily on the CL algorithms. First, we show that under some circumstances reasonable CL performance can readily be achieved with a non-parametric classifier at negligible compute. We then show how models pre-trained on broader data result in better performance for various replay sizes. We explain this with representational similarity and transfer properties of these representations. Finally, we show the effectiveness of self-supervised (SSL) pre-training for downstream domains that are out-of-distribution as compared to the pre-training domain. We point out and validate several research directions that can further increase the efficacy of latent CL including representation ensembling. The diverse set of datasets used in this study can serve as a compute-efficient playground for further CL research. Codebase is available under https://github.com/oleksost/latent_CL.
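One finding (that a non-parametric classifier on frozen features can reach reasonable performance at negligible compute) can be illustrated with a nearest-class-mean classifier. This is a generic sketch, not the paper's code; the features below are synthetic stand-ins for latent codes from a frozen pre-trained encoder:

```python
import numpy as np

def ncm_fit(feats, labels):
    """Nearest-class-mean: store one prototype (mean feature) per class seen so far."""
    return {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def ncm_predict(protos, x):
    classes = list(protos)
    dists = [np.linalg.norm(x - protos[c]) for c in classes]
    return classes[int(np.argmin(dists))]

# Synthetic "latent" features: two well-separated clusters.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(+1.0, 0.1, size=(20, 8)),   # class 0
                   rng.normal(-1.0, 0.1, size=(20, 8))])  # class 1
labels = np.array([0] * 20 + [1] * 20)

protos = ncm_fit(feats, labels)
print(ncm_predict(protos, np.ones(8)))   # 0
```

Because prototypes for new classes can be added without touching old ones, this classifier never forgets, which is why it is a strong cheap baseline in latent-space CL.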

https://arxiv.org/abs/2205.00329

 

A few more papers worth noting:

 

[CV] Toward Modeling Creative Processes for Algorithmic Painting

Toward modeling creative processes for algorithmic painting

A Hertzmann

[Adobe Research]

https://arxiv.org/abs/2205.01605

 

[IR] SciEv: Finding Scientific Evidence Papers for Scientific News

SciEv: finding scientific evidence papers for science news

M R U Hoque, J Li, J Wu

[Old Dominion University]

https://arxiv.org/abs/2205.00126

 

[CV] Depth Estimation with Simplified Transformer

Depth estimation with a simplified Transformer

J Yang, L An, A Dixit, J Koo, S I Park

[NVIDIA]

https://arxiv.org/abs/2204.13791

 

[CV] COUCH: Towards Controllable Human-Chair Interactions

COUCH: towards controllable human-chair interactions

X Zhang, B L Bhatnagar, V Guzov, S Starke, G Pons-Moll

[University of Tubingen & Electronic Arts]

https://arxiv.org/abs/2205.00541

 

If any images included here raise copyright concerns, please contact us promptly for removal.