LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics
Reposted from 爱可可-爱生活
1、[LG] Extending the WILDS Benchmark for Unsupervised Adaptation
S Sagawa, P W Koh, T Lee, I Gao, S M Xie, K Shen, A Kumar, W Hu, M Yasunaga, H Marklund, S Beery, E David, I Stavness, W Guo, J Leskovec, K Saenko, T Hashimoto, S Levine, C Finn, P Liang
[Stanford University & Caltech]
Extending the WILDS Benchmark for Unsupervised Adaptation. Machine learning systems deployed in real-world settings are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful lever for mitigating these distribution shifts, as it is often far more available than labeled data. However, existing distribution-shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in practice. This paper presents the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS distribution-shift benchmark to include curated unlabeled data that would be realistically obtainable in deployment. For consistency, the labeled training, validation, and test sets, as well as the evaluation metrics, are kept exactly the same as in the original WILDS benchmark. The datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). State-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, are systematically benchmarked, and their success on WILDS is shown to be limited. To facilitate method development and evaluation, an open-source package is provided that automates data loading and contains all the model architectures and methods used in the paper.
Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data. However, existing distribution shift benchmarks for unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the Wilds 2.0 update, which extends 8 of the 10 datasets in the Wilds benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. To maintain consistency, the labeled training, validation, and test sets, as well as the evaluation metrics, are exactly the same as in the original Wilds benchmark. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on Wilds is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.
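The open-source package referenced above is the wilds Python package. Below is a minimal sketch of how the labeled and unlabeled splits might be loaded; the unlabeled=True flag and the split name "extra_unlabeled" follow the WILDS 2.0 documentation, but split names differ per dataset, so treat them as assumptions and check https://wilds.stanford.edu.

```python
# Sketch: loading labeled + unlabeled WILDS 2.0 data (split names are assumptions).
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader
from torchvision import transforms

transform = transforms.Compose([transforms.Resize((448, 448)), transforms.ToTensor()])

# Labeled benchmark data: identical to the original WILDS splits.
labeled = get_dataset(dataset="iwildcam", download=True)
train_data = labeled.get_subset("train", transform=transform)
train_loader = get_train_loader("standard", train_data, batch_size=16)

# WILDS 2.0 additionally exposes curated unlabeled data behind unlabeled=True.
unlabeled = get_dataset(dataset="iwildcam", unlabeled=True, download=True)
unlabeled_data = unlabeled.get_subset("extra_unlabeled", transform=transform)
unlabeled_loader = get_train_loader("standard", unlabeled_data, batch_size=16)

for (x, y, metadata), (x_u, metadata_u) in zip(train_loader, unlabeled_loader):
    # Supervised loss on (x, y); self-training / consistency loss on x_u.
    pass
```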
2、[LG] DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization
A Kumar, R Agarwal, T Ma, A Courville, G Tucker, S Levine
[UC Berkeley & Google Research & MILA & Stanford University]
DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization. Despite overparameterization, deep networks trained via supervised learning are easy to optimize and generalize remarkably well. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. This paper argues that the implicit regularization effect of SGD seen in supervised learning can in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Theoretical analysis shows that when existing models of implicit regularization are applied to temporal-difference learning, the resulting regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised-learning case. These findings are supported empirically: feature representations learned by deep network value functions trained via bootstrapping do indeed become degenerate, aliasing the representations of state-action pairs that appear on either side of the Bellman backup. To address this, the paper derives the form of this implicit regularizer and, inspired by it, proposes DR3, a simple and effective explicit regularizer that counteracts the implicit regularizer's undesirable effects. Combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains, and robotic manipulation from images.
Despite overparameterization, deep networks trained via supervised learning are easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive “aliasing”, in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains and robotic manipulation from images.
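Based on the description above, the DR3 penalty amounts to a dot-product regularizer on the features appearing on the two sides of the Bellman backup. Here is a minimal PyTorch sketch; the coefficient name beta and the surrounding TD loss in the comments are illustrative assumptions, not the paper's exact training code.

```python
import torch

def dr3_penalty(phi_sa: torch.Tensor, phi_next: torch.Tensor) -> torch.Tensor:
    """Explicit regularizer in the spirit of DR3: penalize the dot product between
    the penultimate-layer features of (s, a) and of the bootstrapped (s', a'),
    discouraging the feature co-adaptation ("aliasing") described in the paper.

    phi_sa, phi_next: feature batches of shape [B, d].
    """
    return (phi_sa * phi_next).sum(dim=-1).mean()

# Illustrative use inside an offline RL objective (names are assumptions):
#   td_loss = ((q_pred - (reward + gamma * q_target)) ** 2).mean()
#   loss = td_loss + beta * dr3_penalty(phi_sa, phi_next)
```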
3、[CL] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
N Du, Y Huang, A M. Dai, S Tong, D Lepikhin, Y Xu, M Krikun, Y Zhou, A W Yu, O Firat, B Zoph, L Fedus, M Bosma, Z Zhou, T Wang, Y E Wang, K Webster, M Pellat, K Robinson, K Meier-Hellstern, T Duke, L Dixon, K Zhang, Q V Le, Y Wu, Z Chen, C Cui
[Google]
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant computing resources. This paper proposes a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost than dense variants. The largest GLaM has 1.2 trillion parameters, roughly 7x larger than GPT-3, yet it consumes only 1/3 of the energy used to train GPT-3 and requires half the compute for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
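To illustrate the sparsely activated mixture-of-experts idea, here is a minimal top-2-routing MoE feed-forward layer in PyTorch. This is a generic sketch, not GLaM's implementation (which adds expert parallelism, capacity limits, and load-balancing losses); all sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparsely activated MoE feed-forward layer with top-k token routing (sketch)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 64, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        gate_logits = self.router(x)                      # [tokens, num_experts]
        weights, indices = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Each token only runs through k of the num_experts expert MLPs, which is what lets parameter count grow without a proportional increase in per-token compute.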
4、[CV] FLAVA: A Foundational Language And Vision Alignment Model
A Singh, R Hu, V Goswami, G Couairon, W Galuba, M Rohrbach, D Kiela
[Facebook AI Research (FAIR)]
FLAVA: A Foundational Language And Vision Alignment Model. State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to perform well on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model that serves as a "foundation" and targets all modalities at once: a true vision-and-language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision-and-language tasks. This paper introduces FLAVA as such a model and demonstrates impressive performance on 35 tasks spanning these target modalities.
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a “foundation”, that targets all modalities at once—a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
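FLAVA's cross-modal component relies on a contrastive image-text alignment objective. Below is a generic sketch of such a symmetric contrastive loss; FLAVA additionally trains with masked unimodal and multimodal objectives not shown here, and the temperature value is a placeholder assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss (CLIP-style sketch).

    image_emb, text_emb: [batch, dim] projections of matching image/text pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # [batch, batch] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; contrast each image against all texts and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```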
5、[CV] PP-MSVSR: Multi-Stage Video Super-Resolution
L Jiang, N Wang, Q Dang, R Liu, B Lai
[Baidu]
PP-MSVSR: Multi-Stage Video Super-Resolution. Unlike the single-image super-resolution (SISR) task, the key to video super-resolution (VSR) is to make full use of complementary information across frames to reconstruct the high-resolution sequence. Because frames differ in motion and scene content, accurately aligning multiple frames and effectively fusing them has long been central to VSR research. To exploit the rich complementary information of neighboring frames, this paper proposes PP-MSVSR, a multi-stage VSR deep architecture with a local fusion module, an auxiliary loss, and a re-align module that progressively refine the enhanced result. To strengthen cross-frame feature fusion during feature propagation, a local fusion module in stage 1 performs local feature fusion before propagation. An auxiliary loss in stage 2 makes the features produced by the propagation module retain more information correlated with the HR space, and a re-align module in stage 3 makes full use of the features from the previous stage. Extensive experiments show that PP-MSVSR achieves promising performance on the Vid4 dataset, reaching a PSNR of 28.13 dB with only 1.45M parameters, while PP-MSVSR-L surpasses all state-of-the-art methods on the REDS4 dataset with a considerable number of parameters.
Different from the Single Image Super-Resolution (SISR) task, the key for the Video Super-Resolution (VSR) task is to make full use of complementary information across frames to reconstruct the high-resolution sequence. Since frames differ in motion and scene, accurately aligning multiple frames and effectively fusing them has always been a key research problem for VSR. To utilize the rich complementary information of neighboring frames, in this paper, we propose a multi-stage VSR deep architecture, dubbed PP-MSVSR, with a local fusion module, an auxiliary loss and a re-align module to refine the enhanced result progressively. Specifically, in order to strengthen the fusion of features across frames in feature propagation, a local fusion module is designed in stage 1 to perform local feature fusion before feature propagation. Moreover, we introduce an auxiliary loss in stage 2 to make the features obtained by the propagation module reserve more information correlated with the HR space, and introduce a re-align module in stage 3 to make full use of the feature information of the previous stage. Extensive experiments substantiate that PP-MSVSR achieves a promising performance on the Vid4 dataset, reaching a PSNR of 28.13 dB with only 1.45M parameters. Moreover, PP-MSVSR-L exceeds all state-of-the-art methods on the REDS4 dataset with a considerable number of parameters.
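For reference, the 28.13 dB figure above is a peak signal-to-noise ratio. Here is a minimal sketch of the standard PSNR computation; benchmark-specific details such as color-space conversion and border cropping follow each dataset's protocol and are omitted.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB between a super-resolved frame and its
    ground-truth high-resolution frame, both scaled to [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10((max_val ** 2) / mse)
```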
Other papers worth noting:
[CV] Neural Radiance Fields for Outdoor Scene Relighting
V Rudnev, M Elgharib, W Smith, L Liu, V Golyanik, C Theobalt
[MPI for Informatics & University of York]
[CL] Causal Distillation for Language Models
Z Wu, A Geiger, J Rozner, E Kreiss, H Lu, T Icard, C Potts, N D. Goodman
[Stanford University]
[CV] HairCLIP: Design Your Hair by Text and Reference Image
T Wei, D Chen, W Zhou, J Liao, Z Tan, L Yuan, W Zhang, N Yu
[University of Science and Technology of China & Microsoft Cloud AI & City University of Hong Kong]
[CV] CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
C Wang, M Chai, M He, D Chen, J Liao
[City University of Hong Kong & Snap Inc & USC Institute for Creative Technologies & Microsoft Cloud AI]