LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics   IR - Information Retrieval

Reposted from 爱可可爱生活

Summary: ItemSage, product embedding learning for shopping recommendations at Pinterest; an evolutionary approach to the dynamic introduction of tasks in large-scale multitask learning systems; why GANs are overkill for NLP; StylizedNeRF, consistent 3D scene stylization via 2D-3D mutual learning; learning more compressible models with sharpness-aware minimization; open benchmarking for recommender systems; equivariant mesh attention networks; understanding gradient descent at the edge of stability in deep learning; fine-grained image captioning with CLIP reward.

 

1. [IR] ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest

P Baltescu, H Chen, N Pancha, A Zhai, J Leskovec, C Rosenberg

[Pinterest]


Learned embeddings for products are an important building block for web-scale e-commerce recommendation systems. At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases including user, image and search based recommendations. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost. While most prior work focuses on building product embeddings from features coming from a single modality, we introduce a transformer-based architecture capable of aggregating information from both text and image modalities and show that it significantly outperforms single modality baselines. We also utilize multi-task learning to make ItemSage optimized for several engagement types, leading to a candidate generation system that is efficient for all of the engagement objectives of the end-to-end recommendation system. Extensive offline experiments are conducted to illustrate the effectiveness of our approach and results from online A/B experiments show substantial gains in key business metrics (up to +7% gross merchandise value/user and +11% click volume).
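The core idea — pooling features from both the text and image modalities into a single product embedding — can be sketched in a few lines of pure Python. This is a toy stand-in for the paper's transformer encoder (the learned query, multiple features per modality, and the multi-task engagement heads are omitted; all names here are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aggregate_modalities(text_feats, image_feats, query):
    """Attention-pool text and image feature vectors into one product
    embedding (a toy stand-in for the transformer aggregator)."""
    feats = text_feats + image_feats
    weights = softmax([dot(query, f) for f in feats])
    dim = len(feats[0])
    return [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]

# one product with a single text feature and a single image feature
emb = aggregate_modalities([[1.0, 0.0]], [[0.0, 1.0]], query=[1.0, 0.0])
```

In the actual system a transformer attends across many per-modality features, and multi-task learning attaches one training objective per engagement type (e.g. clicks, saves, checkouts) to the same shared embedding.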

https://arxiv.org/abs/2205.11728

 

2. [LG] An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems

A Gesmundo, J Dean

[Google Research]


Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage model size and data scale rather than scaling the number of tasks. Also, continual learning, which adds a temporal aspect to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting instead of being studied at large scale as a critical component for building the next generation of artificial intelligence. We propose an evolutionary method that can generate a large-scale multitask model and can support the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data on competitive tasks such as CIFAR-10: 99.43%.
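The knowledge-compartmentalization idea — a new task inherits a parent task's routing, freezes everything it inherited, and clones any component it wants to retrain — can be illustrated with a minimal pure-Python sketch (purely schematic; component contents, scoring, and the evolutionary search itself are omitted, and all names are made up):

```python
class MultitaskModel:
    """Toy sketch: each task routes through a list of named components.
    New tasks clone a parent's route and may clone-and-mutate a layer,
    but never modify components used by existing tasks, so earlier
    tasks cannot be catastrophically forgotten."""

    def __init__(self):
        self.components = {"stem": {"frozen": False}}
        self.routes = {}  # task name -> list of component names

    def add_task(self, task, parent=None, mutate_layer=None):
        if parent is None:
            route = ["stem"]
        else:
            route = list(self.routes[parent])
            # freeze everything inherited from the parent
            for name in route:
                self.components[name]["frozen"] = True
            if mutate_layer is not None:
                # clone-and-mutate: copy a component so the new task can
                # train it without touching the parent's version
                old = route[mutate_layer]
                new = old + "@" + task
                self.components[new] = {"frozen": False}
                route[mutate_layer] = new
        self.routes[task] = route
```

Because each task only ever adds new components or reads frozen ones, per-task compute and added parameters stay bounded as the model grows, and gradient interference between tasks is avoided by construction.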

https://arxiv.org/abs/2205.12755

 

3. [LG] Why GANs are overkill for NLP

D Alvarez-Melis, V Garg, A T Kalai

[Microsoft Research & Aalto University]


This work offers a novel theoretical perspective on why, despite numerous attempts, adversarial approaches to generative modeling (e.g., GANs) have not been as popular for certain generation tasks, particularly sequential tasks such as Natural Language Generation, as they have in others, such as Computer Vision. In particular, on sequential data such as text, maximum-likelihood approaches are significantly more utilized than GANs. We show that, while it may seem that maximizing likelihood is inherently different than minimizing distinguishability, this distinction is largely artificial and only holds for limited models. We argue that minimizing KL-divergence (i.e., maximizing likelihood) is a more efficient approach to effectively minimizing the same distinguishability criteria that adversarial models seek to optimize. Reductions show that minimizing distinguishability can be seen as simply boosting likelihood for certain families of models including n-gram models and neural networks with a softmax output layer. To achieve a full polynomial-time reduction, a novel next-token distinguishability model is considered.
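The link between the two objectives is easy to check numerically: the optimal single-sample discriminator's accuracy is determined by the total variation distance, and Pinsker's inequality bounds total variation by the KL divergence that likelihood training minimizes. A quick illustrative sketch (not the paper's polynomial-time reduction):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def best_discriminator_accuracy(p, q):
    """Accuracy of the optimal discriminator shown one sample drawn from
    p (real) or q (model) with equal probability: 1/2 + TV(p, q)/2."""
    return 0.5 + total_variation(p, q) / 2

p = [0.5, 0.5]  # "real" next-token distribution
q = [0.9, 0.1]  # model's distribution
# Pinsker's inequality: TV(p, q) <= sqrt(KL(p || q) / 2), so driving KL
# to zero (maximizing likelihood) forces the best discriminator's
# accuracy down to chance (0.5) -- the same criterion GANs optimize.
assert total_variation(p, q) <= math.sqrt(kl(p, q) / 2)
```

In this sense, minimizing KL divergence already attacks the distinguishability objective directly, without training a separate discriminator.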

https://arxiv.org/abs/2205.09838

 

4. [CV] StylizedNeRF: Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning

Y Huang, Y He, Y Yuan, Y Lai, L Gao

[Chinese Academy of Sciences & Cardiff University]


3D scene stylization aims at generating stylized images of the scene from arbitrary novel views following a given set of style examples, while ensuring consistency when rendered from different views. Directly applying methods for image or video stylization to 3D scenes cannot achieve such consistency. Thanks to recently proposed neural radiance fields (NeRF), we are able to represent a 3D scene in a consistent way. Consistent 3D scene stylization can be effectively achieved by stylizing the corresponding NeRF. However, there is a significant domain gap between style examples which are 2D images and NeRF which is an implicit volumetric representation. To address this problem, we propose a novel mutual learning framework for 3D scene stylization that combines a 2D image stylization network and NeRF to fuse the stylization ability of 2D stylization network with the 3D consistency of NeRF. We first pre-train a standard NeRF of the 3D scene to be stylized and replace its color prediction module with a style network to obtain a stylized NeRF. It is followed by distilling the prior knowledge of spatial consistency from NeRF to the 2D stylization network through an introduced consistency loss. We also introduce a mimic loss to supervise the mutual learning of the NeRF style module and fine-tune the 2D stylization decoder. In order to further make our model handle ambiguities of 2D stylization results, we introduce learnable latent codes that obey the probability distributions conditioned on the style. They are attached to training samples as conditional inputs to better learn the style module in our novel stylized NeRF. Experimental results demonstrate that our method is superior to existing approaches in both visual quality and long-range consistency.
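The two mutual-learning signals can be written schematically as simple per-pixel losses (an illustrative sketch only; the actual losses operate on rendered images from sampled camera poses, and the function names here are made up):

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def consistency_loss(stylized_2d_views, nerf_render):
    """Distill NeRF's spatial consistency into the 2D stylization network:
    the 2D stylizations of the same scene content seen from different
    views are pulled toward the single view-consistent stylized-NeRF
    rendering."""
    return sum(mse(v, nerf_render) for v in stylized_2d_views) / len(stylized_2d_views)

def mimic_loss(nerf_render, stylized_2d):
    """Supervise the stylized NeRF's style module (which replaced the
    color prediction module) to mimic the 2D stylization network's
    output for the same view."""
    return mse(nerf_render, stylized_2d)
```

The two losses pull in opposite directions — the 2D network gains 3D consistency, while the NeRF's style module gains stylization ability — which is what makes the training "mutual."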

https://arxiv.org/abs/2205.12183

 

5. [CL] Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models

C Na, S V Mehta, E Strubell

[CMU]


Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models that is compatible with a wide variety of contemporary off-the-shelf hardware (unlike quantization), and that requires little additional training (unlike distillation). Pruning approaches typically take a large, accurate model as input, then attempt to discover a smaller subnetwork of that model capable of achieving end-task accuracy comparable to the full model. Inspired by previous work suggesting a connection between simpler, more generalizable models and those that lie within flat basins in the loss landscape, we propose to directly optimize for flat minima while performing task-specific pruning, which we hypothesize should lead to simpler parameterizations and thus more compressible models. In experiments combining sharpness-aware minimization with both iterative magnitude pruning and structured pruning approaches, we show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization when fine-tuning BERT models, leading to higher rates of compression with little to no loss in accuracy on the GLUE classification benchmark.
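SAM's two-step update (perturb the weights toward higher loss, then descend using the gradient taken at the perturbed point) and magnitude pruning are both easy to sketch. This toy assumes a gradient oracle on a flat parameter vector and is not the paper's BERT fine-tuning setup:

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step: move to the worst-case
    point within radius rho, then apply that point's gradient to w."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    return [wi - lr * gi for wi, gi in zip(w, grad_fn(w_adv))]

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    k = int(len(w) * sparsity)
    thresh = sorted(abs(x) for x in w)[k - 1] if k > 0 else -1.0
    return [0.0 if abs(x) <= thresh else x for x in w]

# toy loss L(w) = sum(w_i^2): one SAM step lowers the loss, then
# magnitude pruning zeros half of the (distinct-magnitude) weights
grad = lambda w: [2.0 * x for x in w]
w = sam_step([1.0, -0.5], grad)
w = magnitude_prune(w, sparsity=0.5)
```

The paper's finding is that running updates like the first function during fine-tuning leaves weights in flatter basins, which survive the second function (and its structured-pruning variants) at higher sparsity with little accuracy loss.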

https://arxiv.org/abs/2205.12694

 

A few more papers worth noting:

 

[IR] BARS: Towards Open Benchmarking for Recommender Systems


J Zhu, Q Dai, L Su, R Ma, J Liu, G Cai, X Xiao, R Zhang

[Huawei Noah’s Ark Lab & Tsinghua University & The Chinese University of Hong Kong]

https://arxiv.org/abs/2205.09626

 

[LG] Equivariant Mesh Attention Networks


S Basu, J Gallego-Posada, F Viganò, J Rowbottom, T Cohen

[University of Illinois at Urbana-Champaign & Université de Montréal & Imperial College London & Qualcomm AI Research]

https://arxiv.org/abs/2205.10662

 

[LG] Understanding Gradient Descent on Edge of Stability in Deep Learning


S Arora, Z Li, A Panigrahi

[Princeton University]

https://arxiv.org/abs/2205.09745

 

[CL] Fine-grained Image Captioning with CLIP Reward


J Cho, S Yoon, A Kale, F Dernoncourt, T Bui, M Bansal

[UNC Chapel Hill & Adobe Research]

https://arxiv.org/abs/2205.13115

 

If any images in this content raise copyright concerns, please contact us promptly so they can be removed.