LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics; CO - Computational Methods
Reposted from 爱可可爱生活
1、[CL] A Survey on Gender Bias in Natural Language Processing
K Stanczak, I Augenstein
[University of Copenhagen]
A survey on gender bias in natural language processing. Language can be used as a means of reproducing and enforcing harmful stereotypes and biases, and has been analysed as such in many studies. This paper presents a survey of 304 papers on gender bias in NLP. It analyses definitions of gender and its categories within the social sciences and connects them to formal definitions of gender bias in NLP research. It surveys the lexica and datasets used in gender-bias research, and compares and contrasts approaches to detecting and mitigating gender bias. The survey identifies four core limitations of this research: 1) most work treats gender as a binary variable, neglecting its fluidity and continuity; 2) most work is conducted in monolingual setups for English or other high-resource languages; 3) despite the many papers on gender bias in NLP methods, most newly developed algorithms are not tested for bias and disregard possible ethical considerations; 4) the methodologies developed in this line of research are fundamentally flawed, covering very limited definitions of gender bias and lacking evaluation baselines and pipelines. Overcoming these limitations is a necessary direction for future research.
Language can be used as a means of reproducing and enforcing harmful stereotypes and biases and has been analysed as such in numerous studies. In this paper, we present a survey of 304 papers on gender bias in natural language processing. We analyse definitions of gender and its categories within social sciences and connect them to formal definitions of gender bias in NLP research. We survey lexica and datasets applied in research on gender bias and then compare and contrast approaches to detecting and mitigating gender bias. We find that research on gender bias suffers from four core limitations. 1) Most research treats gender as a binary variable neglecting its fluidity and continuity. 2) Most of the work has been conducted in monolingual setups for English or other high-resource languages. 3) Despite a myriad of papers on gender bias in NLP methods, we find that most of the newly developed algorithms do not test their models for bias and disregard possible ethical considerations of their work. 4) Finally, methodologies developed in this line of research are fundamentally flawed covering very limited definitions of gender bias and lacking evaluation baselines and pipelines. We see overcoming these limitations as a necessary development in future research.
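As one concrete example of the detection methods the survey compares, here is a minimal association-test sketch in the spirit of WEAT (my own toy illustration with random stand-in vectors, not code from the paper): score how much closer a target word's embedding is to one set of gendered anchor words than to the other.

```python
import numpy as np

def cos(u, v):
    # Cosine similarity between two embedding vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word_vec, female_vecs, male_vecs):
    # Positive score: closer to the female anchors; negative: to the male ones.
    return (np.mean([cos(word_vec, f) for f in female_vecs])
            - np.mean([cos(word_vec, m) for m in male_vecs]))

# Random 4-d vectors stand in for real word embeddings ("she"/"woman" vs "he"/"man").
rng = np.random.default_rng(1)
female_vecs = [rng.normal(size=4) for _ in range(2)]
male_vecs = [rng.normal(size=4) for _ in range(2)]
nurse = rng.normal(size=4)  # hypothetical target occupation word
print(association(nurse, female_vecs, male_vecs))
```

With real embeddings, a consistently positive score for occupation words like "nurse" and negative for "engineer" is the kind of signal these detection methods quantify.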
2、[CV] StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
I Skorokhodov, S Tulyakov, M Elhoseiny
[KAUST & Snap Inc.]
StyleGAN-V: a continuous video generator with the cost, image quality, and perks of StyleGAN2. Videos show continuous events, yet most (if not all) video synthesis frameworks treat them discretely in time. This paper argues that videos should be treated as time-continuous signals, and extends the paradigm of neural representations to build a continuous-time video generator. Continuous motion representations are designed through the lens of positional embeddings. The paper explores training on very sparse videos and demonstrates that a good generator can be learned from as few as 2 frames per clip. It rethinks the traditional pair of image and video discriminators and proposes a single hypernetwork-based discriminator, which decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024² videos for the first time. The model is built on top of StyleGAN2 and is only ≈5% more expensive to train at the same resolution, while achieving almost the same image quality. The resulting latent space has similar properties, enabling spatial manipulations that propagate in time. The model can generate arbitrarily long videos at arbitrarily high frame rates, whereas prior work struggles to generate even 64 frames at a fixed rate. It achieves state-of-the-art results on four modern 256² video synthesis benchmarks and one 1024² benchmark.
Videos show continuous events, yet most — if not all — video synthesis frameworks treat them discretely in time. In this work, we think of videos of what they should be — time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image and video discriminators pair and propose to use a single hypernetwork-based one. This decreases the training cost and provides richer learning signal to the generator, making it possible to train directly on 1024² videos for the first time. We build our model on top of StyleGAN2 and it is just ≈5% more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrary high frame rate, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256² video synthesis benchmarks and one 1024² resolution one.
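A minimal sketch of the continuous-time idea (my own NumPy illustration, not the authors' architecture): sinusoidal positional embeddings evaluated at arbitrary real-valued timestamps, so frames can be sampled at irregular times, e.g. only 2 per clip.

```python
import numpy as np

def time_embedding(t, dim=16, max_period=64.0):
    """Sinusoidal embedding of a continuous timestamp t.

    Because t is real-valued rather than a frame index, frames can be
    drawn at arbitrary, irregular times -- the basic property a
    continuous-time generator needs. Hypothetical helper, not the
    paper's exact motion representation.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Two frames per clip at irregular timestamps, as in sparse training:
emb_a = time_embedding(0.37)
emb_b = time_embedding(2.91)
print(emb_a.shape, emb_b.shape)  # (16,) (16,)
```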
3、[CV] A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision
A Tejankar, B Wu, S Xie, M Khabsa, H Pirsiavash, H Firooz
[University of Maryland & Facebook AI & UC Davis]
A Fistful of Words: learning transferable visual models from bag-of-words supervision. Using natural language as supervision for training visual recognition models holds great promise. Recent work has shown that if such supervision is used in the form of alignment between images and captions in large training datasets, the resulting aligned models perform well on zero-shot classification as a downstream task. This paper focuses on teasing out which parts of the language supervision are essential for training zero-shot image classification models. Extensive and careful experiments lead to two conclusions: 1) a simple bag-of-words (BoW) caption can replace most of the image captions in the dataset; surprisingly, when combined with word balancing, this approach improves zero-shot classification performance; 2) with a BoW-pretrained model, more training data can be obtained by generating pseudo-BoW captions for images that have no caption. Models trained on images with real and pseudo-BoW captions achieve stronger zero-shot performance. On ImageNet-1k zero-shot evaluation, the best model, using only 3M image-caption pairs, performs on par with a CLIP model trained on 15M image-caption pairs (31.5% vs. 31.3%).
Using natural language as a supervision for training visual recognition models holds great promise. Recent works have shown that if such supervision is used in the form of alignment between images and captions in large training datasets, then the resulting aligned models perform well on zero-shot classification as downstream tasks. In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models. Through extensive and careful experiments, we show that: 1) A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset. Surprisingly, we observe that this approach improves the zero-shot classification performance when combined with word balancing. 2) Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption. Models trained on images with real and pseudo-BoW captions achieve stronger zero-shot performance. On ImageNet-1k zero-shot evaluation, our best model, that uses only 3M image-caption pairs, performs on-par with a CLIP model trained on 15M image-caption pairs (31.5% vs 31.3%).
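To make the BoW-caption idea concrete, here is a minimal sketch (a hypothetical helper of my own, not the paper's code) that reduces a caption to its content words, with a crude frequency-based stand-in for word balancing: rarer words are kept first, since frequent words otherwise dominate the supervision.

```python
from collections import Counter

STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "is", "are"}

def bow_caption(caption, corpus_counts, max_words=8):
    """Reduce a caption to an order-free bag of content words.

    Sorting by ascending corpus frequency is a toy proxy for word
    balancing: rare, informative words are retained preferentially.
    """
    words = {w for w in caption.lower().split() if w not in STOPWORDS}
    ranked = sorted(words, key=lambda w: corpus_counts.get(w, 0))
    return " ".join(ranked[:max_words])

# Toy corpus frequencies standing in for dataset-wide word counts.
corpus_counts = Counter({"dog": 900, "grass": 400, "frisbee": 30, "catching": 120})
print(bow_caption("A dog catching a frisbee on the grass", corpus_counts))
# -> "frisbee catching grass dog"
```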
4、[CO] Efficient Automatic Differentiation of Implicit Functions
C Margossian, M Betancourt
[Columbia University]
Efficient automatic differentiation of implicit functions. Derivative-based algorithms are ubiquitous in statistics, machine learning, and applied mathematics. Automatic differentiation offers an algorithmic way to efficiently evaluate derivatives from the computer programs that execute the relevant functions. However, implementing automatic differentiation for programs that incorporate implicit functions, such as the solution to an algebraic or differential equation, requires particular care. Contemporary applications typically appeal either to the implicit function theorem or, in certain circumstances, to specialized adjoint methods. This paper shows that both approaches can be generalized to arbitrary implicit functions, although the generalized adjoint method is typically more effective for automatic differentiation. To showcase the relative advantages and limitations of the two methods, their application is demonstrated on a suite of common implicit functions.
Derivative-based algorithms are ubiquitous in statistics, machine learning, and applied mathematics. Automatic differentiation offers an algorithmic way to efficiently evaluate these derivatives from computer programs that execute relevant functions. Implementing automatic differentiation for programs that incorporate implicit functions, such as the solution to an algebraic or differential equation, however, requires particular care. Contemporary applications typically appeal to either the application of the implicit function theorem or, in certain circumstances, specialized adjoint methods. In this paper we show that both of these approaches can be generalized to any implicit function, although the generalized adjoint method is typically more effective for automatic differentiation. To showcase the relative advantages and limitations of the two methods we demonstrate their application on a suite of common implicit functions.
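For intuition, a minimal NumPy sketch (my own, not the paper's implementation) of the implicit-function-theorem approach for a scalar root: if x*(θ) solves f(θ, x) = 0, differentiating the identity f(θ, x*(θ)) = 0 gives dx*/dθ = -(∂f/∂x)⁻¹ ∂f/∂θ, which the code checks against finite differences.

```python
import numpy as np

def f(theta, x):
    # Implicit relation: x*(theta) is defined by f(theta, x) = 0.
    return x**3 + theta * x - 1.0

def solve_root(theta, x=1.0, iters=50):
    # Newton's method for the root x*(theta).
    for _ in range(iters):
        x -= f(theta, x) / (3 * x**2 + theta)
    return x

def implicit_grad(theta):
    # Implicit function theorem: dx*/dtheta = -f_theta / f_x at the root.
    x = solve_root(theta)
    f_x = 3 * x**2 + theta   # partial f / partial x
    f_theta = x              # partial f / partial theta
    return -f_theta / f_x

theta, eps = 2.0, 1e-6
fd = (solve_root(theta + eps) - solve_root(theta - eps)) / (2 * eps)
print(implicit_grad(theta), fd)  # the two values should agree closely
```

The adjoint-method alternative discussed in the paper avoids forming and inverting ∂f/∂x explicitly, which is what makes it the more effective choice for high-dimensional automatic differentiation.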
5、[CL] LINDA: Unsupervised Learning to Interpolate in Natural Language Processing
Y Kim, S Jeong, K Cho
[Hyundai Motor Company & New York University]
LINDA: unsupervised learning to interpolate in natural language processing. Despite the success of mixup in data augmentation, its applicability to natural language processing (NLP) tasks has been limited by the discrete and variable-length nature of natural language. Recent studies have therefore relied on domain-specific heuristics and manually crafted resources, such as dictionaries, to apply mixup in NLP. This paper proposes an unsupervised learning approach to text interpolation for data augmentation, "Learning to INterpolate for Data Augmentation" (LINDA), which requires no heuristics or manually crafted resources but instead learns to interpolate between any pair of natural language sentences over a natural language manifold. Experiments demonstrate LINDA's interpolation capability and show that it allows mixup to be applied seamlessly in NLP, leading to better generalization in both in-domain and out-of-domain text classification.
Despite the success of mixup in data augmentation, its applicability to natural language processing (NLP) tasks has been limited due to the discrete and variable-length nature of natural languages. Recent studies have thus relied on domain-specific heuristics and manually crafted resources, such as dictionaries, in order to apply mixup in NLP. In this paper, we instead propose an unsupervised learning approach to text interpolation for the purpose of data augmentation, to which we refer as "Learning to INterpolate for Data Augmentation" (LINDA), that does not require any heuristics nor manually crafted resources but learns to interpolate between any pair of natural language sentences over a natural language manifold. After empirically demonstrating LINDA's interpolation capability, we show that LINDA indeed allows us to seamlessly apply mixup in NLP and leads to better generalization in text classification both in-domain and out-of-domain.
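For context, vanilla mixup is trivial on fixed-size continuous inputs; the sketch below (my own, with toy embeddings) shows that baseline, which variable-length discrete text breaks and which LINDA's learned interpolation is designed to recover.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(emb_a, emb_b, label_a, label_b, alpha=0.2):
    """Vanilla mixup on two sentence embeddings and their one-hot labels.

    This only works because both inputs live in the same continuous,
    fixed-dimensional space -- exactly what raw, variable-length token
    sequences lack, which is the gap LINDA's learned interpolation fills.
    """
    lam = rng.beta(alpha, alpha)
    return lam * emb_a + (1 - lam) * emb_b, lam * label_a + (1 - lam) * label_b

# Toy 128-d "sentence embeddings" and two-class one-hot labels.
emb_a, emb_b = rng.normal(size=128), rng.normal(size=128)
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(emb_a, emb_b, y_a, y_b)
print(x_mix.shape, y_mix)  # (128,) and a soft label between the two classes
```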
A few more papers worth noting:
[CV] AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
AdaFocus V2: end-to-end training of spatial dynamic networks for video recognition
Y Wang, Y Yue, Y Lin, H Jiang, Z Lai, V Kulikov, N Orlov, H Shi, G Huang
[Tsinghua University & University of Washington & CMU & Picsart AI Research (PAIR)]
[CV] SurFit: Learning to Fit Surfaces Improves Few Shot Learning on Point Clouds
SurFit: learning to fit surfaces improves few-shot learning on point clouds
G Sharma, B Dash, M Gadelha, A RoyChowdhury, M Loizou, E Kalogerakis, L Cao, E Learned-Miller, R Wang, S Maji
[University of Massachusetts Amherst & Adobe & University of Cyprus]
[CV] Human View Synthesis using a Single Sparse RGB-D Input
Human view synthesis from a single sparse RGB-D input
P Nguyen, N Sarafianos, C Lassner, J Heikkila, T Tung
[University of Oulu & Reality Labs Research, Sausalito]
[CV] SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
SPViT: enabling faster vision Transformers via soft token pruning
Z Kong, P Dong, X Ma, X Meng, W Niu, M Sun, B Ren, M Qin, H Tang, Y Wang
[Northeastern University & College of William and Mary & Peking University & ETH Zurich]