[Meta AI]
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language. Key points:
- Improves the computational efficiency of self-supervised learning with a fast convolutional decoder, by not encoding masked tokens, and by reusing teacher target representations (see the sketch after this list);
- Experiments show 2-16x faster pre-training at similar accuracy on image classification, speech recognition, and natural language understanding.
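The first bullet compresses three concrete efficiency ideas from the abstract. Below is a minimal PyTorch sketch of how they could fit together; it is an illustration under stated assumptions, not the released implementation, and all names (ConvDecoder, student_step, train_batch, mask_token) are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Lightweight convolutional decoder that predicts teacher targets,
    standing in for a (much heavier) Transformer decoder."""
    def __init__(self, dim: int, layers: int = 4, kernel: int = 3):
        super().__init__()
        blocks = [nn.Sequential(nn.Conv1d(dim, dim, kernel, padding=kernel // 2),
                                nn.GELU()) for _ in range(layers)]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                       # x: (batch, seq, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

def student_step(student, decoder, tokens, mask, mask_token):
    """Encode ONLY the visible tokens (masked positions are dropped before
    the encoder, not encoded), then let the conv decoder fill them in."""
    b, t, d = tokens.shape
    visible = tokens[~mask].view(b, -1, d)      # assumes equal mask count per sample
    encoded = student(visible)                  # student never sees mask tokens
    full = mask_token.expand(b, t, d).clone()   # mask_token: (1, 1, dim)
    full[~mask] = encoded.reshape(-1, d)
    return decoder(full)                        # predictions at every position

def train_batch(teacher, student, decoder, tokens, masks, mask_token):
    """Amortize the teacher: build its targets once per sample and reuse
    them across M differently masked versions of that sample."""
    with torch.no_grad():
        targets = teacher(tokens)               # one teacher pass, M reuses
    loss = 0.0
    for mask in masks:                          # masks: list of (batch, seq) bools
        pred = student_step(student, decoder, tokens, mask, mask_token)
        loss = loss + ((pred[mask] - targets[mask]) ** 2).mean()
    return loss / len(masks)
```

Reusing one teacher forward pass across M masked versions is what amortizes the cost of building teacher representations; dropping masked tokens and using a convolutional decoder then shrink the student's per-version cost.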
Abstract: Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
Paper: http://aicoco.net/s/data2vec20
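For background on the "contextualized target representations" the abstract refers to: in data2vec, the teacher is an exponential moving average (EMA) of the student, and regression targets average the outputs of the top K Transformer blocks, followed by normalization. A hedged sketch of those two pieces (hyperparameter names such as tau and top_k are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, tau: float = 0.999):
    """Move each teacher parameter toward the student's: an EMA of weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)

@torch.no_grad()
def build_targets(layer_outputs, top_k: int = 8):
    """Average the top-K block outputs into one contextualized target per
    time step / patch; instance-norm is one common normalization choice."""
    stacked = torch.stack(layer_outputs[-top_k:])   # (K, batch, seq, dim)
    target = stacked.mean(dim=0)                    # (batch, seq, dim)
    return F.instance_norm(target.transpose(1, 2)).transpose(1, 2)
```

Because every target comes from a full, unmasked teacher pass, it encodes information from the whole input rather than a single patch or time step, which is what the abstract credits for enabling a fast self-supervised learner.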