[LG] Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
A Baevski, A Babu, W-N Hsu, M Auli
[Meta AI]

Key points on efficient self-supervised learning with contextualized target representations for vision, speech and language:

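  1. Improves the compute efficiency of self-supervised learning by using a fast convolutional decoder, not encoding masked tokens, and reusing target representations;
  2. Experiments on image classification, speech recognition and natural language understanding show pre-training that is 2-16x faster at similar accuracy.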
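
The efficiency argument in point 1 can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only: the function name `relative_cost_per_view`, the assumption that encoder cost is linear in the number of tokens, and every numeric input are hypothetical and not taken from the paper.

```python
def relative_cost_per_view(mask_ratio: float, num_masks: int,
                           teacher_cost: float = 1.0,
                           student_full_cost: float = 1.0) -> float:
    """Estimated compute per masked view, in units of one full forward pass.

    Assumptions (hypothetical, for illustration only):
      * encoder cost grows linearly with the number of tokens it processes;
      * the teacher encodes the full, unmasked sample once;
      * that single teacher pass is shared by `num_masks` masked views;
      * decoder and backward-pass costs are ignored.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    if num_masks < 1:
        raise ValueError("num_masks must be at least 1")
    student_cost = (1.0 - mask_ratio) * student_full_cost  # visible tokens only
    amortized_teacher = teacher_cost / num_masks            # one target, many views
    return student_cost + amortized_teacher


# Encode every token and build a fresh target for each view.
baseline = relative_cost_per_view(mask_ratio=0.0, num_masks=1)
# Skip masked tokens and share one target across four views (made-up settings).
efficient = relative_cost_per_view(mask_ratio=0.75, num_masks=4)
print(baseline, efficient)
```

Under these toy assumptions, both terms shrink independently: skipping masked tokens reduces the student cost, and reusing the target reduces the teacher cost per view. The speedups reported in the paper are measured wall-clock results and should not be expected to match this model.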


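Summary: Current self-supervised learning algorithms tend to be modality-specific and demand substantial compute. To address this, the paper improves the training efficiency of data2vec, a learning objective that generalizes across modalities, in three ways: masked tokens are not encoded, a fast convolutional decoder is used, and the cost of building teacher representations is amortized. data2vec 2.0 builds on the rich contextualized target representations introduced in data2vec, which make a fast self-supervised learner possible. On ImageNet-1K image classification, data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x less pre-training time; on Librispeech speech recognition it matches wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy, a ViT-L model trained for 150 epochs reaches 86.8% top-1 accuracy on ImageNet-1K.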

Paper: http://aicoco.net/s/data2vec20

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
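
The data flow described in the abstract (one teacher pass over the full sample, several masked views that reuse it, and a student that never encodes masked tokens) can be sketched with toy stand-ins. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_encode`, `toy_decode` and `training_step` are hypothetical names, the "encoder" is a simple averaging function rather than a Transformer, and the "decoder" is a mean fill rather than a convolutional network.

```python
import random


def toy_encode(tokens):
    """Stand-in encoder: mixes each token with the mean of its context."""
    mean = sum(tokens) / len(tokens)
    return [0.5 * t + 0.5 * mean for t in tokens]


def toy_decode(visible_repr, visible_idx, length):
    """Stand-in decoder: fills masked positions with the mean visible representation."""
    fill = sum(visible_repr) / len(visible_repr)
    out = [fill] * length
    for i, r in zip(visible_idx, visible_repr):
        out[i] = r
    return out


def training_step(sample, num_masks=2, mask_ratio=0.5, rng=None):
    """One toy step over a sample of at least two tokens; returns per-view losses
    and simple bookkeeping about how much was encoded."""
    rng = rng or random.Random(0)
    n = len(sample)
    # Teacher sees the full, unmasked sample exactly once.
    targets = toy_encode(sample)
    teacher_passes = 1
    # Keep at least one masked and one visible token.
    num_masked = min(n - 1, max(1, int(n * mask_ratio)))
    losses, encoded_counts = [], []
    for _ in range(num_masks):
        masked_idx = set(rng.sample(range(n), num_masked))
        visible_idx = [i for i in range(n) if i not in masked_idx]
        # Student encodes only the visible tokens; masked ones are skipped entirely.
        visible_repr = toy_encode([sample[i] for i in visible_idx])
        pred = toy_decode(visible_repr, visible_idx, n)
        # Regress the shared teacher targets at the masked positions.
        loss = sum((pred[i] - targets[i]) ** 2 for i in masked_idx) / len(masked_idx)
        losses.append(loss)
        encoded_counts.append(len(visible_idx))
    return losses, encoded_counts, teacher_passes


losses, encoded_counts, teacher_passes = training_step(
    [float(i) for i in range(8)], num_masks=2, mask_ratio=0.5
)
print(losses, encoded_counts, teacher_passes)
```

The bookkeeping values show the two savings side by side: each view encodes only the visible tokens, while the teacher pass count stays at one regardless of `num_masks`.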



 
