Learning Visual Representations via Language-Guided Sampling

M E Banani, K Desai, J Johnson
[University of Michigan]

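Key points:

  1. Language-guided sampling can improve visual representation learning;
  2. A pre-trained language model can be used to sample pairs with similar captions for contrastive learning;
  3. Language-guided learning can learn better features than image-based contrastive learning;
  4. For unlabeled datasets, nearest-neighbor instances and language-based sampling outperform other approaches.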

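One-sentence summary:
Using language similarity to sample semantically similar image pairs for contrastive learning yields better features than image-image and image-text representation learning methods.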

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. This happens because language abstracts away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach deviates from image-based contrastive learning by using language to sample pairs instead of hand-crafted augmentations or learned clusters. Our approach also deviates from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than optimizing a cross-modal similarity objective. Through a series of experiments, we show that language-guided learning can learn better features than both image-image and image-text representation learning approaches.
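
The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed_caption` is a toy stand-in for a pre-trained language model, and all function names and captions here are hypothetical. Each image is paired with the image whose caption is most similar, and the resulting pairs would then serve as positives in an image-only contrastive loss.

```python
from __future__ import annotations

import math
from collections import Counter


def embed_caption(caption: str) -> Counter:
    """Toy stand-in (assumption) for a pre-trained sentence encoder:
    a bag-of-words count vector over lowercase tokens."""
    return Counter(caption.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sample_language_guided_pairs(captions: list[str]) -> list[tuple[int, int]]:
    """For each image index i, return (i, j) where j is the other image
    whose caption is the nearest neighbor of caption i in language space."""
    embeddings = [embed_caption(c) for c in captions]
    pairs = []
    for i, e_i in enumerate(embeddings):
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(e_i, embeddings[j]),
        )
        pairs.append((i, best_j))
    return pairs


# Hypothetical captions; index k stands for image k in a captioned dataset.
captions = [
    "a dog running on the beach",
    "a plate of pasta with tomato sauce",
    "a dog running through the park",
    "a bowl of pasta with cheese",
]
pairs = sample_language_guided_pairs(captions)
print(pairs)  # [(0, 2), (1, 3), (2, 0), (3, 1)]
```

Here the two dog images and the two pasta images end up paired even though their visual contexts differ, which is the kind of semantic pairing that hand-crafted augmentations of a single image cannot provide.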

Paper link: https://arxiv.org/abs/2302.12248