Meta AI | VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

VoLTA is a new vision-language Transformer paradigm that achieves fine-grained, region-level image understanding without expensive box annotations.

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

S Pramanick, L Jing, S Nag, J Zhu, H J Shah, Y LeCun, R Chellappa
[Meta AI]


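Key points:

VoLTA is a unified VLP paradigm that uses only image-caption data and weakly-supervised patch-token alignment to achieve fine-grained, region-level image understanding, eliminating the need for expensive box annotations;
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific Transformer layers, reducing memory requirements;
VoLTA proves effective on a wide range of vision and vision-language downstream tasks, often outperforming methods that use significantly more caption and box annotations.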

https://openreview.net/forum?id=26aAV_wjoc

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.
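The abstract describes aligning local image patches with text tokens through optimal transport. As a rough illustration of that idea only, the sketch below computes an entropy-regularized transport plan between a handful of patch and token embeddings using Sinkhorn iterations. It is a simplification, not the paper's method: it omits the graph-based formulation named in the abstract, and the function names, the cosine cost, the uniform marginals, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def cosine_cost(patches, tokens):
    """Return C where C[i][j] = 1 - cosine similarity of patch i and token j.

    Assumes every embedding is a non-zero vector.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    return [
        [1.0 - sum(a * b for a, b in zip(p, t)) / (norm(p) * norm(t)) for t in tokens]
        for p in patches
    ]


def sinkhorn_plan(cost, eps=0.5, n_iters=100):
    """Entropy-regularized optimal transport with uniform marginals.

    eps and n_iters are illustrative values, not settings from the paper.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass carried by each patch (assumed uniform)
    b = [1.0 / m] * m  # mass carried by each token (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(cost, plan):
    """Total cost of moving mass according to the plan (a scalar alignment loss)."""
    return sum(c * p for c_row, p_row in zip(cost, plan) for c, p in zip(c_row, p_row))


# Toy data: three "patch" embeddings and two "token" embeddings (hypothetical).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
tokens = [[1.0, 0.0], [0.0, 1.0]]

cost = cosine_cost(patches, tokens)
plan = sinkhorn_plan(cost)
alignment = [max(range(len(tokens)), key=lambda j: row[j]) for row in plan]

print("alignment loss:", transport_cost(cost, plan))
print("best token per patch:", alignment)
```

The plan is a soft matching: each row shows how one patch's mass is spread over the tokens, which is what makes this kind of criterion inspectable. In a training setting the transport cost would act as a differentiable loss on real embeddings; here it is only evaluated on fixed toy vectors.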
