VoLTA is a new vision-language Transformer paradigm that achieves fine-grained, region-level image understanding without expensive box annotations.
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
S Pramanick, L Jing, S Nag, J Zhu, H J Shah, Y LeCun, R Chellappa
[Meta AI]
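Key points:
- VoLTA is a unified VLP paradigm that uses image-caption data and weakly-supervised patch-token alignment to achieve fine-grained, region-level image understanding, eliminating the need for expensive box annotations;
- VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific Transformer layers, reducing memory requirements;
- VoLTA is effective across a wide range of vision and vision-language downstream tasks, often outperforming methods that use significantly more caption and box annotations.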
https://openreview.net/forum?id=26aAV_wjoc
Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.
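To give a feel for what aligning local image patches with text tokens through optimal transport can look like, here is a minimal sketch. It is not the authors' implementation: it uses plain entropy-regularised optimal transport solved with Sinkhorn iterations as a simpler stand-in for the graph optimal transport named in the abstract. The cosine cost, the uniform marginals, and the names `cosine_cost`, `sinkhorn`, `alignment_loss`, `eps`, and `n_iters` are all illustrative assumptions.

```python
import math


def cosine_cost(patches, tokens):
    """Cost matrix C[i][j] = 1 - cosine similarity(patch i, token j)."""
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    p = [unit(vec) for vec in patches]
    t = [unit(vec) for vec in tokens]
    return [[1.0 - sum(a * b for a, b in zip(pi, tj)) for tj in t] for pi in p]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between two uniform distributions.

    Assumption: every patch and every token carries equal mass.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each patch
    b = [1.0 / m] * m  # mass on each token
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def alignment_loss(patches, tokens, eps=0.1):
    """Transport cost <plan, cost>: small when patches and tokens match up."""
    cost = cosine_cost(patches, tokens)
    plan = sinkhorn(cost, eps)
    loss = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )
    return loss, plan
```

Calling `alignment_loss(patch_embeddings, token_embeddings)` with two lists of equal-dimension vectors returns a scalar cost and the soft patch-to-token assignment. The plan is what makes such a criterion inspectable: each row shows how one patch's mass is spread over the caption tokens, with no box annotation involved.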


