- Introduction: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel at transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that requires handling the complexity of various spatial hierarchies and semantic granularities. Florence-2 is designed to take text prompts as task instructions and generate the desired results in text form, whether it be captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, comprising 5.4 billion comprehensive visual annotations on 126 million images, built using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrate that Florence-2 is a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
- Problem addressed: The paper introduces Florence-2, a vision foundation model that can perform a variety of computer vision and vision-language tasks from simple instructions.
- Key idea: Florence-2 uses a unified, prompt-based representation to generate the desired results in text form for tasks such as captioning, object detection, grounding, and segmentation. It was trained with a sequence-to-sequence structure on FLD-5B, a large-scale, high-quality annotated dataset (see the usage sketch below).
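To make the prompt-as-instruction interface concrete, here is a minimal usage sketch. It assumes the publicly released Hugging Face checkpoint `microsoft/Florence-2-large` and its task-token prompts such as `<CAPTION>` and `<OD>`; these identifiers come from that release, not from this summary, and the image URL is a placeholder.

```python
# Minimal sketch of Florence-2's prompt-based, seq2seq interface.
# Assumes the public Hugging Face release (microsoft/Florence-2-large);
# task tokens like "<CAPTION>" and "<OD>" are taken from that release.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder URL; substitute any RGB image.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# The same model handles different tasks purely by changing the text prompt.
for task in ("<CAPTION>", "<OD>"):  # captioning, then object detection
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The processor parses the generated text (e.g. quantized box coordinates)
    # back into task-specific structures such as labels and bounding boxes.
    result = processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
    print(task, result)
```

The design point this illustrates: switching tasks changes only the prompt string, not the model, the weights, or the decoding loop.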
- Other highlights: FLD-5B consists of 5.4 billion comprehensive visual annotations on 126 million images and was created using an iterative strategy of automated image annotation and model refinement (schematized after this item). Florence-2 demonstrated strong zero-shot and fine-tuning capabilities in extensive evaluations across numerous tasks. The paper also underscores the importance of large-scale annotated datasets for training vision models.
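As a rough picture of what "automated image annotation and model refinement" could look like as a loop, here is a schematic sketch; every name in it (`annotate`, `retrain`, the confidence filter) is a hypothetical placeholder, not the paper's actual pipeline.

```python
# Schematic of an iterative annotate-filter-refine data engine in the
# spirit of FLD-5B's construction. All names here are hypothetical
# placeholders; the paper's actual pipeline details differ.
from typing import Callable, List, Tuple

ImageRecord = object                       # stand-in for an image record
Annotation = Tuple[ImageRecord, str, float]  # (image, label, confidence)
Annotator = Callable[[List[ImageRecord]], List[Annotation]]

def build_dataset(
    images: List[ImageRecord],
    annotate: Annotator,
    retrain: Callable[[List[Annotation]], Annotator],
    rounds: int = 3,
    min_conf: float = 0.8,
) -> List[Annotation]:
    """Alternate automated annotation with model refinement."""
    dataset: List[Annotation] = []
    for _ in range(rounds):
        candidates = annotate(images)
        # Keep only high-confidence annotations for training.
        dataset = [a for a in candidates if a[2] >= min_conf]
        # Refine the annotator on the filtered data, then loop:
        # the improved model re-annotates the full image pool.
        annotate = retrain(dataset)
    return dataset
```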
- Recent related research in this field includes the papers 'Unified Vision-Language Pre-Training for Image Captioning and VQA' by Li et al. and 'VisualBERT: A Simple and Performant Baseline for Vision and Language' by Li et al.