- Introduction: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel at transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that requires handling the complexity of various spatial hierarchies and semantic granularities. Florence-2 is designed to take text prompts as task instructions and generate the desired results in text form, whether it be captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, comprising 5.4 billion comprehensive visual annotations on 126 million images, built using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrate that Florence-2 is a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
- Problem addressed: The paper introduces Florence-2, a vision foundation model that can perform a variety of computer vision and vision-language tasks from simple instructions.
- Key idea: Florence-2 uses a unified, prompt-based representation to generate the desired results in text form for tasks such as captioning, object detection, grounding, and segmentation. It was trained with a sequence-to-sequence structure on FLD-5B, a large-scale, high-quality annotated dataset (see the usage sketch below).
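To make the prompt-as-instruction interface concrete, here is a minimal usage sketch. It assumes the publicly released Hugging Face checkpoint `microsoft/Florence-2-large` and its task-token prompts such as `<CAPTION>` and `<OD>`; these identifiers come from that release, not from this summary, and the image URL is a placeholder.

```python
# Minimal sketch of Florence-2's prompt-based, seq2seq interface.
# Assumes the public Hugging Face release (microsoft/Florence-2-large);
# task tokens like "<CAPTION>" and "<OD>" are taken from that release.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder URL; substitute any RGB image.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# The same model handles different tasks purely by changing the text prompt.
for task in ("<CAPTION>", "<OD>"):  # captioning, then object detection
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The processor parses the generated text (e.g. quantized box coordinates)
    # back into task-specific structures such as labels and bounding boxes.
    result = processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
    print(task, result)
```

The design point this illustrates: switching tasks changes only the prompt string, not the model, the weights, or the decoding loop.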
- Other highlights: FLD-5B consists of 5.4 billion comprehensive visual annotations on 126 million images and was created using an iterative strategy of automated image annotation and model refinement (schematized after this item). Florence-2 demonstrated strong zero-shot and fine-tuning capabilities in extensive evaluations across numerous tasks. The paper also underscores the importance of large-scale annotated datasets for training vision models.
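As a rough picture of what "automated image annotation and model refinement" could look like as a loop, here is a schematic sketch; every name in it (`annotate`, `retrain`, the confidence filter) is a hypothetical placeholder, not the paper's actual pipeline.

```python
# Schematic of an iterative annotate-filter-refine data engine in the
# spirit of FLD-5B's construction. All names here are hypothetical
# placeholders; the paper's actual pipeline details differ.
from typing import Callable, List, Tuple

ImageRecord = object                       # stand-in for an image record
Annotation = Tuple[ImageRecord, str, float]  # (image, label, confidence)
Annotator = Callable[[List[ImageRecord]], List[Annotation]]

def build_dataset(
    images: List[ImageRecord],
    annotate: Annotator,
    retrain: Callable[[List[Annotation]], Annotator],
    rounds: int = 3,
    min_conf: float = 0.8,
) -> List[Annotation]:
    """Alternate automated annotation with model refinement."""
    dataset: List[Annotation] = []
    for _ in range(rounds):
        candidates = annotate(images)
        # Keep only high-confidence annotations for training.
        dataset = [a for a in candidates if a[2] >= min_conf]
        # Refine the annotator on the filtered data, then loop:
        # the improved model re-annotates the full image pool.
        annotate = retrain(dataset)
    return dataset
```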
- Recent related research in this field includes the papers 'Unified Vision-Language Pre-Training for Image Captioning and VQA' by Li et al. and 'VisualBERT: A Simple and Performant Baseline for Vision and Language' by Li et al.