- Introduction: Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite promising performance, these descriptions are derived from image annotations, which are often coarse-grained. Moreover, the instructions may even contradict the visual content, since they are produced without observing the full visual context. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instruction data can significantly improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.
- Problem addressed: The paper tackles the coarse-grained image annotations underlying existing visual instruction tuning methods and introduces a fine-grained visual instruction dataset to improve the performance of large multimodal models.
- Key idea: Prompt the powerful GPT-4V with images from LVIS to produce visually aligned and context-aware instructions, which in turn improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model (see the prompting sketch after this list).
- Other highlights: The paper introduces a new fine-grained visual instruction dataset, LVIS-Instruct4V, containing 220K visually aligned and context-aware instructions. Experimental validation and case studies demonstrate that training on this dataset improves LLaVA-1.5 across a wide spectrum of benchmarks by clear margins (see the data-format sketch after this list). The data and model are released on GitHub, and the authors suggest that future work can explore using the dataset with other multimodal models and tasks.
- Related research: Prior work in this area includes visual instruction tuning methods that prompt large language models with textual descriptions to generate instruction-following data, as well as related multimodal models and benchmarks such as LLaVA and MM-Vet.
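As a rough illustration of the data-generation recipe summarized above, the sketch below shows how one might prompt GPT-4V with a local LVIS image through the OpenAI chat API to elicit fine-grained, visually grounded question-answer pairs. The prompt wording, model name, and output handling are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: prompting GPT-4V with an image to generate instruction-following
# data in the spirit of LVIS-Instruct4V. The prompt text, model name, and output
# handling are illustrative assumptions, not the paper's exact pipeline.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_instructions(image_path: str) -> str:
    # Encode the local LVIS image as a base64 data URL so it can be sent inline.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # hypothetical choice of GPT-4V endpoint
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Look carefully at the image and write several "
                            "question-answer pairs about its fine-grained visual "
                            "details. Every answer must be grounded in what is "
                            "actually visible in the image."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=1024,
    )
    # Return the raw generated Q&A text; parsing it into the final instruction
    # format would happen downstream.
    return response.choices[0].message.content
```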
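The headline comparison in the abstract comes from swapping the instruction-tuning data rather than changing the model. Assuming the released JSON follows an LLaVA-style conversation schema (an assumption; consult the LVIS-INSTRUCT4V repository for the actual file name and fields), a quick sanity check of the records before plugging them into LLaVA-1.5's fine-tuning stage might look like this:

```python
# Sketch: load the released instruction file and inspect one record.
# The filename and field names assume an LLaVA-style conversation format;
# check the LVIS-INSTRUCT4V repository for the real schema.
import json

# Hypothetical filename; the actual release may differ.
with open("lvis_instruct4v_220k.json", "r", encoding="utf-8") as f:
    records = json.load(f)

print(f"{len(records)} instruction records loaded")

sample = records[0]
# An LLaVA-style record typically pairs an image with a multi-turn conversation.
print("image:", sample.get("image"))
for turn in sample.get("conversations", []):
    print(f"[{turn['from']}] {turn['value'][:80]}")
```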