Comparison Visual Instruction Tuning

Abstract

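Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for generating detailed and contextually relevant descriptions, performing comparative analysis, detecting novelty, and making informed decisions based on visual data. Surprisingly, however, these fundamental concepts have received little attention in Large Multimodal Models (LMMs), currently the best available imitation of human visual intelligence. We develop and contribute CaD-VI, a new two-phase approach for collecting synthetic visual instructions, together with CaD-Inst, an instruction dataset containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities of LMMs, advancing the state of the art on a wide range of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, increasing their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended questions and answers to assess the CaD understanding abilities of LMMs.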
Problem addressed

CaD-VI: Collecting Synthetic Visual Instructions for Comparing Images in Large Multimodal Models
Key idea

The paper proposes CaD-VI, a two-phase approach for collecting synthetic visual instructions that improves the CaD spotting capabilities of Large Multimodal Models (LMMs) and advances the state of the art on related tasks by up to 17.5%.
Other highlights

The paper introduces the CaD-Inst dataset containing 349K image pairs with CaD instructions collected using CaD-VI. The proposed approach is complementary to existing difference-only instruction datasets and can increase their effectiveness for CaD tuning by up to 10%. The paper also proposes an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.
Related work

Related research in the field includes difference-only instruction datasets and visual reasoning tasks such as VQA and CLEVR.
