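- Abstract: Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for generating detailed, contextually relevant descriptions, performing comparative analysis, detecting novelty, and making informed decisions based on visual data. Surprisingly, however, these fundamental concepts have received little attention in the best current mimic of human visual intelligence, Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach, CaD-VI, for collecting synthetic visual instructions, together with CaD-Inst, an instruction dataset containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities of LMMs, advancing the state of the art on a diverse set of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, enabling automatic targeted refinement of those resources that increases their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended questions and answers to assess the CaD understanding abilities of LMMs.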
- Problem addressed: CaD-VI: Collecting Synthetic Visual Instructions for Comparing Images in Large Multimodal Models
- Key idea: The paper proposes a two-phase approach, CaD-VI, for collecting synthetic visual instructions that improve the CaD spotting capabilities of Large Multimodal Models (LMMs), advancing the state of the art on related tasks by up to 17.5%.
- Other highlights: The paper introduces the CaD-Inst dataset, which contains 349K image pairs with CaD instructions collected using CaD-VI. The proposed approach is complementary to existing difference-only instruction datasets and can increase their effectiveness for CaD tuning by up to 10%. The paper also proposes an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.
- Related research in the field includes difference-only instruction datasets and visual reasoning tasks such as VQA and CLEVR.