NExT-Chat: An LMM for Chat, Detection and Segmentation

November 8, 2023
  • Overview
    The development of large language models (LLMs) has greatly advanced multimodal understanding and given rise to large multimodal models (LMMs). To strengthen visual comprehension, recent work equips LMMs with region-level understanding by representing object bounding-box coordinates as text token sequences (pixel2seq). This paper introduces a new paradigm for object location modeling, pixel2emb, in which the LMM is asked to output location embeddings that are then decoded by different decoders. This paradigm allows different location formats, such as bounding boxes and masks, to be used in multimodal conversations. Such embedding-based location modeling can also leverage existing practices from localization tasks such as detection and segmentation. Under fair comparison in resource-constrained settings, pixel2emb outperforms existing state-of-the-art (SOTA) methods on both location-input and location-output tasks. Using the proposed pixel2emb method, we train an LMM named NExT-Chat and demonstrate its ability to handle multiple tasks such as visual grounding, region captioning, and grounded reasoning.
  • Problem Addressed
    The paper aims to enhance visual comprehension in LMMs by introducing a novel paradigm for object location modeling called pixel2emb.
  • Key Idea
    The key idea is to have the LMM output location embeddings and then decode them with different decoders, allowing different location formats to be used in multimodal conversations. This embedding-based location modeling makes it possible to reuse existing practices from localization tasks, such as detection and segmentation (a minimal sketch of this decoding setup appears after this list).
  • Other Highlights
    The paper shows that the proposed pixel2emb method outperforms existing state-of-the-art approaches on both location-input and location-output tasks under fair comparison, particularly in resource-constrained settings. The paper also introduces an LMM named NExT-Chat that can handle multiple tasks such as visual grounding, region captioning, and grounded reasoning. The experiments cover a variety of datasets, and the paper provides open-source code for reproducibility.
  • Related Work
    Recent studies in this field include 'VisualBERT: A Simple and Performant Baseline for Vision and Language' and 'Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training'.
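
The decoding idea described under "Key Idea" (the LMM emits a location embedding, and small task-specific decoders turn that embedding into a bounding box or a segmentation mask) can be illustrated with a short PyTorch sketch. All module names, layer sizes, and decoder designs below are illustrative assumptions, not NExT-Chat's actual implementation.

```python
# Minimal sketch of embedding-based location modeling (pixel2emb-style):
# the LMM produces a hidden state for a special location token, and separate
# decoders map that embedding to a box or a mask. Shapes and modules are
# hypothetical stand-ins chosen only for illustration.
import torch
import torch.nn as nn


class BoxDecoder(nn.Module):
    """Regresses normalized (cx, cy, w, h) box coordinates from a location embedding."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 4),
        )

    def forward(self, loc_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(loc_emb).sigmoid()  # coordinates in [0, 1]


class MaskDecoder(nn.Module):
    """Produces a coarse per-pixel logit map by matching the projected location
    embedding against image features (a stand-in for a prompt-style mask head)."""

    def __init__(self, hidden_dim: int = 4096, feat_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, loc_emb: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, feat_dim, H, W); dot-product the projected embedding
        # with every spatial location to get mask logits of shape (B, H, W).
        q = self.proj(loc_emb)
        return torch.einsum("bc,bchw->bhw", q, img_feats)


if __name__ == "__main__":
    hidden_dim, feat_dim = 4096, 256
    loc_emb = torch.randn(2, hidden_dim)          # hidden states of the location tokens
    img_feats = torch.randn(2, feat_dim, 64, 64)  # features from the image encoder

    box = BoxDecoder(hidden_dim)(loc_emb)                         # (2, 4) boxes
    mask = MaskDecoder(hidden_dim, feat_dim)(loc_emb, img_feats)  # (2, 64, 64) logits
    print(box.shape, mask.shape)
```

Because the boxes and masks come from continuous embeddings rather than text tokens, the same decoders can be trained with standard detection and segmentation losses, which is the practice-reuse benefit the paper highlights.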