LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

July 16, 2024
  • Introduction
    Existing methods enhance open-vocabulary object detection by leveraging the strong open-vocabulary recognition capabilities of Vision-Language Models (VLMs) such as CLIP. However, two main challenges arise: (1) insufficient concept representation, where category names in CLIP's text space lack textual and visual knowledge; and (2) a tendency to overfit to base categories, with open-vocabulary knowledge biased toward base categories during transfer from the VLM to the detector. To address these challenges, we propose the Language Model Instruction (LaMI) strategy, which leverages relationships between visual concepts and applies them within a simple yet effective DETR-like detector, termed LaMI-DETR. LaMI uses GPT to construct visual concepts and employs T5 to investigate visual similarities across categories. These inter-category relationships refine concept representation and help avoid overfitting to base categories. Comprehensive experiments validate that our approach outperforms existing methods under the same rigorous setting, without relying on external training resources. LaMI-DETR achieves a rare box AP of 43.4 on OV-LVIS, surpassing the previous best by 7.8 rare box AP.
  • Problem Addressed
    LaMI-DETR: Leveraging Relationships Between Visual Concepts for Open-Vocabulary Object Detection
  • Key Idea
    The Language Model Instruction (LaMI) strategy is proposed to refine concept representation and avoid overfitting to base categories by leveraging relationships between visual concepts and applying them within a simple yet effective DETR-like detector.
  • Other Highlights
    LaMI utilizes GPT to construct visual concepts and employs T5 to investigate visual similarities across categories. Comprehensive experiments validate LaMI-DETR's superior performance over existing methods in the same rigorous setting without reliance on external training resources, achieving a rare box AP of 43.4 on OV-LVIS, surpassing the previous best by 7.8 rare box AP.
  • Related Work
    Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. Other related studies in this field include "Unbiased Scene Graph Generation from Biased Training" and "Learning to Learn from Web Data through Deep Semantic Embeddings".
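The category-similarity idea above can be sketched in a few lines: embed each category's GPT-style visual description with a text encoder such as T5, then compare embeddings by cosine similarity to find visually confusable pairs. Below is a minimal, self-contained sketch; the toy vectors stand in for real T5 embeddings, and the category names, vectors, and threshold are illustrative assumptions, not values from the paper.

```python
import math

# Toy stand-ins for T5 embeddings of GPT-generated visual concept
# descriptions. In LaMI-DETR these would come from encoding full
# descriptions; the vectors here are illustrative only.
concepts = {
    "lynx":   [0.90, 0.10, 0.00],
    "bobcat": [0.85, 0.15, 0.05],
    "banjo":  [0.00, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def visually_similar(name_a, name_b, threshold=0.9):
    """Flag category pairs whose concept embeddings are close,
    i.e. pairs a detector could easily confuse (threshold is a
    hypothetical choice)."""
    return cosine(concepts[name_a], concepts[name_b]) >= threshold

print(visually_similar("lynx", "bobcat"))  # visually confusable pair
print(visually_similar("lynx", "banjo"))   # visually distinct pair
```

Pairs flagged this way give the detector a relational signal beyond bare CLIP category names, which is the gap the concept-representation challenge describes.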
