Parrot: Multilingual Visual Instruction Tuning

Abstract

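This paper introduces Parrot, a new method that uses textual guidance to drive visual token alignment at the language level, improving how multimodal large language models (MLLMs) handle multiple languages. Existing methods mainly align the vision encoder with the LLM through supervised fine-tuning (SFT), and as training proceeds the MLLM's inherent ability to respond in multiple languages progressively deteriorates. We find empirically that imbalanced SFT datasets, which consist mostly of English-centric image-text pairs, lead to significantly lower performance in non-English languages. The cause is a failure to align the vision encoder and the LLM with multilingual tokens during SFT. Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens: it computes cross-attention with the text embeddings to select the most relevant experts, which then convert the initial visual tokens into language-specific visual tokens. In addition, because the field currently lacks benchmarks for evaluating multilingual capabilities, we collect and release a Massive Multilingual Multimodal Benchmark, named MMMB, which covers 6 languages, 15 categories, and 12,000 questions. Our method achieves state-of-the-art performance on multilingual MMBench and MMMB and also performs well across a broad range of multimodal tasks. The source code and training dataset of Parrot will be made publicly available.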
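Problem addressed

MLLMs aligned through SFT on English-centric image-text data progressively lose their ability to respond in other languages. Parrot targets this gap by using textual guidance to align visual tokens with multilingual input.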
Key idea

Parrot uses textual guidance to drive visual token alignment at the language level. To improve visual token alignment for non-English languages, it relies on a Mixture-of-Experts (MoE) module that promotes the alignment of multilingual tokens.
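
The routing described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names (`cross_attention`, `language_moe`, `random_matrix`), the mean-pooled visual query, the linear experts, the top-k gating, the residual connection, and the random weights are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def cross_attention(query, text_emb):
    """Attend from one visual query vector over the text token embeddings."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in text_emb]
    weights = softmax(scores)
    return [sum(w * tok[j] for w, tok in zip(weights, text_emb))
            for j in range(len(query))]


def language_moe(visual_tokens, text_emb, router_w, experts, top_k=2):
    """Turn visual tokens into text-conditioned tokens via a gated expert mix."""
    n, dim = len(visual_tokens), len(visual_tokens[0])
    # Assumption: the attention query is the mean of the visual tokens.
    query = [sum(tok[j] for tok in visual_tokens) / n for j in range(dim)]
    guided = cross_attention(query, text_emb)

    # Router: one logit per expert, computed from the text-guided feature.
    probs = softmax(matvec(router_w, guided))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    gates = {i: probs[i] / norm for i in chosen}  # renormalised top-k weights

    out = []
    for tok in visual_tokens:
        mixed = [0.0] * dim
        for i, gate in gates.items():
            # Assumption: each expert is a single linear map (dim x dim).
            expert_out = matvec(experts[i], tok)
            mixed = [m + gate * v for m, v in zip(mixed, expert_out)]
        # Assumption: residual connection keeps the original visual signal.
        out.append([t + m for t, m in zip(tok, mixed)])
    return out, gates


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, n_experts = 4, 3
visual = random_matrix(5, dim, rng)   # 5 visual tokens
text = random_matrix(3, dim, rng)     # 3 text tokens in the prompt language
router_w = random_matrix(n_experts, dim, rng)
experts = [random_matrix(dim, dim, rng) for _ in range(n_experts)]
tokens, gates = language_moe(visual, text, router_w, experts)
```

Because the gate weights depend only on the text-guided feature, the same image can be routed to different experts when the prompt language changes, which is the behaviour the summary attributes to Parrot.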
Other highlights

The paper introduces a Massive Multilingual Multimodal Benchmark (MMMB), which includes 6 languages, 15 categories, and 12,000 questions. Parrot demonstrates state-of-the-art performance on multilingual MMBench and MMMB, as well as across a broad range of multimodal tasks. The authors state that the source code and training dataset of Parrot will be made publicly available.
Related work

Related work mainly aligns vision encoders with LLMs through supervised fine-tuning (SFT). The paper attributes the drop in non-English performance to a failure to align the vision encoder and the LLM with multilingual tokens during SFT.
