ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

Abstract

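Visual Question Answering (VQA) is a complex task that requires processing natural language and images simultaneously. Early research on the task focused on methods that help machines understand the objects and scene context in an image. However, text that appears in an image often carries explicit information about the image's overall content, and that early work did not address it. As AI has continued to develop, many studies worldwide have examined the reading comprehension ability of VQA models. In Vietnam, a developing country where resources are still limited, the task remains open. We therefore introduce the first large-scale Vietnamese dataset dedicated to understanding text that appears in images, which we call ViTextVQA (Vietnamese Text-based Visual Question Answering dataset). It contains over 16,000 images and over 50,000 questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available for research purposes at https://github.com/minhquan6203/ViTextVQA-Dataset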
Problem Addressed

The ViTextVQA dataset addresses the problem of understanding Vietnamese text that appears in images, a task that remains open in Vietnam.
Key Idea

The order in which OCR tokens are processed and selected when formulating an answer turns out to matter. Accounting for it significantly improves the performance of baseline models on the ViTextVQA dataset.
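To make the idea of OCR token order concrete, here is a minimal sketch of one common way to arrange OCR tokens in reading order: group tokens into lines by vertical position, then read each line left to right. This is an illustration under stated assumptions, not the ordering strategy used by the ViTextVQA authors, which this summary does not describe. The `OcrToken` class, the `reading_order` function, and the `line_tolerance` parameter are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class OcrToken:
    """Hypothetical OCR token: recognized text plus its bounding box."""
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box
    h: float  # height of the bounding box


def reading_order(tokens, line_tolerance=0.5):
    """Sort tokens top-to-bottom by line, then left-to-right within a line.

    Two tokens share a line when their vertical centers differ by at most
    line_tolerance times the height of the first token on that line.
    This heuristic is an assumption made for illustration only.
    """
    if not tokens:
        return []

    def center(tok):
        return tok.y + tok.h / 2

    by_height = sorted(tokens, key=center)
    lines = [[by_height[0]]]
    for tok in by_height[1:]:
        first = lines[-1][0]
        if abs(center(tok) - center(first)) <= line_tolerance * first.h:
            lines[-1].append(tok)  # same text line
        else:
            lines.append([tok])    # start a new text line
    return [tok for line in lines for tok in sorted(line, key=lambda t: t.x)]


# Tokens from a two-line sign, given in scrambled order.
scrambled = [
    OcrToken("Nội", 60, 11, 10),
    OcrToken("bò", 50, 41, 10),
    OcrToken("Hà", 10, 10, 10),
    OcrToken("phở", 10, 40, 10),
]
ordered_text = [t.text for t in reading_order(scrambled)]
```

A model that consumes `ordered_text` sees the tokens in the sequence a human reader would, rather than in whatever order the OCR engine emitted them. Real scenes with rotated or multi-column text would need a more careful grouping step.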
Other Highlights

The ViTextVQA dataset contains over 16,000 images and over 50,000 questions with answers. The authors run meticulous experiments with various state-of-the-art models. The dataset is available for research purposes at https://github.com/minhquan6203/ViTextVQA-Dataset
Related Work

Related work in this field includes VQA research on methods that help machines understand objects and scene context in images, as well as studies of the reading comprehension ability of VQA models.
