From today's 爱可可 AI Frontier picks
[CV] Unifying Vision, Text, and Layout for Universal Document Processing
Z Tang, Z Yang, G Wang, Y Fang, Y Liu, C Zhu, M Zeng, C Zhang, M Bansal
[Microsoft & University of North Carolina at Chapel Hill]
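Summary: The paper proposes Universal Document Processing (UDOP), a Document AI foundation model that unifies pretraining and multi-domain downstream tasks into a single prompt-based sequence generation scheme. Using both self-supervised and supervised tasks within a generative framework, it achieves state-of-the-art results on 9 Document AI tasks across diverse data domains and currently ranks first on the Document Understanding Benchmark leaderboard.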
Abstract: We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark (DUE).
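As a rough illustration of what a prompt-based sequence generation scheme over text and layout could look like, the sketch below flattens words and their bounding boxes into one token sequence placed after a task prompt. The `<loc_N>` token format, the layout vocabulary size of 500, the `Word` type, and both function names are assumptions made for this example, not details stated in the abstract. The image modality is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One OCR word with its bounding box (hypothetical type for this sketch)."""
    text: str
    # (x0, y0, x1, y1), each coordinate normalized to the range [0, 1]
    box: Tuple[float, float, float, float]


def layout_tokens(box: Tuple[float, float, float, float],
                  vocab_size: int = 500) -> List[str]:
    """Discretize normalized coordinates into layout-token strings.

    vocab_size and the "<loc_N>" spelling are assumptions for illustration.
    """
    tokens = []
    for coord in box:
        clamped = min(max(coord, 0.0), 1.0)           # keep inside [0, 1]
        index = min(int(clamped * vocab_size), vocab_size - 1)
        tokens.append(f"<loc_{index}>")
    return tokens


def build_sequence(task_prompt: str, words: List[Word],
                   vocab_size: int = 500) -> str:
    """Concatenate a task prompt, then each word followed by its layout tokens."""
    parts = [task_prompt]
    for word in words:
        parts.append(word.text)
        parts.extend(layout_tokens(word.box, vocab_size))
    return " ".join(parts)


if __name__ == "__main__":
    words = [Word("Total", (0.0, 0.0, 0.5, 0.25)),
             Word("42", (0.5, 0.0, 1.0, 0.25))]
    print(build_sequence("question answering.", words))
```

Putting text and positions in one flat sequence means a single sequence-to-sequence model can, in principle, read and emit both, which is the spirit of the "one uniform representation" described in the abstract. How UDOP actually encodes layout and fuses it with image features is not covered by this sketch.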