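- Introduction: We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, a machine learning dataset of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by the spatial polygonal coordinates of its text lines as well as basic page elements. The dataset is intended to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. Muharaf covers diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondence. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide preliminary baseline results obtained by training convolutional neural networks on this data.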
- Problem addressed: The Muharaf dataset was created to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general.
- Key idea: The Manuscripts of Handwritten Arabic (Muharaf) dataset is a machine learning dataset of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. It covers diverse handwriting styles and a wide range of document types. Each image is accompanied by the spatial polygonal coordinates of its text lines as well as basic page elements.
- Other highlights: The paper describes the data acquisition pipeline, notable dataset features, and statistics. It also reports preliminary baseline results obtained by training convolutional neural networks on this data. The authors position the dataset as a resource for advancing HTR for cursive text in general.
- Recent related studies in this field include 'A Survey on Arabic Handwriting Recognition Databases and Techniques' and 'A Survey on Deep Learning for Arabic Handwriting Recognition'.
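The summary notes that each page image comes with polygonal coordinates for its text lines. As a rough illustration of how such an annotation might be used, the sketch below derives an axis-aligned crop box and an area from one polygon. The point format (a list of `(x, y)` pixel pairs), the sample coordinates, and the helper names `bounding_box` and `polygon_area` are assumptions made for this example; they are not taken from the dataset's actual file format or tooling.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # assumed (x, y) pixel coordinates


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing the polygon."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical text-line polygon, made up for illustration only.
line_polygon = [(10.0, 20.0), (110.0, 18.0), (112.0, 48.0), (12.0, 52.0)]

box = bounding_box(line_polygon)
area = polygon_area(line_polygon)
print(box, area)
```

A box like this could be used to crop a line image before feeding it to a recognizer, while the polygon itself can mask out pixels from neighboring lines.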