The image and ground truth dataset of Mongolian movable-type newspapers for text recognition


Research field: Artificial Intelligence



Venue: International Journal on Document Analysis and Recognition (IJDAR)


Abstract

OCR approaches have advanced considerably in recent years thanks to the resurgence of deep learning. However, to the best of our knowledge, there is little work on Mongolian movable-type document recognition. One major hurdle is the lack of a well-labeled, domain-specific dataset for training robust models. This paper aims to create the first Mongolian movable-type text-image dataset for OCR research. We collated 771 paragraph-level pages segmented from 34 newspapers published between 1947 and 1952. For each page, word- and line-level text transcriptions and boundary annotations are recorded. The dataset contains 86,578 word occurrences and 9,711 text-line images in total, with a vocabulary of 7,964 words. It was built from scratch through image collection, text transcription, text-image alignment, and manual correction. Moreover, an official train/test partition is defined, on which typical text segmentation and recognition experiments are run to establish strong baselines. The dataset is available for research, and we encourage researchers to develop and test new methods with it.


Author Profile
Min Lu

School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, Inner Mongolia, People’s Republic of China

China
Feilong Bao

College of Computer Science, Inner Mongolia University, Hohhot 010021, Inner Mongolia, People’s Republic of China

China
Hui Zhang

National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Hohhot 010021, Inner Mongolia, People’s Republic of China

China

📄 Paper information

Publication year: 2023
Citations: 2
Country of publication: China
Site: Springer
