Computational Modelling of an Optical Character Recognition System for Yorùbá Printed Text Images

Óní, Ọlálékan; Asahiah, Franklin

Files

Computational_modelling_of_an_optical_character_recognition_system.pdf (6.36 MB)

Date

2020-07

Authors

Óní, Ọlálékan

Asahiah, Franklin

Publisher

Elsevier

Abstract

This study acquired a dataset of scanned images of Standard Yorùbá printed text and formulated a Yorùbá character image recognition model. The model was implemented and its performance evaluated, with the aim of developing an Optical Character Recognition (OCR) model for Yorùbá printed text images. The image dataset, at 300 dots per inch (dpi), was acquired by generating text-line images from the Yorùbá New Testament Bible (Bibeli Mimo) corpus, encoded as Unicode UTF-8. A Long Short-Term Memory (LSTM) network, a variant of the Recurrent Neural Network (RNN), was used to formulate the Standard Yorùbá character image recognition model. The model was implemented with the Python OCRopus framework. Its performance was evaluated using the Character Error Rate (CER), computed with the Levenshtein edit distance algorithm. The results show a CER of 3.138% for the Times New Roman font, the best recognition among the fonts tested. The model achieved a CER of 7.435% on the DejaVuSans font image dataset and 15.141% on the Arial font image dataset. Introducing a language-model-based Standard Yorùbá spell-checker and corrector as a post-processing stage reduced the CER: to 1.182% for Times New Roman, 4.098% for DejaVuSans, and 5.87% for Arial. The study concluded that the farther the font of an input text image is from the font(s) used to train the network, the higher the character error rate of the recognition, and that including a post-processing stage reduces the character error rate.
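The evaluation metric named in the abstract can be illustrated with a minimal sketch. It assumes CER is the Levenshtein edit distance between the reference text and the OCR output, divided by the reference length and expressed as a percentage. The function names and the NFC normalisation step are illustrative assumptions, not details taken from the paper or from OCRopus.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i characters of ref
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent: edit distance divided by reference length.

    Assumption: both strings are normalised to Unicode NFC first, so that
    precomposed and decomposed forms of the same Yorùbá letter compare equal.
    """
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical OCR output that dropped both tone marks.
    print(f"{character_error_rate('Yorùbá', 'Yoruba'):.2f}% CER")
```

Under this definition, an OCR output that drops tone marks or under-dots is penalised one substitution per affected letter, which is why diacritic handling matters for Yorùbá text.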

Description

Scientific African Volume 9, September 2020, e00415

Keywords

STEM, Obafemi Awolowo University, Optical character, Yorùbá, Orthography, Computational modelling, Spell-Check, correction, OCRopus

Citation

ONI, O. J., & ASAHIAH, F. O. (2020). Computational modelling of an optical character recognition system for Yorùbá printed text images. Scientific African, 9, e00415.

URI

10.1016/j.sciaf.2020.e00415
http://hdl.handle.net/123456789/1973

Collections

STEM
