An Experimental Technique for OCR Line and Word Segmentation using Probability Distribution Estimation
Rajan Goyal1, Rajesh Kumar Narula2, Manish Kumar3
1Rajan Goyal, Research Scholar, I.K. Gujral Punjab Technical University, Kapurthala (Punjab), India.
2Dr. Rajesh Kumar Narula, Assistant Professor, Department of Mathematics Science, I.K. Gujral Punjab Technical University, Kapurthala (Punjab), India.
3Dr. Manish Kumar Jindal, Professor, Punjab University Regional Centre, Muktsar (Punjab), India.
Manuscript received on 25 July 2019 | Revised Manuscript received on 03 August 2019 | Manuscript Published on 10 August 2019 | PP: 1484-1494 | Volume-8 Issue-2S3 July 2019 | Retrieval Number: B12730782S319/2019©BEIESP | DOI: 10.35940/ijrte.B1273.0782S319
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Segmentation is an important step in designing an Optical Character Recognition (OCR) system for any script. In this paper, we focus on line and word segmentation in typewritten Gurmukhi script documents. To perform this task, we adopt an OCR-based methodology consisting of several processing steps. Typewritten documents suffer from several issues, such as noise, skew, and poor document quality. We present a combined pre-processing scheme that performs document thresholding followed by skew detection and correction: the image is thresholded using Niblack’s method, and skew is corrected using a gradient histogram algorithm so that the text has a uniform orientation. A line segmentation scheme is then applied, in which a probability density function is used to generate the distribution of text in a probability map. Because assigning text to the correct line is a challenging task, we present a 2D Gaussian model that helps identify the text boundaries in the x and y directions. The proposed methodology is applied to typewritten Gurmukhi documents, and an experimental study is carried out to show that the proposed approach achieves better performance than existing techniques.
Keywords: OCR, Typewritten, Line Segmentation, Character Segmentation, Probability Density Function.
Scope of the Article: Cryptography and Applied Mathematics
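To make the pipeline in the abstract more concrete, the following is a minimal Python sketch of two of its stages: Niblack thresholding and density-based line segmentation. It is not the authors' implementation. The function names, the window size, the value of k, the smoothing width, and the cut-off fraction are all illustrative assumptions. The line-segmentation step is also a deliberate simplification: it smooths a 1D row-wise text density with a Gaussian kernel rather than fitting the 2D Gaussian model described in the abstract, and skew correction is omitted entirely.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def niblack_binarize(gray, window=15, k=-0.2):
    """Return a boolean mask that is True where a pixel is classified as text.

    Niblack's rule: T(x, y) = mean(x, y) + k * std(x, y) over a local window.
    `window` (must be odd) and `k` are illustrative, not values from the paper.
    """
    gray = np.asarray(gray, dtype=np.float64)
    pad = window // 2
    padded = np.pad(gray, pad, mode="reflect")
    # One (window x window) patch per pixel; simple but memory-hungry,
    # so this is only suitable for small images.
    patches = sliding_window_view(padded, (window, window))
    threshold = patches.mean(axis=(2, 3)) + k * patches.std(axis=(2, 3))
    return gray < threshold  # assumes dark ink on light paper


def row_density(mask, sigma=1.0):
    """Estimate the probability that a text pixel falls in each row.

    Assumption: a 1D Gaussian-smoothed row profile stands in for the
    2D Gaussian modelling mentioned in the abstract.
    """
    counts = mask.sum(axis=1).astype(np.float64)
    total = counts.sum()
    if total == 0:
        return counts  # no text pixels at all
    density = counts / total
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    return np.convolve(density, kernel, mode="same")


def segment_lines(mask, sigma=1.0, fraction=0.5):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    A row belongs to a line when its smoothed density exceeds `fraction`
    of the peak density; `fraction` is a hypothetical tuning parameter.
    """
    density = row_density(mask, sigma)
    if density.max() == 0:
        return []
    active = density > fraction * density.max()
    # Pad with False so that every run has both a start and an end.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(changes[::2].tolist(), changes[1::2].tolist()))
```

Calling `segment_lines(niblack_binarize(page))` on a grayscale page array returns the row ranges of the detected text lines. Word segmentation could be sketched the same way by applying the density profile along the x direction within each detected line, again as an assumption rather than the method reported in the paper.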