Offline OCR system for machine-printed Turkish using template matching

Dena Rafaa Ahmed, Md. Jan Nordin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Optical character recognition (OCR) is one of the most important current applications of pattern recognition (PR): it converts scanned images of printed or handwritten text into a machine-readable, editable format such as a text document. The motivation for this study is to build an OCR system for offline machine-printed Turkish characters that converts an image file into readable, editable text. The system begins with a preprocessing step that converts the image into a binary format with reduced noise, ready for recognition; this step comprises digitization, binarization, thresholding, and noise removal. Next, the horizontal projection method is used for line detection and word allocation, and an 8-connected neighbors scheme is used to extract characters as a set of connected components. Template matching is then used to match the segmented characters against the template set stored in the OCR database in order to recognize the text. Unlike other approaches, template matching takes less time and requires no training samples, but it cannot recognize some letters with similar shapes or combined letters. For this reason, the system combines template matching with the size feature of the segmented characters to achieve accurate results. Finally, the recognized patterns are displayed in Notepad as readable, editable text. The machine-printed Turkish database consists of a list of 630 names of cities in Turkey, written in the Arial font at different sizes in uppercase, in lowercase, and with the first character of each word capitalized. The results show that the accuracy of the proposed system ranges from 96% to 100%.
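As a rough illustration of the pipeline outlined in the abstract, the Python sketch below chains global thresholding, horizontal-projection line detection, 8-connected component extraction, and template matching restricted by character size. It is not the authors' implementation: the function names, the fixed threshold of 128, the toy 3-pixel-high templates, and the use of size as an exact bounding-box filter are all assumptions made for this example. Noise removal and word allocation are omitted, and dotted letters such as i, ö or ü would split into several components that a real system would have to merge.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Global threshold (assumed value): darker pixels become ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def split_lines(img):
    """Horizontal projection: runs of rows that contain ink form text lines."""
    lines, start = [], None
    for y, row in enumerate(img):
        has_ink = sum(row) > 0
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(img)))
    return lines

def components(img):
    """Bounding boxes (top, left, bottom, right) of 8-connected ink regions,
    sorted left to right. bottom and right are exclusive."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):          # visit all 8 neighbours
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom + 1, right + 1))
    return sorted(boxes, key=lambda b: b[1])

def crop(img, box):
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]

def match(glyph, templates):
    """Best template by fraction of agreeing pixels, considering only
    templates whose size equals the glyph's (assumed use of the size feature)."""
    h, w = len(glyph), len(glyph[0])
    best_label, best_score = None, -1.0
    for label, tmpl in templates.items():
        if (len(tmpl), len(tmpl[0])) != (h, w):
            continue
        agree = sum(g == t for g_row, t_row in zip(glyph, tmpl)
                    for g, t in zip(g_row, t_row))
        score = agree / (h * w)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def recognize(gray, templates):
    img = binarize(gray)
    out = []
    for top, bottom in split_lines(img):
        line = img[top:bottom]
        out.append("".join(match(crop(line, box), templates) or "?"
                           for box in components(line)))
    return "\n".join(out)

# Hypothetical toy data: three tiny templates and a one-line "page".
TEMPLATES = {
    "I": [[1], [1], [1]],
    "L": [[1, 0], [1, 0], [1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}
page = [
    "........",
    "#..#.###",
    "#..#..#.",
    "##.#..#.",
    "........",
]
gray = [[0 if c == "#" else 255 for c in row] for row in page]
print(recognize(gray, TEMPLATES))  # LIT
```

Filtering by exact size keeps the example short; a practical matcher would more likely normalize glyphs to a common size and use size only to separate look-alike letters.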

Original language: English
Title of host publication: Advanced Materials Research
Pages: 565-569
Number of pages: 5
Volume: 341-342
DOIs: https://doi.org/10.4028/www.scientific.net/AMR.341-342.565
Publication status: Published - 2012
Event: 2011 2nd International Conference on Material and Manufacturing Technology, ICMMT 2011 - Xiamen
Duration: 8 Jul 2011 to 11 Jul 2011

Publication series

Name: Advanced Materials Research
Volume: 341-342
ISSN (Print): 1022-6680

Other

Other: 2011 2nd International Conference on Material and Manufacturing Technology, ICMMT 2011
City: Xiamen
Period: 8/7/11 to 11/7/11

Fingerprint

Optical character recognition
Template matching
Analog to digital conversion
Pattern recognition

Keywords

  • Image processing
  • OCR
  • Pattern Recognition
  • Template matching

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Ahmed, D. R., & Nordin, M. J. (2012). Offline OCR system for machine-printed Turkish using template matching. In Advanced Materials Research (Vol. 341-342, pp. 565-569). (Advanced Materials Research; Vol. 341-342). https://doi.org/10.4028/www.scientific.net/AMR.341-342.565

@inproceedings{eb0ee9ee97724cf482a4fc36bb8d83ac,
title = "Offline OCR system for machine-printed Turkish using template matching",
keywords = "Image processing, OCR, Pattern Recognition, Template matching",
author = "Ahmed, {Dena Rafaa} and Nordin, {Md. Jan}",
year = "2012",
doi = "10.4028/www.scientific.net/AMR.341-342.565",
language = "English",
isbn = "9783037852521",
volume = "341-342",
series = "Advanced Materials Research",
pages = "565--569",
booktitle = "Advanced Materials Research",

}

TY - GEN
T1 - Offline OCR system for machine-printed Turkish using template matching
AU - Ahmed, Dena Rafaa
AU - Nordin, Md. Jan
PY - 2012
Y1 - 2012
KW - Image processing
KW - OCR
KW - Pattern Recognition
KW - Template matching
UR - http://www.scopus.com/inward/record.url?scp=80053996883&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053996883&partnerID=8YFLogxK
U2 - 10.4028/www.scientific.net/AMR.341-342.565
DO - 10.4028/www.scientific.net/AMR.341-342.565
M3 - Conference contribution
SN - 9783037852521
VL - 341-342
T3 - Advanced Materials Research
SP - 565
EP - 569
BT - Advanced Materials Research
ER -