Automatic multi-lingual script recognition application

Waleed Abdel Karim Abu-Ain, Siti Norul Huda Sheikh Abdullah, Khairuddin Omar, Siti Zaharah Siti

Research output: Contribution to journal › Article

Abstract

Document Image Analysis and Recognition (DIAR) techniques are used to recognize text components and convert them into an editable format. A script is a set of graphical representations used to express a particular writing system, or a subset of one. A single language may adopt the writing styles of more than one script family: old Malay (Jawi), for example, uses the Arabic script, while modern Malay uses the Roman script. The seven major scripts used in this research, all in handwritten form, are Arabic, Devanagari, Hebrew, Thai, Greek, Cyrillic and Korean. Automatic Multi-lingual Script Recognition (AMSR) is one of the main challenges in the DIAR domain. To date, only a few attempts have been made at automated script identification in off-line handwritten document images. Most available AMSR applications deal only with printed documents and script types, and neglect handwritten and multilingual documents. The objective of this study, and the core of its methodology, is a proposed multilingual AMSR framework. The framework is tested on the Multilingual-HW datasets, which contain more than seven international unconstrained handwritten scripts, using the Grey-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) methods. The average accuracies of the two methods are about 97.01% and 85.29%, respectively. The proposed framework is expected to benefit communities that need to sort multilingual documents automatically. The research could also be extended to document forensics, or used by international relations agencies to identify unknown native documents.
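
The two texture descriptors named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the pixel offsets, number of grey levels, LBP neighbourhood or classifier, so the choices below (a single horizontal GLCM offset, three Haralick-style statistics, and the basic 8-neighbour LBP) are assumptions made only for illustration, and every function name is hypothetical.

```python
# Illustrative sketch only: pure-Python GLCM and LBP texture features.
# All parameters (offset, grey levels, 3x3 LBP neighbourhood) are assumptions.

def glcm(image, levels, dx=1, dy=0):
    """Normalised grey-level co-occurrence matrix for one pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[v / total for v in row] for row in counts]


def glcm_features(p):
    """Contrast, angular second moment and homogeneity of a normalised GLCM."""
    n = len(p)
    cells = [(i, j) for i in range(n) for j in range(n)]
    contrast = sum(p[i][j] * (i - j) ** 2 for i, j in cells)
    asm = sum(p[i][j] ** 2 for i, j in cells)
    homogeneity = sum(p[i][j] / (1 + abs(i - j)) for i, j in cells)
    return contrast, asm, homogeneity


# 8-neighbour offsets (row, col), clockwise starting at the top-left pixel.
_NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """Basic LBP code: one bit per neighbour that is >= the centre pixel."""
    centre = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_NEIGHBOURS):
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    rows, cols = len(image), len(image[0])
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            hist[lbp_code(image, r, c)] += 1
    n = sum(hist)
    return [h / n for h in hist]


# A tiny 4x4 patch with 4 grey levels, standing in for a handwritten text block.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

p = glcm(patch, levels=4)
features = list(glcm_features(p)) + lbp_histogram(patch)
print(len(features))  # length of the combined feature vector
```

In a script-recognition setting, a feature vector of this kind would be computed per text block and passed to a classifier trained on labelled samples of each script; the choice of classifier is outside what the abstract states.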

Original language: English
Pages (from-to): 203-221
Number of pages: 19
Journal: GEMA Online Journal of Language Studies
Volume: 18
Issue number: 3
DOIs: 10.17576/gema-2018-1803-12
Publication status: Published - 1 Aug 2018

Keywords

  • Automatic multi-lingual script recognition (AMSR)
  • Feature extraction
  • Grey-level co-occurrence matrix (GLCM)
  • Local binary pattern (LBP)
  • Statistical texture analysis

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Literature and Literary Theory

Cite this

Automatic multi-lingual script recognition application. / Abu-Ain, Waleed Abdel Karim; Sheikh Abdullah, Siti Norul Huda; Omar, Khairuddin; Siti, Siti Zaharah.

In: GEMA Online Journal of Language Studies, Vol. 18, No. 3, 01.08.2018, p. 203-221.

@article{93434d7c45f244b8bf7a7d944c90118d,
title = "Automatic multi-lingual script recognition application",
abstract = "Document Image Analysis and Recognition (DIAR) techniques are used to recognize text components and convert them into an editable format. A script is a set of graphical representations used to express a particular writing system, or a subset of one. A single language may adopt the writing styles of more than one script family: old Malay (Jawi), for example, uses the Arabic script, while modern Malay uses the Roman script. The seven major scripts used in this research, all in handwritten form, are Arabic, Devanagari, Hebrew, Thai, Greek, Cyrillic and Korean. Automatic Multi-lingual Script Recognition (AMSR) is one of the main challenges in the DIAR domain. To date, only a few attempts have been made at automated script identification in off-line handwritten document images. Most available AMSR applications deal only with printed documents and script types, and neglect handwritten and multilingual documents. The objective of this study, and the core of its methodology, is a proposed multilingual AMSR framework. The framework is tested on the Multilingual-HW datasets, which contain more than seven international unconstrained handwritten scripts, using the Grey-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) methods. The average accuracies of the two methods are about 97.01{\%} and 85.29{\%}, respectively. The proposed framework is expected to benefit communities that need to sort multilingual documents automatically. The research could also be extended to document forensics, or used by international relations agencies to identify unknown native documents.",
keywords = "Automatic multi-lingual script recognition (AMSR), Feature extraction, Grey-level co-occurrence matrix (GLCM), Local binary pattern (LBP), Statistical texture analysis",
author = "Abu-Ain, {Waleed Abdel Karim} and {Sheikh Abdullah}, {Siti Norul Huda} and Khairuddin Omar and Siti, {Siti Zaharah}",
year = "2018",
month = "8",
day = "1",
doi = "10.17576/gema-2018-1803-12",
language = "English",
volume = "18",
pages = "203--221",
journal = "GEMA Online Journal of Language Studies",
issn = "1675-8021",
publisher = "Universiti Kebangsaan Malaysia",
number = "3",

}

TY - JOUR

T1 - Automatic multi-lingual script recognition application

AU - Abu-Ain, Waleed Abdel Karim

AU - Sheikh Abdullah, Siti Norul Huda

AU - Omar, Khairuddin

AU - Siti, Siti Zaharah

PY - 2018/8/1

Y1 - 2018/8/1

AB - Document Image Analysis and Recognition (DIAR) techniques are used to recognize text components and convert them into an editable format. A script is a set of graphical representations used to express a particular writing system, or a subset of one. A single language may adopt the writing styles of more than one script family: old Malay (Jawi), for example, uses the Arabic script, while modern Malay uses the Roman script. The seven major scripts used in this research, all in handwritten form, are Arabic, Devanagari, Hebrew, Thai, Greek, Cyrillic and Korean. Automatic Multi-lingual Script Recognition (AMSR) is one of the main challenges in the DIAR domain. To date, only a few attempts have been made at automated script identification in off-line handwritten document images. Most available AMSR applications deal only with printed documents and script types, and neglect handwritten and multilingual documents. The objective of this study, and the core of its methodology, is a proposed multilingual AMSR framework. The framework is tested on the Multilingual-HW datasets, which contain more than seven international unconstrained handwritten scripts, using the Grey-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) methods. The average accuracies of the two methods are about 97.01% and 85.29%, respectively. The proposed framework is expected to benefit communities that need to sort multilingual documents automatically. The research could also be extended to document forensics, or used by international relations agencies to identify unknown native documents.

KW - Automatic multi-lingual script recognition (AMSR)

KW - Feature extraction

KW - Grey-level co-occurrence matrix (GLCM)

KW - Local binary pattern (LBP)

KW - Statistical texture analysis

UR - http://www.scopus.com/inward/record.url?scp=85052736657&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85052736657&partnerID=8YFLogxK

U2 - 10.17576/gema-2018-1803-12

DO - 10.17576/gema-2018-1803-12

M3 - Article

AN - SCOPUS:85052736657

VL - 18

SP - 203

EP - 221

JO - GEMA Online Journal of Language Studies

JF - GEMA Online Journal of Language Studies

SN - 1675-8021

IS - 3

ER -