POS-tagging Malay Corpus

A novel approach based on maximum entropy

Research output: Contribution to journal (Article)

Abstract

The Malay language is written in both Jawi and Roman scripts. In the past, Jawi writing was widely used by the Malay community and by foreigners, as can be seen in old documents. Old documents are at risk of background damage. To preserve this valuable information, there is a significant need to automate the processing of Jawi materials. According to previous literature, part-of-speech (POS) tagging is the first phase of automated text analysis, and language technologies can barely be developed without it. We review the existing POS-tagging approaches and propose a Malay Jawi POS tagger built with an extended maximum entropy (ME) approach on the NUWT Corpus. Results show that the proposed model yields higher accuracy than the state-of-the-art model.
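
For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of a generic maximum entropy (multinomial logistic) POS tagger. It is not the authors' extended ME model: the toy sentences (romanized Malay rather than Jawi), the three-tag set, the feature templates, and the function names `features`, `probs`, `train`, and `tag` are all assumptions made for illustration, and nothing here comes from the NUWT Corpus.

```python
import math
from collections import defaultdict

# Hypothetical toy data (romanized Malay); NOT taken from the NUWT Corpus.
TRAIN = [
    [("saya", "PRON"), ("makan", "VERB"), ("nasi", "NOUN")],
    [("dia", "PRON"), ("minum", "VERB"), ("air", "NOUN")],
    [("kami", "PRON"), ("makan", "VERB"), ("roti", "NOUN")],
]
TAGS = sorted({t for sent in TRAIN for _, t in sent})


def features(words, i, prev_tag):
    """Assumed feature templates: word identity, 2-char suffix, previous tag."""
    w = words[i]
    return [f"w={w}", f"suf={w[-2:]}", f"prev={prev_tag}"]


def probs(weights, feats):
    """p(tag | context) = exp(sum of active feature weights) / Z(context)."""
    scores = {t: sum(weights[(f, t)] for f in feats) for t in TAGS}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def train(data, epochs=50, lr=0.5):
    """Full-batch gradient ascent on the conditional log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for sent in data:
            words = [w for w, _ in sent]
            prev = "<s>"
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i, prev)
                p = probs(weights, feats)
                for f in feats:
                    for t in TAGS:
                        # observed feature count minus expected count
                        grad[(f, t)] += (1.0 if t == gold else 0.0) - p[t]
                prev = gold  # gold previous tag is used during training
        for key, g in grad.items():
            weights[key] += lr * g
    return weights


def tag(weights, words):
    """Greedy left-to-right decoding using the previously predicted tag."""
    tags, prev = [], "<s>"
    for i in range(len(words)):
        p = probs(weights, features(words, i, prev))
        prev = max(TAGS, key=lambda t: p[t])
        tags.append(prev)
    return tags


WEIGHTS = train(TRAIN)
print(tag(WEIGHTS, ["dia", "makan", "roti"]))
```

The feature function operates on arbitrary Unicode strings, so the same sketch would accept Jawi-script tokens; a realistic tagger would also need regularization, richer features, and a proper train/test split.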

Original language: English
Pages (from-to): 6-14
Number of pages: 9
Journal: International Journal of Engineering and Technology(UAE)
Volume: 7
Issue number: 3.20 Special Issue 20
Publication status: Published - 1 Jan 2018

Keywords

  • Jawi
  • Malay language
  • NLP pipeline task
  • POS-tags
  • Tagging approach

ASJC Scopus subject areas

  • Biotechnology
  • Computer Science (miscellaneous)
  • Environmental Engineering
  • Chemical Engineering(all)
  • Engineering(all)
  • Hardware and Architecture

Cite this

POS-tagging Malay Corpus: A novel approach based on maximum entropy. / Abu Bakar, Juhaida; Omar, Khairuddin; Nasrudin, Mohammad Faidzul; Murah, Mohd. Zamri.

In: International Journal of Engineering and Technology(UAE), Vol. 7, No. 3.20 Special Issue 20, 01.01.2018, p. 6-14.

@article{0ae97015caee4ff8a788573f8f8277e0,
title = "POS-tagging Malay Corpus: A novel approach based on maximum entropy",
abstract = "The Malay language is written in both Jawi and Roman scripts. In the past, Jawi writing was widely used by the Malay community and by foreigners, as can be seen in old documents. Old documents are at risk of background damage. To preserve this valuable information, there is a significant need to automate the processing of Jawi materials. According to previous literature, part-of-speech (POS) tagging is the first phase of automated text analysis, and language technologies can barely be developed without it. We review the existing POS-tagging approaches and propose a Malay Jawi POS tagger built with an extended maximum entropy (ME) approach on the NUWT Corpus. Results show that the proposed model yields higher accuracy than the state-of-the-art model.",
keywords = "Jawi, Malay language, NLP pipeline task, POS-tags, Tagging approach",
author = "{Abu Bakar}, Juhaida and Khairuddin Omar and Nasrudin, {Mohammad Faidzul} and Murah, {Mohd. Zamri}",
year = "2018",
month = "1",
day = "1",
language = "English",
volume = "7",
pages = "6--14",
journal = "International Journal of Engineering and Technology(UAE)",
issn = "2227-524X",
publisher = "Science Publishing Corporation Inc",
number = "3.20 Special Issue 20",
}

TY - JOUR

T1 - POS-tagging Malay Corpus

T2 - A novel approach based on maximum entropy

AU - Abu Bakar, Juhaida

AU - Omar, Khairuddin

AU - Nasrudin, Mohammad Faidzul

AU - Murah, Mohd. Zamri

PY - 2018/1/1

Y1 - 2018/1/1

N2 - The Malay language is written in both Jawi and Roman scripts. In the past, Jawi writing was widely used by the Malay community and by foreigners, as can be seen in old documents. Old documents are at risk of background damage. To preserve this valuable information, there is a significant need to automate the processing of Jawi materials. According to previous literature, part-of-speech (POS) tagging is the first phase of automated text analysis, and language technologies can barely be developed without it. We review the existing POS-tagging approaches and propose a Malay Jawi POS tagger built with an extended maximum entropy (ME) approach on the NUWT Corpus. Results show that the proposed model yields higher accuracy than the state-of-the-art model.

KW - Jawi

KW - Malay language

KW - NLP pipeline task

KW - POS-tags

KW - Tagging approach

UR - http://www.scopus.com/inward/record.url?scp=85063183180&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063183180&partnerID=8YFLogxK

M3 - Article

VL - 7

SP - 6

EP - 14

JO - International Journal of Engineering and Technology(UAE)

JF - International Journal of Engineering and Technology(UAE)

SN - 2227-524X

IS - 3.20 Special Issue 20

ER -