The effectiveness of using Malay affixes for handling unknown words in unsupervised HMM POS tagger

Research output: Contribution to journal › Article

Abstract

The challenge in unsupervised Hidden Markov Model (HMM) training for a part-of-speech (POS) tagger is that training depends on an untagged corpus; the only supervised data constraining the possible tags of a word is a dictionary. A morpheme-based POS guessing algorithm is introduced to assign probable tags to unknown words based on linguistically meaningful affixes. To that end, the exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined before tags are assigned to unknown words. The algorithm is integrated into an HMM tagger, which uses trained HMM parameters to tag new sentences. For unknown words, however, these parameters are absent. The algorithm therefore applies two methods for assigning unknown-word emissions in the HMM tagger: the first is based on a uniform distribution over all possible tags, and the second on a distribution proportionate to the marginal distribution of tags. The more effective method is shown to be morpheme-based POS guessing with unknown-word emissions substituted by a value proportionate to the marginal distribution of tags.
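
The two emission strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the affix table, the tag names (`NOUN`, `VERB`, `ADJ`), the matching order (circumfix, then prefix, then suffix) and the function names are all assumptions made for illustration, and the abstract does not specify how the values are normalised, so normalising over the candidate tags is one possible reading.

```python
from typing import Dict, List

# Hypothetical affix-to-tag table (illustrative only; the paper's actual
# affix inventory and tagset are not given in the abstract).
CIRCUMFIXES = {("ke", "an"): ["NOUN"], ("per", "an"): ["NOUN"]}
PREFIXES = {"ber": ["VERB"], "me": ["VERB"], "pe": ["NOUN"], "ter": ["VERB", "ADJ"]}
SUFFIXES = {"kan": ["VERB"], "an": ["NOUN"]}
ALL_TAGS = ["NOUN", "VERB", "ADJ"]


def guess_tags(word: str) -> List[str]:
    """Return candidate tags for an unknown word from its affixes.

    Assumed matching order: circumfix, then prefix, then suffix.
    Falls back to every tag when no affix matches.
    """
    w = word.lower()
    for (pre, suf), tags in CIRCUMFIXES.items():
        if w.startswith(pre) and w.endswith(suf) and len(w) > len(pre) + len(suf):
            return list(tags)
    for pre, tags in PREFIXES.items():
        if w.startswith(pre) and len(w) > len(pre):
            return list(tags)
    for suf, tags in SUFFIXES.items():
        if w.endswith(suf) and len(w) > len(suf):
            return list(tags)
    return list(ALL_TAGS)


def uniform_emission(candidates: List[str]) -> Dict[str, float]:
    """Method 1: give every candidate tag the same weight."""
    p = 1.0 / len(candidates)
    return {tag: p for tag in candidates}


def marginal_emission(candidates: List[str],
                      tag_marginals: Dict[str, float]) -> Dict[str, float]:
    """Method 2: weight each candidate tag in proportion to its marginal
    probability, renormalised over the candidates (an assumption)."""
    total = sum(tag_marginals[tag] for tag in candidates)
    return {tag: tag_marginals[tag] / total for tag in candidates}


if __name__ == "__main__":
    # Made-up marginal tag distribution, standing in for trained HMM statistics.
    marginals = {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2}
    for word in ["kebaikan", "berjalan", "terbesar"]:
        tags = guess_tags(word)
        print(word, tags, uniform_emission(tags), marginal_emission(tags, marginals))
```

With an ambiguous affix such as the hypothetical `ter` entry, the uniform method splits the weight evenly between the candidates, whereas the marginal method favours the tag that is more frequent overall.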

Original language: English
Pages (from-to): 9-12
Number of pages: 4
Journal: International Journal of Engineering and Technology(UAE)
Volume: 7
Issue number: 4.29 Special Issue 29
Publication status: Published - 1 Jan 2018

Fingerprint

Hidden Markov models
Language
Glossaries

Keywords

  • Malay
  • POS tagger
  • Unsupervised HMM

ASJC Scopus subject areas

  • Biotechnology
  • Computer Science (miscellaneous)
  • Environmental Engineering
  • Chemical Engineering(all)
  • Engineering(all)
  • Hardware and Architecture

Cite this

The effectiveness of using Malay affixes for handling unknown words in unsupervised HMM POS tagger. / Mohamed, Hassan; Omar, Nazlia; Ab Aziz, Mohd Juzaiddin.

In: International Journal of Engineering and Technology(UAE), Vol. 7, No. 4.29 Special Issue 29, 01.01.2018, p. 9-12.

@article{2447444658b34f8ab6c981592ae46b97,
title = "The effectiveness of using Malay affixes for handling unknown words in unsupervised HMM POS tagger",
abstract = "The challenge in unsupervised Hidden Markov Model (HMM) training for a part-of-speech (POS) tagger is that training depends on an untagged corpus; the only supervised data constraining the possible tags of a word is a dictionary. A morpheme-based POS guessing algorithm is introduced to assign probable tags to unknown words based on linguistically meaningful affixes. To that end, the exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined before tags are assigned to unknown words. The algorithm is integrated into an HMM tagger, which uses trained HMM parameters to tag new sentences. For unknown words, however, these parameters are absent. The algorithm therefore applies two methods for assigning unknown-word emissions in the HMM tagger: the first is based on a uniform distribution over all possible tags, and the second on a distribution proportionate to the marginal distribution of tags. The more effective method is shown to be morpheme-based POS guessing with unknown-word emissions substituted by a value proportionate to the marginal distribution of tags.",
keywords = "Malay, POS tagger, Unsupervised HMM",
author = "Hassan Mohamed and Nazlia Omar and {Ab Aziz}, {Mohd Juzaiddin}",
year = "2018",
month = "1",
day = "1",
language = "English",
volume = "7",
pages = "9--12",
journal = "International Journal of Engineering and Technology(UAE)",
issn = "2227-524X",
publisher = "Science Publishing Corporation Inc",
number = "4.29 Special Issue 29",

}

TY - JOUR

T1 - The effectiveness of using Malay affixes for handling unknown words in unsupervised HMM POS tagger

AU - Mohamed, Hassan

AU - Omar, Nazlia

AU - Ab Aziz, Mohd Juzaiddin

PY - 2018/1/1

Y1 - 2018/1/1

N2 - The challenge in unsupervised Hidden Markov Model (HMM) training for a part-of-speech (POS) tagger is that training depends on an untagged corpus; the only supervised data constraining the possible tags of a word is a dictionary. A morpheme-based POS guessing algorithm is introduced to assign probable tags to unknown words based on linguistically meaningful affixes. To that end, the exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined before tags are assigned to unknown words. The algorithm is integrated into an HMM tagger, which uses trained HMM parameters to tag new sentences. For unknown words, however, these parameters are absent. The algorithm therefore applies two methods for assigning unknown-word emissions in the HMM tagger: the first is based on a uniform distribution over all possible tags, and the second on a distribution proportionate to the marginal distribution of tags. The more effective method is shown to be morpheme-based POS guessing with unknown-word emissions substituted by a value proportionate to the marginal distribution of tags.

KW - Malay

KW - POS tagger

KW - Unsupervised HMM

UR - http://www.scopus.com/inward/record.url?scp=85063181457&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063181457&partnerID=8YFLogxK

M3 - Article

VL - 7

SP - 9

EP - 12

JO - International Journal of Engineering and Technology(UAE)

JF - International Journal of Engineering and Technology(UAE)

SN - 2227-524X

IS - 4.29 Special Issue 29

ER -