Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

Assigning a part of speech (POS) to each running word in a sentence is one of the pipeline processes in Natural Language Processing (NLP) tasks. This paper examines a statistical POS tagger that uses a trigram Hidden Markov Model to tag Malay sentences. The main problem for this approach is predicting the POS of words that do not appear in the training corpus; the tagger must guess such a word's POS from its surrounding information. The predictor is built on information from a word's prefixes, its suffixes, or a combination of the two. Linear successive abstraction is used to smooth the probability distribution over parts of speech for unknown Malay words given their prefix or suffix information, whereas the joint probability distribution is used when prefix and suffix information are combined. The best performance in predicting the POS of unknown words is obtained from prefix information, using the first three characters of the word, with a tagging accuracy of 67.9%. This shows that a statistical tagger for Malay based on a Hidden Markov Model can predict the POS of unknown words with promising accuracy.
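
The prefix-based smoothing described in the abstract can be illustrated with a minimal Python sketch. It follows the usual recursive form of linear successive abstraction, in which the estimate for a longer prefix is interpolated with the estimate for the next shorter one; the abstract does not give the authors' exact formula, so this form is an assumption. The function names, the tags `VB` and `NN`, the toy corpus, and the default weight `theta` (the standard deviation of the unconditional tag probabilities, a choice borrowed from Brants's TnT tagger) are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict
from statistics import pstdev


def train_prefix_model(tagged_words, max_len=3):
    """Count tags for every word prefix of length 0..max_len.

    The empty prefix "" collects the unconditional tag counts.
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for i in range(min(max_len, len(w)) + 1):
            counts[w[:i]][tag] += 1
    return counts


def tag_distribution(word, counts, max_len=3, theta=None):
    """Estimate P(tag | prefix) for an unknown word.

    Starts from the unconditional tag distribution and refines it with
    successively longer prefixes:
        P_i(t) = (ML_i(t) + theta * P_{i-1}(t)) / (1 + theta)
    where ML_i is the relative frequency of t among training words that
    share the first i characters. Stops at the first unseen prefix.
    """
    tags = list(counts[""])
    total = sum(counts[""].values())
    prob = {t: counts[""][t] / total for t in tags}
    if theta is None:
        # Assumed weighting: spread of the unconditional tag probabilities.
        theta = pstdev(list(prob.values()))
    w = word.lower()
    for i in range(1, min(max_len, len(w)) + 1):
        seen = counts.get(w[:i])
        if not seen:
            break  # no training word shares this prefix
        n = sum(seen.values())
        prob = {t: (seen[t] / n + theta * prob[t]) / (1 + theta) for t in tags}
    return prob


# Hypothetical toy corpus; tags are illustrative, not the paper's tagset.
TOY_CORPUS = [
    ("menulis", "VB"), ("membaca", "VB"), ("menari", "VB"),
    ("berlari", "VB"), ("buku", "NN"), ("meja", "NN"),
    ("penulis", "NN"), ("mentega", "NN"), ("rumah", "NN"),
]
counts = train_prefix_model(TOY_CORPUS)

if __name__ == "__main__":
    # "menyanyi" is not in the toy corpus; its tag is guessed from "m", "me", "men".
    dist = tag_distribution("menyanyi", counts)
    print(max(dist, key=dist.get), dist)
```

Limiting `max_len` to 3 mirrors the abstract's finding that the first three characters gave the best result; a suffix-based variant would slice from the end of the word instead.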

Original language: English
Title of host publication: 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011
Pages: 231-236
Number of pages: 6
DOIs: https://doi.org/10.1109/STAIR.2011.5995794
Publication status: Published - 2011
Event: 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011 - Putrajaya
Duration: 28 Jun 2011 to 29 Jun 2011

Other

Other: 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011
City: Putrajaya
Period: 28/6/11 to 29/6/11

Fingerprint

Hidden Markov models
Probability distributions
Pipelines
Processing

Keywords

  • HMM
  • Malay POS
  • part of speech
  • POS tagger

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Information Systems

Cite this

Mohamed, H., Omar, N., & Ab Aziz, M. J. (2011). Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach. In 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011 (pp. 231-236). [5995794] https://doi.org/10.1109/STAIR.2011.5995794

Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach. / Mohamed, Hassan; Omar, Nazlia; Ab Aziz, Mohd Juzaiddin.

2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011. 2011. p. 231-236 5995794.

Mohamed, H, Omar, N & Ab Aziz, MJ 2011, Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach. in 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011., 5995794, pp. 231-236, 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011, Putrajaya, 28/6/11. https://doi.org/10.1109/STAIR.2011.5995794
Mohamed H, Omar N, Ab Aziz MJ. Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach. In 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011. 2011. p. 231-236. 5995794 https://doi.org/10.1109/STAIR.2011.5995794
Mohamed, Hassan ; Omar, Nazlia ; Ab Aziz, Mohd Juzaiddin. / Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach. 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011. 2011. pp. 231-236
@inproceedings{4df32e869aa34235b253368680948854,
title = "Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach",
keywords = "HMM, Malay POS, part of speech, POS tagger",
author = "Hassan Mohamed and Nazlia Omar and {Ab Aziz}, {Mohd Juzaiddin}",
year = "2011",
doi = "10.1109/STAIR.2011.5995794",
language = "English",
isbn = "9781612843537",
pages = "231--236",
booktitle = "2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011",

}

TY - GEN

T1 - Statistical Malay part-of-speech (POS) tagger using Hidden Markov approach

AU - Mohamed, Hassan

AU - Omar, Nazlia

AU - Ab Aziz, Mohd Juzaiddin

PY - 2011

Y1 - 2011

KW - HMM

KW - Malay POS

KW - part of speech

KW - POS tagger

UR - http://www.scopus.com/inward/record.url?scp=80052601464&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80052601464&partnerID=8YFLogxK

U2 - 10.1109/STAIR.2011.5995794

DO - 10.1109/STAIR.2011.5995794

M3 - Conference contribution

AN - SCOPUS:80052601464

SN - 9781612843537

SP - 231

EP - 236

BT - 2011 International Conference on Semantic Technology and Information Retrieval, STAIR 2011

ER -