Automatic part of speech tagging for Arabic: An experiment using bigram hidden Markov model

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Citations (Scopus)

Abstract

Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. A POS tagger is a useful preprocessing tool in many natural language processing (NLP) applications, such as information extraction and information retrieval. In this paper, we present preliminary results of applying a bigram hidden Markov model (HMM) to the POS tagging problem for Arabic. In addition, we use different smoothing algorithms with the HMM to overcome the data sparseness problem. The Viterbi algorithm is used to assign the most probable tag to each word in the text. Furthermore, several lexical models have been defined and implemented to guess the POS of unknown words based on word substrings, i.e., prefix probability, suffix probability, or the linear interpolation of both. The average overall accuracy of this tagger is 95.8%.
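The pipeline the abstract describes (a bigram HMM, smoothing against sparse counts, and Viterbi decoding) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the transliterated tokens, the tag names, the function names, and the choice of add-one (Laplace) smoothing are all assumptions made for illustration, and unknown words are handled here only through the smoothed emission counts rather than through the prefix and suffix models mentioned in the abstract.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start pseudo-tag

def train(tagged_sentences):
    """Count tag bigrams and (tag, word) emissions from tagged sentences."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        prev = START
        tag_count[prev] += 1
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(t for t in tag_count if t != START)
    return trans, emit, tag_count, tags, vocab

def log_trans(model, prev, tag):
    """Add-one smoothed log P(tag | prev)."""
    trans, _, tag_count, tags, _ = model
    return math.log((trans[(prev, tag)] + 1) / (tag_count[prev] + len(tags)))

def log_emit(model, tag, word):
    """Add-one smoothed log P(word | tag); one extra slot covers unseen words."""
    _, emit, tag_count, _, vocab = model
    return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab) + 1))

def viterbi(model, words):
    """Return the most probable tag sequence for words under the bigram HMM."""
    tags = model[3]
    if not words:
        return []
    # best[i][t]: log-probability of the best tag path ending in tag t at position i
    best = [{t: log_trans(model, START, t) + log_emit(model, t, words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_trans(model, p, t))
            best[i][t] = best[i - 1][prev] + log_trans(model, prev, t) + log_emit(model, t, words[i])
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy, hand-made corpus of transliterated tokens (illustrative only).
corpus = [
    [("kataba", "VERB"), ("al-waladu", "NOUN"), ("al-darsa", "NOUN")],
    [("qaraa", "VERB"), ("al-waladu", "NOUN"), ("kitaban", "NOUN")],
    [("al-waladu", "NOUN"), ("kabirun", "ADJ")],
]
model = train(corpus)
print(viterbi(model, ["kataba", "al-waladu", "kitaban"]))  # ['VERB', 'NOUN', 'NOUN']
```

Working in log space avoids numerical underflow on long sentences. Add-one smoothing is only the simplest option; the abstract reports experiments with several smoothing algorithms without naming them here.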

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 361-370
Number of pages: 10
Volume: 6401 LNAI
DOIs: https://doi.org/10.1007/978-3-642-16248-0_52
Publication status: Published - 2010
Event: 5th International Conference on Rough Set and Knowledge Technology, RSKT 2010 - Beijing
Duration: 15 Oct 2010 - 17 Oct 2010

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 6401 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 5th International Conference on Rough Set and Knowledge Technology, RSKT 2010
City: Beijing
Period: 15/10/10 - 17/10/10

Fingerprint

Tagging
Hidden Markov models
Markov Model
Experiment
Viterbi Algorithm
Smoothing Algorithm
Suffix
Linear Interpolation
Information Extraction
Prefix
Probable
Information Retrieval
Natural Language
Preprocessing
Assign
Interpolation
Speech

Keywords

  • Arabic languages
  • Hidden Markov model
  • Part of speech tagging
  • Smoothing methods

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Albared, M., Omar, N., Ab Aziz, M. J., & Ahmad Nazri, M. Z. (2010). Automatic part of speech tagging for Arabic: An experiment using bigram hidden Markov model. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6401 LNAI, pp. 361-370). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6401 LNAI). https://doi.org/10.1007/978-3-642-16248-0_52

Automatic part of speech tagging for Arabic: An experiment using bigram hidden Markov model. / Albared, Mohammed; Omar, Nazlia; Ab Aziz, Mohd Juzaiddin; Ahmad Nazri, Mohd Zakree.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 6401 LNAI 2010. p. 361-370 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6401 LNAI).

Albared, M, Omar, N, Ab Aziz, MJ & Ahmad Nazri, MZ 2010, Automatic part of speech tagging for Arabic: An experiment using bigram hidden Markov model. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 6401 LNAI, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6401 LNAI, pp. 361-370, 5th International Conference on Rough Set and Knowledge Technology, RSKT 2010, Beijing, 15/10/10. https://doi.org/10.1007/978-3-642-16248-0_52
Albared M, Omar N, Ab Aziz MJ, Ahmad Nazri MZ. Automatic part of speech tagging for Arabic: An experiment using bigram hidden Markov model. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 6401 LNAI. 2010. p. 361-370. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-642-16248-0_52
Albared, Mohammed ; Omar, Nazlia ; Ab Aziz, Mohd Juzaiddin ; Ahmad Nazri, Mohd Zakree. / Automatic part of speech tagging for Arabic: An experiment using bigram hidden Markov model. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 6401 LNAI 2010. pp. 361-370 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{f611779bffa14293913278fe3f8b27df,
title = "Automatic part of speech tagging for {A}rabic: An experiment using bigram hidden {M}arkov model",
abstract = "Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. A POS tagger is a useful preprocessing tool in many natural language processing (NLP) applications, such as information extraction and information retrieval. In this paper, we present preliminary results of applying a bigram hidden Markov model (HMM) to the POS tagging problem for Arabic. In addition, we use different smoothing algorithms with the HMM to overcome the data sparseness problem. The Viterbi algorithm is used to assign the most probable tag to each word in the text. Furthermore, several lexical models have been defined and implemented to guess the POS of unknown words based on word substrings, i.e., prefix probability, suffix probability, or the linear interpolation of both. The average overall accuracy of this tagger is 95.8\%.",
keywords = "Arabic languages, Hidden Markov model, Part of speech tagging, Smoothing methods",
author = "Mohammed Albared and Nazlia Omar and {Ab Aziz}, {Mohd Juzaiddin} and {Ahmad Nazri}, {Mohd Zakree}",
year = "2010",
doi = "10.1007/978-3-642-16248-0_52",
language = "English",
isbn = "3642162479",
volume = "6401 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "361--370",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY  - GEN
T1  - Automatic part of speech tagging for Arabic
T2  - An experiment using bigram hidden Markov model
AU  - Albared, Mohammed
AU  - Omar, Nazlia
AU  - Ab Aziz, Mohd Juzaiddin
AU  - Ahmad Nazri, Mohd Zakree
PY  - 2010
Y1  - 2010
AB  - Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. A POS tagger is a useful preprocessing tool in many natural language processing (NLP) applications, such as information extraction and information retrieval. In this paper, we present preliminary results of applying a bigram hidden Markov model (HMM) to the POS tagging problem for Arabic. In addition, we use different smoothing algorithms with the HMM to overcome the data sparseness problem. The Viterbi algorithm is used to assign the most probable tag to each word in the text. Furthermore, several lexical models have been defined and implemented to guess the POS of unknown words based on word substrings, i.e., prefix probability, suffix probability, or the linear interpolation of both. The average overall accuracy of this tagger is 95.8%.
KW  - Arabic languages
KW  - Hidden Markov model
KW  - Part of speech tagging
KW  - Smoothing methods
UR  - http://www.scopus.com/inward/record.url?scp=78349269090&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=78349269090&partnerID=8YFLogxK
U2  - 10.1007/978-3-642-16248-0_52
DO  - 10.1007/978-3-642-16248-0_52
M3  - Conference contribution
AN  - SCOPUS:78349269090
SN  - 3642162479
SN  - 9783642162473
VL  - 6401 LNAI
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 361
EP  - 370
BT  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER  -