Probabilistic Arabic part of speech tagger with unknown words handling

Mohammed Albared, Tareq Al-Moslmi, Nazlia Omar, Adel Al-Shabi, Fadl Mutaher Ba-Alwi

Research output: Contribution to journal (Article)

5 Citations (Scopus)

Abstract

Part-of-speech (POS) tagging is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With little training data, guessing the POS of unknown words is the main problem. The problem is more serious in languages such as Arabic, which have a very large vocabulary and a rich, complex morphology. To handle it in an Arabic POS tagger, we study the effect of integrating a lexicon-based morphological analyzer on the tagger's performance. In addition, several lexical models are empirically defined, implemented, and evaluated. These models are based essentially on the internal structure and formation process of Arabic words. Several combinations of these models are also presented. The POS tagger was trained on a corpus of 29,300 words and uses a tagset of 24 POS tags. Our system achieves state-of-the-art overall accuracy in Arabic part-of-speech tagging and outperforms other Arabic taggers in unknown-word POS tagging accuracy.
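To make the general idea concrete, here is a minimal sketch of a trigram HMM tagger with a fallback for unknown words. It is not the authors' system: the abstract does not specify their lexical models, so the suffix-based guesser, the add-one smoothing, the names `train`, `log_trans`, `log_emit`, `tag`, `MAX_SUFFIX`, and the English toy corpus are all assumptions chosen for illustration. The lexicon-based morphological analyzer is omitted.

```python
import math
from collections import defaultdict

START = "<s>"    # padding tag for the start of a sentence
MAX_SUFFIX = 3   # longest word ending considered (arbitrary, assumed value)

# Invented toy data, used only to exercise the code.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("walked", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("jumped", "VERB")],
    [("a", "DET"), ("bird", "NOUN"), ("talked", "VERB")],
]

def train(tagged_sents):
    """Count tag trigrams, word emissions, and word-ending emissions."""
    trans = defaultdict(lambda: defaultdict(int))   # (t1, t2) -> t3 -> count
    emit = defaultdict(lambda: defaultdict(int))    # tag -> word -> count
    suffix = defaultdict(lambda: defaultdict(int))  # tag -> ending -> count
    tag_count = defaultdict(int)
    vocab, suffixes = set(), set()
    for sent in tagged_sents:
        u, v = START, START
        for word, t in sent:
            trans[(u, v)][t] += 1
            emit[t][word] += 1
            tag_count[t] += 1
            vocab.add(word)
            for k in range(1, min(MAX_SUFFIX, len(word)) + 1):
                suffix[t][word[-k:]] += 1
                suffixes.add(word[-k:])
            u, v = v, t
    return {"trans": trans, "emit": emit, "suffix": suffix,
            "tag_count": tag_count, "vocab": vocab, "suffixes": suffixes}

def log_trans(model, u, v, t):
    """Add-one smoothed log P(t | u, v)."""
    row = model["trans"].get((u, v), {})
    total = sum(row.values()) + len(model["tag_count"])
    return math.log((row.get(t, 0) + 1) / total)

def log_emit(model, word, t):
    """Add-one smoothed log P(word | t); unknown words back off to endings."""
    n = model["tag_count"].get(t, 0)
    if word in model["vocab"]:
        c = model["emit"].get(t, {}).get(word, 0)
        return math.log((c + 1) / (n + len(model["vocab"])))
    # Unknown word: score it by the longest ending seen in training.
    for k in range(min(MAX_SUFFIX, len(word)), 0, -1):
        end = word[-k:]
        if end in model["suffixes"]:
            c = model["suffix"].get(t, {}).get(end, 0)
            return math.log((c + 1) / (n + len(model["suffixes"])))
    return -math.log(len(model["tag_count"]))  # no known ending: uniform

def tag(model, words):
    """Viterbi search over tag pairs, the state space of a trigram HMM."""
    tags = list(model["tag_count"])
    # best[(u, v)] = (log-probability, tags so far) of the best path ending in u, v
    best = {(START, START): (0.0, [])}
    for w in words:
        nxt = {}
        for (u, v), (score, path) in best.items():
            for t in tags:
                s = score + log_trans(model, u, v, t) + log_emit(model, w, t)
                if (v, t) not in nxt or s > nxt[(v, t)][0]:
                    nxt[(v, t)] = (s, path + [t])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]
```

Calling `tag(train(TOY_CORPUS), ["the", "dog", "kicked"])` tags a sentence whose last word never occurs in the training data, so its emission score comes entirely from the ending it shares with training words. With a small training corpus most of the difficulty sits in that fallback branch, which is the part the paper's lexical models address.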

Original language: English
Pages (from-to): 236-246
Number of pages: 11
Journal: Journal of Theoretical and Applied Information Technology
Volume: 90
Issue number: 2
Publication status: Published - 31 Aug 2016

Keywords

  • Arabic language
  • Part of speech tagger
  • Unknown word guessing

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Probabilistic Arabic part of speech tagger with unknown words handling. / Albared, Mohammed; Al-Moslmi, Tareq; Omar, Nazlia; Al-Shabi, Adel; Ba-Alwi, Fadl Mutaher.

In: Journal of Theoretical and Applied Information Technology, Vol. 90, No. 2, 31.08.2016, p. 236-246.

Albared, M, Al-Moslmi, T, Omar, N, Al-Shabi, A & Ba-Alwi, FM 2016, 'Probabilistic Arabic part of speech tagger with unknown words handling', Journal of Theoretical and Applied Information Technology, vol. 90, no. 2, pp. 236-246.
@article{9055d3fad80443a9ab88aff54c77553c,
title = "Probabilistic Arabic part of speech tagger with unknown words handling",
keywords = "Arabic language, Part of speech tagger, Unknown word guessing",
author = "Mohammed Albared and Tareq Al-Moslmi and Nazlia Omar and Adel Al-Shabi and Ba-Alwi, {Fadl Mutaher}",
year = "2016",
month = "8",
day = "31",
language = "English",
volume = "90",
pages = "236--246",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "2",
}

TY  - JOUR
T1  - Probabilistic Arabic part of speech tagger with unknown words handling
AU  - Albared, Mohammed
AU  - Al-Moslmi, Tareq
AU  - Omar, Nazlia
AU  - Al-Shabi, Adel
AU  - Ba-Alwi, Fadl Mutaher
PY  - 2016/8/31
Y1  - 2016/8/31
KW  - Arabic language
KW  - Part of speech tagger
KW  - Unknown word guessing
UR  - http://www.scopus.com/inward/record.url?scp=84984890194&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84984890194&partnerID=8YFLogxK
M3  - Article
AN  - SCOPUS:84984890194
VL  - 90
SP  - 236
EP  - 246
JO  - Journal of Theoretical and Applied Information Technology
JF  - Journal of Theoretical and Applied Information Technology
SN  - 1992-8645
IS  - 2
ER  -