Developing a competitive HMM Arabic POS tagger using small training corpora

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

9 Citations (Scopus)

Abstract

Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. POS tagging is an important processing step in many natural language systems, such as information extraction and question answering. This paper presents a study that aims to identify an appropriate strategy for developing a fast and accurate Arabic statistical POS tagger when only a limited amount of training material is available. This is an essential consideration for languages like Arabic, for which annotated resources are scarce and not easily available. Different configurations of an HMM tagger are studied: bigram and trigram models are tested, as well as different smoothing techniques. In addition, a new lexical model is defined to handle POS guessing for unknown words, based on the linear interpolation of word suffix probability and word prefix probability. Several experiments are carried out to determine the performance of the different HMM configurations with two small training corpora. The first corpus contains about 29,300 words from both Modern Standard Arabic and Classical Arabic. The second corpus is the Quranic Arabic Corpus, which consists of 77,430 words of Quranic Arabic.
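
The abstract does not give the exact form of the unknown-word lexical model, so the following is only a minimal sketch of the general idea: estimate a tag distribution from a word's suffix and from its prefix, then combine the two by linear interpolation. The affix length `k`, the weight `lam`, the fallback to the tag prior for unseen affixes, the function names, and the transliterated toy data are all assumptions made for illustration, not details taken from the paper. The sketch estimates P(tag | word); turning that into HMM emission probabilities is left out.

```python
from collections import Counter, defaultdict


def train_affix_model(tagged_words, k=2):
    """Count tags per word-final and word-initial substring of length k."""
    suffix_counts = defaultdict(Counter)
    prefix_counts = defaultdict(Counter)
    tag_counts = Counter()
    for word, tag in tagged_words:
        suffix_counts[word[-k:]][tag] += 1
        prefix_counts[word[:k]][tag] += 1
        tag_counts[tag] += 1
    return suffix_counts, prefix_counts, tag_counts


def relative_freq(affix_counts, affix, tag, tag_counts):
    """Relative frequency of tag given affix; tag prior if affix is unseen."""
    seen = affix_counts.get(affix)
    if not seen:
        return tag_counts[tag] / sum(tag_counts.values())
    return seen[tag] / sum(seen.values())


def guess_tag_distribution(word, model, lam=0.5, k=2):
    """Linearly interpolate suffix-based and prefix-based tag estimates.

    lam weights the suffix estimate and (1 - lam) the prefix estimate.
    """
    suffix_counts, prefix_counts, tag_counts = model
    dist = {}
    for tag in tag_counts:
        p_suffix = relative_freq(suffix_counts, word[-k:], tag, tag_counts)
        p_prefix = relative_freq(prefix_counts, word[:k], tag, tag_counts)
        dist[tag] = lam * p_suffix + (1 - lam) * p_prefix
    return dist


# Hypothetical transliterated toy data, for illustration only.
toy_corpus = [
    ("kitab", "NOUN"),
    ("kutub", "NOUN"),
    ("maktab", "NOUN"),
    ("yaktub", "VERB"),
    ("yadrus", "VERB"),
]
model = train_affix_model(toy_corpus, k=2)
distribution = guess_tag_distribution("yaqtub", model, lam=0.5, k=2)
best_tag = max(distribution, key=distribution.get)
```

Because each component is a proper distribution over tags and the weights sum to one, the interpolated scores also sum to one. In practice `lam` and `k` would be tuned on held-out data.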

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 288-296
Number of pages: 9
Volume: 6591 LNAI
Edition: PART 1
DOIs: https://doi.org/10.1007/978-3-642-20039-7-29
Publication status: Published - 2011
Event: 3rd International Conference on Intelligent Information and Database Systems, ACIIDS 2011 - Daegu
Duration: 20 Apr 2011 to 22 Apr 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 6591 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 3rd International Conference on Intelligent Information and Database Systems, ACIIDS 2011
City: Daegu
Period: 20/4/11 to 22/4/11

Keywords

  • Arabic languages
  • Hidden Markov model
  • Unknown words

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Albared, M., Omar, N., & Ab Aziz, M. J. (2011). Developing a competitive HMM Arabic POS tagger using small training corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (PART 1 ed., Vol. 6591 LNAI, pp. 288-296). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6591 LNAI, No. PART 1). https://doi.org/10.1007/978-3-642-20039-7-29

@inproceedings{e60557b26cf8495d972fce944c793884,
title = "Developing a competitive HMM Arabic POS tagger using small training corpora",
keywords = "Arabic languages, Hidden Markov model, Unknown words",
author = "Mohammed Albared and Nazlia Omar and {Ab Aziz}, {Mohd Juzaiddin}",
year = "2011",
doi = "10.1007/978-3-642-20039-7-29",
language = "English",
isbn = "9783642200380",
volume = "6591 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
number = "PART 1",
pages = "288--296",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
edition = "PART 1",

}

TY - GEN

T1 - Developing a competitive HMM Arabic POS tagger using small training corpora

AU - Albared, Mohammed

AU - Omar, Nazlia

AU - Ab Aziz, Mohd Juzaiddin

PY - 2011

Y1 - 2011

KW - Arabic languages

KW - Hidden Markov model

KW - Unknown words

UR - http://www.scopus.com/inward/record.url?scp=80053181634&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053181634&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-20039-7-29

DO - 10.1007/978-3-642-20039-7-29

M3 - Conference contribution

SN - 9783642200380

VL - 6591 LNAI

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 288

EP - 296

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -