Arabic part of speech tagging using K-Nearest Neighbour and Naive Bayes classifiers combination

Rund Mahafdah, Nazlia Omar, Omaia Al-Omari

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

Part-of-speech (POS) tagging is an important preprocessing step in many natural language processing applications, such as text summarization, question answering and information retrieval systems. It is the process of assigning every word in a given context to its appropriate part of speech. Various POS tagging techniques have been developed and evaluated in the literature. However, some POS tagging models are known to perform poorly on Quranic Arabic because of the complexity of the text. This complexity presents several challenges for POS tagging, such as high ambiguity, data sparseness and a large number of unknown words. With this in mind, the main problem is to find out how existing, efficient methods perform on Arabic and how the Quranic corpus can be used to produce an efficient framework for Arabic POS tagging. We propose an experimental classifier-combination framework for an Arabic POS tagger, selecting the two best diverse probabilistic classifiers used in numerous works on non-Arabic languages, namely K-Nearest Neighbour (KNN) and Naive Bayes (NB). Majority voting is used as the combination strategy to exploit the advantages of the classifiers. In addition, an in-depth study was conducted on a large list of features to identify effective features and investigate their role in enhancing the performance of POS taggers for Quranic Arabic. Hence, this study aims to efficiently integrate different feature sets and tagging algorithms to produce a more accurate POS tagging procedure. The data used in this study is the Arabic Quranic Corpus, an annotated linguistic resource of 77,430 words that provides the Arabic grammar, syntax and morphology of each word in the Holy Quran. The highest accuracy achieved is 98.32%, which may be a significant improvement over the state of the art for Quranic Arabic text.
The most effective features that yield this accuracy are a combination of w0 (the current word), p0 (the POS of the current word), p-3 (the POS of the word three positions before), p-2 (the POS of the word two positions before) and p-1 (the POS of the word before).
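
The classifier combination described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `features`, `NaiveBayes`, `KNN`, `majority_vote`, `train` and `tag` are hypothetical, p0 is treated as the label to be predicted, the KNN distance is a simple count of mismatching feature slots, and the toy English sentences are invented for demonstration and are not taken from the Arabic Quranic Corpus. With only two voters, majority voting decides only when both agree, so the sketch falls back to the first classifier on disagreement, a tie-breaking rule the abstract does not specify.

```python
from collections import Counter, defaultdict
import math

PAD = "<pad>"  # hypothetical placeholder for tags before the sentence start


def features(words, tags, i):
    """Return (w0, p-1, p-2, p-3) for position i; p0 is the label to predict."""
    def prev_tag(k):
        return tags[i - k] if i - k >= 0 else PAD
    return (words[i], prev_tag(1), prev_tag(2), prev_tag(3))


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.label_counts = Counter(y)
        self.value_counts = defaultdict(Counter)  # (label, slot) -> value counts
        self.values = defaultdict(set)            # slot -> distinct values seen
        for x, label in zip(X, y):
            for slot, value in enumerate(x):
                self.value_counts[(label, slot)][value] += 1
                self.values[slot].add(value)
        return self

    def predict(self, x):
        total = sum(self.label_counts.values())

        def log_posterior(label):
            n = self.label_counts[label]
            score = math.log(n / total)
            for slot, value in enumerate(x):
                seen = self.value_counts[(label, slot)][value]
                # The extra +1 in the denominator reserves mass for unseen values.
                score += math.log((seen + 1) / (n + len(self.values[slot]) + 1))
            return score

        return max(self.label_counts, key=log_posterior)


class KNN:
    """k-nearest neighbours using the number of mismatching slots as distance."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.data = list(zip(X, y))
        return self

    def predict(self, x):
        def distance(example):
            return sum(a != b for a, b in zip(example[0], x))
        nearest = sorted(self.data, key=distance)[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_vote(predictions, fallback_index=0):
    """Return the strict-majority label, else the fallback classifier's label."""
    label, count = Counter(predictions).most_common(1)[0]
    if count > len(predictions) / 2:
        return label
    return predictions[fallback_index]  # assumed tie-breaking rule


def train(sentences, k=3):
    X, y = [], []
    for words, tags in sentences:
        for i in range(len(words)):
            X.append(features(words, tags, i))
            y.append(tags[i])
    return [NaiveBayes().fit(X, y), KNN(k).fit(X, y)]


def tag(words, models):
    """Greedy left-to-right tagging: earlier predictions feed p-1, p-2, p-3."""
    tags = []
    for i in range(len(words)):
        x = features(words, tags, i)
        tags.append(majority_vote([m.predict(x) for m in models]))
    return tags


# Invented toy data, for illustration only.
TOY = [
    (["the", "dog", "runs"], ["DET", "N", "V"]),
    (["the", "cat", "sleeps"], ["DET", "N", "V"]),
    (["a", "dog", "sleeps"], ["DET", "N", "V"]),
]
models = train(TOY)
print(tag(["a", "cat", "runs"], models))  # prints ['DET', 'N', 'V']
```

Because the previous tags are themselves predictions at tagging time, an early error can propagate to later positions; a third classifier or a confidence-based tie-breaker would be natural alternatives to the fallback rule assumed here.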

Original language: English
Pages (from-to): 1865-1873
Number of pages: 9
Journal: Journal of Computer Science
Volume: 10
Issue number: 10
DOI: 10.3844/jcssp.2014.1865.1873
Publication status: Published - 2014

Fingerprint

Classifiers
Information retrieval systems
Linguistics

Keywords

  • Classification
  • Natural language processing
  • Part of speech

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

Arabic part of speech tagging using K-Nearest Neighbour and Naive Bayes classifiers combination. / Mahafdah, Rund; Omar, Nazlia; Al-Omari, Omaia.

In: Journal of Computer Science, Vol. 10, No. 10, 2014, p. 1865-1873.

@article{8535e0a1129f4055b45994e1757c2f93,
title = "{Arabic} part of speech tagging using {K-Nearest Neighbour} and {Naive Bayes} classifiers combination",
abstract = "Part-of-speech (POS) tagging is an important preprocessing step in many natural language processing applications, such as text summarization, question answering and information retrieval systems. It is the process of assigning every word in a given context to its appropriate part of speech. Various POS tagging techniques have been developed and evaluated in the literature. However, some POS tagging models are known to perform poorly on Quranic Arabic because of the complexity of the text. This complexity presents several challenges for POS tagging, such as high ambiguity, data sparseness and a large number of unknown words. With this in mind, the main problem is to find out how existing, efficient methods perform on Arabic and how the Quranic corpus can be used to produce an efficient framework for Arabic POS tagging. We propose an experimental classifier-combination framework for an Arabic POS tagger, selecting the two best diverse probabilistic classifiers used in numerous works on non-Arabic languages, namely K-Nearest Neighbour (KNN) and Naive Bayes (NB). Majority voting is used as the combination strategy to exploit the advantages of the classifiers. In addition, an in-depth study was conducted on a large list of features to identify effective features and investigate their role in enhancing the performance of POS taggers for Quranic Arabic. Hence, this study aims to efficiently integrate different feature sets and tagging algorithms to produce a more accurate POS tagging procedure. The data used in this study is the Arabic Quranic Corpus, an annotated linguistic resource of 77,430 words that provides the Arabic grammar, syntax and morphology of each word in the Holy Quran. The highest accuracy achieved is 98.32{\%}, which may be a significant improvement over the state of the art for Quranic Arabic text.
The most effective features that yield this accuracy are a combination of w0 (the current word), p0 (the POS of the current word), p-3 (the POS of the word three positions before), p-2 (the POS of the word two positions before) and p-1 (the POS of the word before).",
keywords = "Classification, Natural language processing, Part of speech",
author = "Rund Mahafdah and Nazlia Omar and Omaia Al-Omari",
year = "2014",
doi = "10.3844/jcssp.2014.1865.1873",
language = "English",
volume = "10",
pages = "1865--1873",
journal = "Journal of Computer Science",
issn = "1549-3636",
publisher = "Science Publications",
number = "10",

}

TY - JOUR

T1 - Arabic part of speech tagging using K-Nearest Neighbour and Naive Bayes classifiers combination

AU - Mahafdah, Rund

AU - Omar, Nazlia

AU - Al-Omari, Omaia

PY - 2014

Y1 - 2014

AB - Part-of-speech (POS) tagging is an important preprocessing step in many natural language processing applications, such as text summarization, question answering and information retrieval systems. It is the process of assigning every word in a given context to its appropriate part of speech. Various POS tagging techniques have been developed and evaluated in the literature. However, some POS tagging models are known to perform poorly on Quranic Arabic because of the complexity of the text. This complexity presents several challenges for POS tagging, such as high ambiguity, data sparseness and a large number of unknown words. With this in mind, the main problem is to find out how existing, efficient methods perform on Arabic and how the Quranic corpus can be used to produce an efficient framework for Arabic POS tagging. We propose an experimental classifier-combination framework for an Arabic POS tagger, selecting the two best diverse probabilistic classifiers used in numerous works on non-Arabic languages, namely K-Nearest Neighbour (KNN) and Naive Bayes (NB). Majority voting is used as the combination strategy to exploit the advantages of the classifiers. In addition, an in-depth study was conducted on a large list of features to identify effective features and investigate their role in enhancing the performance of POS taggers for Quranic Arabic. Hence, this study aims to efficiently integrate different feature sets and tagging algorithms to produce a more accurate POS tagging procedure. The data used in this study is the Arabic Quranic Corpus, an annotated linguistic resource of 77,430 words that provides the Arabic grammar, syntax and morphology of each word in the Holy Quran. The highest accuracy achieved is 98.32%, which may be a significant improvement over the state of the art for Quranic Arabic text.
The most effective features that yield this accuracy are a combination of w0 (the current word), p0 (the POS of the current word), p-3 (the POS of the word three positions before), p-2 (the POS of the word two positions before) and p-1 (the POS of the word before).

KW - Classification

KW - Natural language processing

KW - Part of speech

UR - http://www.scopus.com/inward/record.url?scp=84905233594&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84905233594&partnerID=8YFLogxK

U2 - 10.3844/jcssp.2014.1865.1873

DO - 10.3844/jcssp.2014.1865.1873

M3 - Article

AN - SCOPUS:84905233594

VL - 10

SP - 1865

EP - 1873

JO - Journal of Computer Science

JF - Journal of Computer Science

SN - 1549-3636

IS - 10

ER -