Statistical Bayesian learning for automatic Arabic text categorization

Bassam Al-Salemi, Mohd Juzaiddin Ab Aziz

Research output: Contribution to journal › Article

23 Citations (Scopus)

Abstract

Problem statement: The rapid growth of online Arabic documents calls for applying text categorization techniques, commonly used for English, to categorize them automatically. The complex morphology of Arabic and its large vocabulary make applying these techniques difficult and costly in time and effort. Approach: We investigated Bayesian learning models in order to improve Arabic automatic text categorization (ATC). Three classifiers based on Bayes' theorem were implemented: Simple Naïve Bayes (NB), Multivariate Bernoulli Naïve Bayes (MBNB) and Multinomial Naïve Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. For text representation, bag-of-words (BOW) and character-level 3-, 4- and 5-grams were used. To reduce the dimensionality of the feature space, we used several feature selection methods: Mutual Information (MI), Chi-Square statistic (CHI), Odds Ratio (OR) and GSS coefficient (GSS). Conclusion: The MBNB classifier outperforms both the NB and MNB classifiers. The BOW representation leads to the best classification performance; nevertheless, character-level n-grams lead to satisfactory results with Bayesian learning for Arabic ATC.
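As a rough illustration of two ingredients named in the abstract, the sketch below pairs character-level n-gram extraction with a multivariate Bernoulli Naïve Bayes classifier. It is not the authors' implementation: the function and class names, the Laplace smoothing constants and the toy Latin-script training data are assumptions made for this example, and stemming and feature selection are left out.

```python
import math
from collections import defaultdict


def char_ngrams(text, n):
    """Return the set of character n-grams in text (presence only)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class BernoulliNB:
    """Multivariate Bernoulli Naive Bayes with Laplace smoothing (sketch)."""

    def fit(self, docs, labels):
        # docs: list of feature sets; labels: list of class names
        self.vocab = set().union(*docs)
        self.classes = sorted(set(labels))
        n_docs = defaultdict(int)
        doc_freq = defaultdict(lambda: defaultdict(int))
        for feats, label in zip(docs, labels):
            n_docs[label] += 1
            for f in feats:
                doc_freq[label][f] += 1
        total = len(docs)
        self.log_prior = {c: math.log(n_docs[c] / total) for c in self.classes}
        # P(feature present | class), smoothed so it is never 0 or 1
        self.p = {
            c: {f: (doc_freq[c][f] + 1) / (n_docs[c] + 2) for f in self.vocab}
            for c in self.classes
        }
        return self

    def predict(self, feats):
        best, best_score = None, -math.inf
        for c in self.classes:
            score = self.log_prior[c]
            # Bernoulli model: absent vocabulary features also contribute
            for f in self.vocab:
                p = self.p[c][f]
                score += math.log(p if f in feats else 1.0 - p)
            if score > best_score:
                best, best_score = c, score
        return best


# Toy data, invented for illustration only
train = [
    ("football match goal", "sport"),
    ("goal keeper match", "sport"),
    ("stock market price", "economy"),
    ("market price bank", "economy"),
]
docs = [char_ngrams(text, 3) for text, _ in train]
model = BernoulliNB().fit(docs, [label for _, label in train])
prediction = model.predict(char_ngrams("bank price", 3))
print(prediction)
```

Unlike the multinomial model, which scores a document by term counts, the Bernoulli model records only whether a feature is present and lets absent vocabulary features lower a class's score.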

Original language: English
Pages (from-to): 39-45
Number of pages: 7
Journal: Journal of Computer Science
Volume: 7
Issue number: 1
DOIs: 10.3844/jcssp.2011.39.45
Publication status: Published - 2011

Fingerprint

Classifiers
Feature extraction
Statistics

Keywords

  • Arabic text categorization
  • Automatic text categorization
  • Bayesian learning
  • Feature selection (FS)
  • Information gain (IG)
  • Mutual information (MI)
  • Odds ratio (OR)

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

Statistical Bayesian learning for automatic Arabic text categorization. / Al-Salemi, Bassam; Ab Aziz, Mohd Juzaiddin.

In: Journal of Computer Science, Vol. 7, No. 1, 2011, p. 39-45.

@article{f17416ec83c340f4849cb1e76dd61037,
title = "Statistical Bayesian learning for automatic Arabic text categorization",
abstract = "Problem statement: The rapid growth of online Arabic documents calls for applying text categorization techniques, commonly used for English, to categorize them automatically. The complex morphology of Arabic and its large vocabulary make applying these techniques difficult and costly in time and effort. Approach: We investigated Bayesian learning models in order to improve Arabic automatic text categorization (ATC). Three classifiers based on Bayes' theorem were implemented: Simple Na{\"i}ve Bayes (NB), Multivariate Bernoulli Na{\"i}ve Bayes (MBNB) and Multinomial Na{\"i}ve Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. For text representation, bag-of-words (BOW) and character-level 3-, 4- and 5-grams were used. To reduce the dimensionality of the feature space, we used several feature selection methods: Mutual Information (MI), Chi-Square statistic (CHI), Odds Ratio (OR) and GSS coefficient (GSS). Conclusion: The MBNB classifier outperforms both the NB and MNB classifiers. The BOW representation leads to the best classification performance; nevertheless, character-level n-grams lead to satisfactory results with Bayesian learning for Arabic ATC.",
keywords = "Arabic text categorization, Automatic text categorization, Bayesian learning, Feature selection (FS), Information gain (IG), Mutual information (MI), Odds ratio (OR)",
author = "Bassam Al-Salemi and {Ab Aziz}, {Mohd Juzaiddin}",
year = "2011",
doi = "10.3844/jcssp.2011.39.45",
language = "English",
volume = "7",
pages = "39--45",
journal = "Journal of Computer Science",
issn = "1549-3636",
publisher = "Science Publications",
number = "1",

}

TY - JOUR

T1 - Statistical Bayesian learning for automatic Arabic text categorization

AU - Al-Salemi, Bassam

AU - Ab Aziz, Mohd Juzaiddin

PY - 2011

Y1 - 2011

AB - Problem statement: The rapid growth of online Arabic documents calls for applying text categorization techniques, commonly used for English, to categorize them automatically. The complex morphology of Arabic and its large vocabulary make applying these techniques difficult and costly in time and effort. Approach: We investigated Bayesian learning models in order to improve Arabic automatic text categorization (ATC). Three classifiers based on Bayes' theorem were implemented: Simple Naïve Bayes (NB), Multivariate Bernoulli Naïve Bayes (MBNB) and Multinomial Naïve Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. For text representation, bag-of-words (BOW) and character-level 3-, 4- and 5-grams were used. To reduce the dimensionality of the feature space, we used several feature selection methods: Mutual Information (MI), Chi-Square statistic (CHI), Odds Ratio (OR) and GSS coefficient (GSS). Conclusion: The MBNB classifier outperforms both the NB and MNB classifiers. The BOW representation leads to the best classification performance; nevertheless, character-level n-grams lead to satisfactory results with Bayesian learning for Arabic ATC.

KW - Arabic text categorization

KW - Automatic text categorization

KW - Bayesian learning

KW - Feature selection (FS)

KW - Information gain (IG)

KW - Mutual information (MI)

KW - Odds ratio (OR)

UR - http://www.scopus.com/inward/record.url?scp=79251506653&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79251506653&partnerID=8YFLogxK

U2 - 10.3844/jcssp.2011.39.45

DO - 10.3844/jcssp.2011.39.45

M3 - Article

AN - SCOPUS:79251506653

VL - 7

SP - 39

EP - 45

JO - Journal of Computer Science

JF - Journal of Computer Science

SN - 1549-3636

IS - 1

ER -