Bayesian learning for automatic Arabic text categorization

Mahmood H. Kadhim, Nazlia Omar

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem, but the complex morphology of Arabic and its large vocabulary make applying them difficult and costly in time and effort. We investigate Bayesian learning, which is based on Bayes' theorem, for the Arabic ATC problem. The Bayesian classifiers applied are Multivariate Gauss Naïve Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naïve Bayes (MBNB), and Multinomial Naïve Bayes (MNB). Documents are represented with word-level N-grams (1-gram, 2-gram, and 3-gram). For Arabic stemming, a simple stemmer, the TREC-2002 Light Stemmer, is used in the prototype. For feature selection, several techniques are applied: the Chi-Square Statistic (CHI), Odds Ratio (OR), Mutual Information (MI), and the GSS Coefficient (GSS). The results show that FB outperforms MNB, MBNB, and MGNB, and that word-level n-grams combined with Bayesian learning yield acceptable results for Arabic ATC.
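
As an illustration of the kind of pipeline the abstract describes, the sketch below combines a word-level n-gram representation, chi-square (CHI) feature selection, and a Multinomial Naïve Bayes (MNB) classifier using scikit-learn. It is not the authors' implementation: the corpus, category names, and the number of selected features are placeholders, and Arabic light stemming (such as the TREC-2002 Light Stemmer) would be applied as a separate preprocessing step before vectorization.

# Illustrative sketch only, not the paper's implementation: word-level n-gram
# bag-of-words, chi-square feature selection, and Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder training data standing in for light-stemmed Arabic documents.
train_docs = [
    "match team goal stadium win",
    "market price trade bank economy",
    "player coach league score match",
    "inflation currency export bank trade",
]
train_labels = ["sports", "economy", "sports", "economy"]  # assumed category names

pipeline = Pipeline([
    # Word-level 1-grams; set ngram_range=(2, 2) or (3, 3) for the 2-gram / 3-gram runs.
    ("vectorizer", CountVectorizer(analyzer="word", ngram_range=(1, 1))),
    # Chi-square (CHI) feature selection; the number of kept features is an assumed value.
    ("selector", SelectKBest(chi2, k=8)),
    # Multinomial Naive Bayes (MNB) classifier.
    ("classifier", MultinomialNB()),
])

pipeline.fit(train_docs, train_labels)
print(pipeline.predict(["team win score in the league match"]))

Swapping MultinomialNB for BernoulliNB would approximate the multivariate Bernoulli (MBNB) setting; the flexible (kernel-density) Bayes variant reported best in the abstract has no direct scikit-learn equivalent.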

Original language: English
Pages (from-to): 1-8
Number of pages: 8
Journal: Journal of Next Generation Information Technology
Volume: 4
Issue number: 3
DOIs: 10.4156/jnit.vol4.issue3.1
Publication status: Published - May 2013

Keywords

  • Arabic text categorization
  • Automatic text categorization
  • Bayesian learning
  • Feature selection (FS)

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Bayesian learning for automatic Arabic text categorization. / Kadhim, Mahmood H.; Omar, Nazlia.

In: Journal of Next Generation Information Technology, Vol. 4, No. 3, 05.2013, p. 1-8.

Research output: Contribution to journal › Article

@article{1347604a53f54e568ef572c690eb5eed,
title = "Bayesian learning for automatic Arabic text categorization",
abstract = "Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem, but the complex morphology of Arabic and its large vocabulary make applying them difficult and costly in time and effort. We investigate Bayesian learning, which is based on Bayes' theorem, for the Arabic ATC problem. The Bayesian classifiers applied are Multivariate Gauss Na{\"i}ve Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Na{\"i}ve Bayes (MBNB), and Multinomial Na{\"i}ve Bayes (MNB). Documents are represented with word-level N-grams (1-gram, 2-gram, and 3-gram). For Arabic stemming, a simple stemmer, the TREC-2002 Light Stemmer, is used in the prototype. For feature selection, several techniques are applied: the Chi-Square Statistic (CHI), Odds Ratio (OR), Mutual Information (MI), and the GSS Coefficient (GSS). The results show that FB outperforms MNB, MBNB, and MGNB, and that word-level n-grams combined with Bayesian learning yield acceptable results for Arabic ATC.",
keywords = "Arabic text categorization, Automatic text categorization, Bayesian learning, Feature selection (FS)",
author = "Kadhim, {Mahmood H.} and Nazlia Omar",
year = "2013",
month = "5",
doi = "10.4156/jnit.vol4.issue3.1",
language = "English",
volume = "4",
pages = "1--8",
journal = "Journal of Next Generation Information Technology",
issn = "2092-8637",
publisher = "Advanced Institute of Convergence Information Technology Research Center",
number = "3",

}
