Automatic Arabic text categorization using Bayesian learning

Mahmood H. Kadhim, Nazlia Omar

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. The complex morphology of the Arabic language and its large vocabulary make these techniques difficult to apply and costly in time and effort. We investigated Bayesian learning, which is based on Bayes' theorem, for the Arabic ATC problem. The Bayesian classifiers applied are Multivariate Gauss Naïve Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naïve Bayes (MBNB), and Multinomial Naïve Bayes (MNB). Text is represented with word-level n-grams: 1-grams, 2-grams, and 3-grams. For Arabic stemming, the prototype uses a simple stemmer, the TREC-2002 Light Stemmer. For feature selection we used four techniques: the Chi-Square Statistic (CHI), Odds Ratio (OR), Mutual Information (MI), and the GSS Coefficient (GSS). The results showed that FB outperforms MNB, MBNB, and MGNB. The experimental results indicate that word-level n-grams for ATC based on Bayesian learning lead to acceptable results.
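The pipeline the abstract describes (word-level n-grams, feature selection, a Bayesian classifier) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it covers one of the four selection measures (chi-square) and one of the four classifiers (multinomial Naïve Bayes), skips stemming, and uses add-one smoothing. The function names, the toy English tokens, and the cut-off of 7 features are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(tokens, n):
    """Return the word-level n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def chi_square(docs, labels, term, category):
    """Chi-square statistic of a term/category pair from a 2x2 document table."""
    a = b = c = d = 0
    for feats, label in zip(docs, labels):
        present = term in feats
        if present and label == category:
            a += 1  # term present, document in category
        elif present:
            b += 1  # term present, document in another category
        elif label == category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def train_multinomial_nb(docs, labels, vocab):
    """Estimate log priors and add-one-smoothed log likelihoods over vocab."""
    counts = defaultdict(Counter)
    class_docs = Counter(labels)
    for feats, label in zip(docs, labels):
        counts[label].update(f for f in feats if f in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        log_like = {t: math.log((counts[label][t] + 1) / (total + len(vocab)))
                    for t in vocab}
        model[label] = (log_prior, log_like)
    return model


def predict(model, feats):
    """Return the category with the highest posterior log score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[f] for f in feats if f in log_like)
    return max(model, key=score)


# Toy, pre-tokenized documents (hypothetical; a real run would use stemmed Arabic text).
train = [
    ("team wins football match".split(), "sports"),
    ("football match ends in draw".split(), "sports"),
    ("bank raises interest rate".split(), "economy"),
    ("central bank cuts interest rate".split(), "economy"),
]
docs = [word_ngrams(t, 1) + word_ngrams(t, 2) for t, _ in train]
labels = [label for _, label in train]

# Score each term by its best chi-square value over all categories, keep the top k.
all_terms = {f for feats in docs for f in feats}
scores = {t: max(chi_square(docs, labels, t, c) for c in set(labels))
          for t in all_terms}
vocab = set(sorted(scores, key=scores.get, reverse=True)[:7])

model = train_multinomial_nb(docs, labels, vocab)
test_tokens = "bank keeps interest rate".split()
prediction = predict(model, word_ngrams(test_tokens, 1) + word_ngrams(test_tokens, 2))
print(prediction)
```

Swapping `chi_square` for another scoring function with the same signature would be one way to compare selection measures such as Odds Ratio or Mutual Information under the same classifier.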

Original language: English
Title of host publication: Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012
Pages: 415-419
Number of pages: 5
Publication status: Published - 2012
Event: 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012 - Seoul
Duration: 3 Dec 2012 to 5 Dec 2012

Other

Other: 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012
City: Seoul
Period: 3/12/12 to 5/12/12

Fingerprint

Feature extraction
Learning systems
Classifiers
Statistics

Keywords

  • Arabic text categorization
  • automatic text categorization
  • Bayesian learning
  • Feature Selection (FS)

ASJC Scopus subject areas

  • Computer Science Applications
  • Software

Cite this

Kadhim, M. H., & Omar, N. (2012). Automatic Arabic text categorization using Bayesian learning. In Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012 (pp. 415-419). [6530369]

@inproceedings{a74249a29b6a420083d269f3a49904b5,
title = "Automatic Arabic text categorization using Bayesian learning",
keywords = "Arabic text categorization, automatic text categorization, Bayesian learning, Feature Selection (FS)",
author = "Kadhim, {Mahmood H.} and Nazlia Omar",
year = "2012",
language = "English",
isbn = "9788994364216",
pages = "415--419",
booktitle = "Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012",

}

TY - GEN

T1 - Automatic Arabic text categorization using Bayesian learning

AU - Kadhim, Mahmood H.

AU - Omar, Nazlia

PY - 2012

Y1 - 2012

KW - Arabic text categorization

KW - automatic text categorization

KW - Bayesian learning

KW - Feature Selection (FS)

UR - http://www.scopus.com/inward/record.url?scp=84881122174&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84881122174&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9788994364216

SP - 415

EP - 419

BT - Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012

ER -