### Abstract

Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. The complex morphology of the Arabic language and its large vocabulary make these techniques difficult to apply and costly in time and effort. We investigated Bayesian learning, which is based on Bayes' theorem, for the Arabic ATC problem. The Bayesian classifiers applied are Multivariate Gauss Naïve Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naïve Bayes (MBNB), and Multinomial Naïve Bayes (MNB). Text is represented with word-level n-grams: 1-grams, 2-grams, and 3-grams. For Arabic stemming, the prototype uses a simple stemmer, the TREC-2002 Light Stemmer. For feature selection, we used the Chi-Square Statistic (CHI), Odds Ratio (OR), Mutual Information (MI), and GSS Coefficient (GSS). The results showed that FB outperforms MNB, MBNB, and MGNB. The experimental results indicate that word-level n-grams for ATC based on Bayesian learning lead to acceptable results.
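To make the combination of word-level n-grams and Multinomial Naïve Bayes concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, add-one (Laplace) smoothing, and a made-up English toy corpus; it omits stemming and feature selection entirely. The names `word_ngrams`, `features`, and `MultinomialNB` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Return the word-level n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Concatenate 1-grams up to max_n-grams into one feature list."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(word_ngrams(text, n))
    return feats


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (assumed)."""

    def fit(self, docs, labels, max_n=2):
        self.max_n = max_n
        self.class_docs = Counter(labels)     # documents per class
        self.counts = defaultdict(Counter)    # feature counts per class
        self.vocab = set()
        for text, label in zip(docs, labels):
            feats = features(text, max_n)
            self.counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for f in features(text, self.max_n):
                if f in self.vocab:  # features unseen in training are ignored
                    score += math.log((self.counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data (placeholders, not from the paper).
train_docs = [
    "team wins football match",
    "football match ends in draw",
    "bank raises interest rate",
    "central bank cuts interest rate",
]
train_labels = ["sport", "sport", "economy", "economy"]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict("football match tonight"))
print(model.predict("interest rate decision"))
```

In a setting like the one the abstract describes, the tokens would first pass through a stemmer and the feature list would be pruned by a selection score such as CHI before training; both steps are left out here to keep the sketch short.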

Original language | English |
---|---|
Title of host publication | Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012 |
Pages | 415-419 |
Number of pages | 5 |
Publication status | Published - 2012 |
Event | 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012 - Seoul; Duration: 3 Dec 2012 → 5 Dec 2012 |

### Other

Other | 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012 |
---|---|
City | Seoul |
Period | 3/12/12 → 5/12/12 |

### Keywords

- Arabic text categorization
- automatic text categorization
- Bayesian learning
- Feature Selection (FS)

### ASJC Scopus subject areas

- Computer Science Applications
- Software

### Cite this

*Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012* (pp. 415-419). [6530369]

**Automatic Arabic text categorization using Bayesian learning.** / Kadhim, Mahmood H.; Omar, Nazlia.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

*Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012.*, 6530369, pp. 415-419, 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012, Seoul, 3/12/12.

TY - GEN

T1 - Automatic Arabic text categorization using Bayesian learning

AU - Kadhim, Mahmood H.

AU - Omar, Nazlia

PY - 2012

Y1 - 2012

KW - Arabic text categorization

KW - automatic text categorization

KW - Bayesian learning

KW - Feature Selection (FS)

UR - http://www.scopus.com/inward/record.url?scp=84881122174&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84881122174&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9788994364216

SP - 415

EP - 419

BT - Proceedings - 2012 7th International Conference on Computing and Convergence Technology (ICCIT, ICEI and ICACT), ICCCT 2012

ER -