An automated Arabic Text Categorization based on the Frequency Ratio Accumulation

Baraa Sharef, Nazlia Omar, Zeyad Sharef

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

Compared with other languages, research on automated Arabic Text Categorization (TC) remains limited, owing to the complex and rich nature of the Arabic language. Most of this research uses supervised Machine Learning (ML) approaches such as Naïve Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machines and Decision Trees. Most of these techniques have complex mathematical models and do not usually lead to accurate results for Arabic TC. Moreover, previous research has tended to treat Feature Selection (FS) and classification as independent problems in automatic TC, which increases cost and computational complexity. This creates a need for new techniques suited to the Arabic language and its complex morphology. This study applies an approach that is new to Arabic TC, the Frequency Ratio Accumulation Method (FRAM), which has a simple mathematical model. In FRAM, the categorization task is combined with the feature processing task. The main aim is to address automatic Arabic TC by investigating FRAM as a way to improve the performance of Arabic TC models. The performance of the FRAM classifier is compared with three classifiers based on Bayes' theorem: Simple NB, Multivariate Bernoulli Naïve Bayes (MBNB) and Multinomial Naïve Bayes (MNB). In the study's experiments, FRAM outperformed the compared state-of-the-art classifiers, achieving a macro-F1 of 95.1% with a unigram word-level representation.
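The abstract does not give FRAM's formulas, so the following Python sketch is only an illustration of how a frequency-ratio-accumulation classifier is commonly formulated, not the authors' implementation. It assumes that a term's ratio within a category is its count divided by the category's total term count, that the frequency ratio normalizes this value across categories, and that a document is assigned to the category with the largest accumulated ratio over its unigram tokens. The function names `train_fram` and `classify_fram` and the toy English tokens are hypothetical.

```python
from collections import Counter, defaultdict


def train_fram(docs):
    """Build frequency ratios from (tokens, label) pairs.

    Returns {term: {label: frequency_ratio}}. The formulas are an
    assumption; the abstract does not state them.
    """
    term_counts = defaultdict(Counter)  # label -> term -> raw count
    for tokens, label in docs:
        term_counts[label].update(tokens)

    # Within-category ratio: R(t, c) = f(t, c) / sum over t' of f(t', c)
    ratios = defaultdict(dict)  # term -> label -> R(t, c)
    for label, counts in term_counts.items():
        total = sum(counts.values())
        for term, n in counts.items():
            ratios[term][label] = n / total

    # Frequency ratio: FR(t, c) = R(t, c) / sum over c' of R(t, c')
    fr = {}
    for term, by_label in ratios.items():
        norm = sum(by_label.values())
        fr[term] = {label: r / norm for label, r in by_label.items()}
    return fr


def classify_fram(fr, tokens, labels):
    """Accumulate frequency ratios per category and return the best label."""
    scores = {label: 0.0 for label in labels}
    for term in tokens:
        # Terms unseen in training contribute nothing to any category.
        for label, value in fr.get(term, {}).items():
            scores[label] += value
    return max(scores, key=scores.get)


# Toy training data; a real run would use tokenized Arabic documents.
training = [
    (["goal", "match", "team"], "sports"),
    (["market", "bank", "team"], "economy"),
]
model = train_fram(training)
predicted = classify_fram(model, ["goal", "team"], ["sports", "economy"])
```

Under these assumptions every term contributes to the score, so no separate feature selection pass is needed, which is consistent with the abstract's remark that categorization is combined with feature processing.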

Original language: English
Pages (from-to): 213-221
Number of pages: 9
Journal: International Arab Journal of Information Technology
Volume: 11
Issue number: 2
Publication status: Published - 2014

Keywords

  • Arabic TC
  • Automatic TC
  • FRAM
  • Text classification

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

An automated Arabic Text Categorization based on the Frequency Ratio Accumulation. / Sharef, Baraa; Omar, Nazlia; Sharef, Zeyad.

In: International Arab Journal of Information Technology, Vol. 11, No. 2, 2014, p. 213-221.

@article{813919495197406bad68cc885ddc1a4d,
title = "An automated Arabic Text Categorization based on the Frequency Ratio Accumulation",
abstract = "Compared with other languages, research on automated Arabic Text Categorization (TC) remains limited, owing to the complex and rich nature of the Arabic language. Most of this research uses supervised Machine Learning (ML) approaches such as Na{\"i}ve Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machines and Decision Trees. Most of these techniques have complex mathematical models and do not usually lead to accurate results for Arabic TC. Moreover, previous research has tended to treat Feature Selection (FS) and classification as independent problems in automatic TC, which increases cost and computational complexity. This creates a need for new techniques suited to the Arabic language and its complex morphology. This study applies an approach that is new to Arabic TC, the Frequency Ratio Accumulation Method (FRAM), which has a simple mathematical model. In FRAM, the categorization task is combined with the feature processing task. The main aim is to address automatic Arabic TC by investigating FRAM as a way to improve the performance of Arabic TC models. The performance of the FRAM classifier is compared with three classifiers based on Bayes' theorem: Simple NB, Multivariate Bernoulli Na{\"i}ve Bayes (MBNB) and Multinomial Na{\"i}ve Bayes (MNB). In the study's experiments, FRAM outperformed the compared state-of-the-art classifiers, achieving a macro-F1 of 95.1{\%} with a unigram word-level representation.",
keywords = "Arabic TC, Automatic TC, FRAM, Text classification",
author = "Baraa Sharef and Nazlia Omar and Zeyad Sharef",
year = "2014",
language = "English",
volume = "11",
pages = "213--221",
journal = "International Arab Journal of Information Technology",
issn = "1683-3198",
publisher = "Zarqa University",
number = "2",
}

TY - JOUR

T1 - An automated Arabic Text Categorization based on the Frequency Ratio Accumulation

AU - Sharef, Baraa

AU - Omar, Nazlia

AU - Sharef, Zeyad

PY - 2014

Y1 - 2014

AB - Compared with other languages, research on automated Arabic Text Categorization (TC) remains limited, owing to the complex and rich nature of the Arabic language. Most of this research uses supervised Machine Learning (ML) approaches such as Naïve Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machines and Decision Trees. Most of these techniques have complex mathematical models and do not usually lead to accurate results for Arabic TC. Moreover, previous research has tended to treat Feature Selection (FS) and classification as independent problems in automatic TC, which increases cost and computational complexity. This creates a need for new techniques suited to the Arabic language and its complex morphology. This study applies an approach that is new to Arabic TC, the Frequency Ratio Accumulation Method (FRAM), which has a simple mathematical model. In FRAM, the categorization task is combined with the feature processing task. The main aim is to address automatic Arabic TC by investigating FRAM as a way to improve the performance of Arabic TC models. The performance of the FRAM classifier is compared with three classifiers based on Bayes' theorem: Simple NB, Multivariate Bernoulli Naïve Bayes (MBNB) and Multinomial Naïve Bayes (MNB). In the study's experiments, FRAM outperformed the compared state-of-the-art classifiers, achieving a macro-F1 of 95.1% with a unigram word-level representation.

KW - Arabic TC

KW - Automatic TC

KW - FRAM

KW - Text classification

UR - http://www.scopus.com/inward/record.url?scp=84899850058&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84899850058&partnerID=8YFLogxK

M3 - Article

VL - 11

SP - 213

EP - 221

JO - International Arab Journal of Information Technology

JF - International Arab Journal of Information Technology

SN - 1683-3198

IS - 2

ER -