Hybrid statistical rule-based classifier for Arabic text mining

Abdullah S. Ghareb, Abdul Razak Hamdan, Azuraliza Abu Bakar, Mohd Ridzwan Yaakub

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

Text categorization is one of the key technologies for organizing digital datasets. Naive Bayes (NB) is a popular categorization method because of its efficiency and low time complexity, and the Associative Classification (AC) approach can produce classifiers that rival those learned by traditional categorization techniques. However, the independence assumption for text features and the omission of feature frequencies in the NB method degrade its performance when the selected features are not highly correlated with the text categories. Likewise, the main problem of AC is its lack of effective discovery and use of categorization rules, and its performance declines with large rule sets. This paper proposes a hybrid categorization method for Arabic text mining that combines the merits of a statistical classifier (NB) and a rule-based classifier (AC) in one framework and attempts to overcome their limitations. In the first stage, useful categorization rules are discovered using the AC approach, which ensures that the associated features are highly correlated with their categories. In the second stage, NB is applied at the back end of the discovery process: it takes the discovered rules, concatenates the associated features for each category, and classifies texts based on the statistical information of the associated features. The proposed method was evaluated on three Arabic text datasets with multiple categories, with and without feature selection methods. The experimental results showed that the hybrid method outperforms AC alone, with or without feature selection, and that it is better than NB only in a few cases, with some feature selection methods, when the selected feature subset was small.
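
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the single-term rule form, the support and confidence thresholds, the Laplace smoothing, the function names (`mine_rules`, `train_nb`, `classify`), and the toy English tokens are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def mine_rules(docs, labels, min_support=0.5, min_confidence=0.6):
    """Stage 1 (assumed form): keep single-term rules term -> category
    whose support and confidence both reach the thresholds."""
    n_docs = len(docs)
    term_df = Counter()        # number of documents containing the term
    term_cat_df = Counter()    # same count, split by category
    for tokens, cat in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_cat_df[(term, cat)] += 1
    rules = defaultdict(set)
    for (term, cat), count in term_cat_df.items():
        support = count / n_docs
        confidence = count / term_df[term]
        if support >= min_support and confidence >= min_confidence:
            rules[cat].add(term)
    return dict(rules)


def train_nb(docs, labels, rules):
    """Stage 2 (assumed form): multinomial NB with Laplace smoothing,
    restricted to the features that appear in the discovered rules."""
    vocab = set().union(*rules.values())
    docs_per_cat = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, cat in zip(docs, labels):
        term_counts[cat].update(t for t in tokens if t in vocab)
    model = {}
    for cat, n in docs_per_cat.items():
        total = sum(term_counts[cat].values())
        log_prior = math.log(n / len(docs))
        log_like = {
            t: math.log((term_counts[cat][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[cat] = (log_prior, log_like)
    return model


def classify(tokens, model):
    """Pick the category with the highest log posterior; tokens that are
    not rule features are ignored."""
    def score(cat):
        log_prior, log_like = model[cat]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)


# Toy data (hypothetical placeholder tokens, not from the paper's datasets).
docs = [
    ["match", "goal", "team"],
    ["team", "goal", "win"],
    ["market", "bank", "price"],
    ["bank", "price", "team"],
]
labels = ["sport", "sport", "economy", "economy"]

rules = mine_rules(docs, labels)
model = train_nb(docs, labels, rules)
print(classify(["goal", "team", "referee"], model))
print(classify(["bank", "price"], model))
```

In this sketch the rule-mining stage acts as a feature filter, so the NB stage only sees terms that are strongly tied to at least one category. How the paper actually forms, ranks, and prunes rules is not stated in the abstract.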

Original language: English
Pages (from-to): 194-204
Number of pages: 11
Journal: Journal of Theoretical and Applied Information Technology
Volume: 71
Issue number: 2
Publication status: Published - 20 Jan 2015

Keywords

  • Associative Classification
  • Feature Selection
  • Hybrid Categorization Method
  • Naive Bayes
  • Text Categorization

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Hybrid statistical rule-based classifier for Arabic text mining. / Ghareb, Abdullah S.; Hamdan, Abdul Razak; Abu Bakar, Azuraliza; Yaakub, Mohd Ridzwan.

In: Journal of Theoretical and Applied Information Technology, Vol. 71, No. 2, 20.01.2015, p. 194-204.

@article{1c28ae1614654b299c3e9b72579aeb4b,
title = "Hybrid statistical rule-based classifier for Arabic text mining",
abstract = "Text categorization is one of the key technologies for organizing digital datasets. Naive Bayes (NB) is a popular categorization method because of its efficiency and low time complexity, and the Associative Classification (AC) approach can produce classifiers that rival those learned by traditional categorization techniques. However, the independence assumption for text features and the omission of feature frequencies in the NB method degrade its performance when the selected features are not highly correlated with the text categories. Likewise, the main problem of AC is its lack of effective discovery and use of categorization rules, and its performance declines with large rule sets. This paper proposes a hybrid categorization method for Arabic text mining that combines the merits of a statistical classifier (NB) and a rule-based classifier (AC) in one framework and attempts to overcome their limitations. In the first stage, useful categorization rules are discovered using the AC approach, which ensures that the associated features are highly correlated with their categories. In the second stage, NB is applied at the back end of the discovery process: it takes the discovered rules, concatenates the associated features for each category, and classifies texts based on the statistical information of the associated features. The proposed method was evaluated on three Arabic text datasets with multiple categories, with and without feature selection methods. The experimental results showed that the hybrid method outperforms AC alone, with or without feature selection, and that it is better than NB only in a few cases, with some feature selection methods, when the selected feature subset was small.",
keywords = "Associative Classification, Feature Selection, Hybrid Categorization Method, Naive Bayes, Text Categorization",
author = "Ghareb, {Abdullah S.} and Hamdan, {Abdul Razak} and {Abu Bakar}, Azuraliza and Yaakub, {Mohd Ridzwan}",
year = "2015",
month = "1",
day = "20",
language = "English",
volume = "71",
pages = "194--204",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "2",
}

TY - JOUR

T1 - Hybrid statistical rule-based classifier for Arabic text mining

AU - Ghareb, Abdullah S.

AU - Hamdan, Abdul Razak

AU - Abu Bakar, Azuraliza

AU - Yaakub, Mohd Ridzwan

PY - 2015/1/20

Y1 - 2015/1/20

N2 - Text categorization is one of the key technologies for organizing digital datasets. Naive Bayes (NB) is a popular categorization method because of its efficiency and low time complexity, and the Associative Classification (AC) approach can produce classifiers that rival those learned by traditional categorization techniques. However, the independence assumption for text features and the omission of feature frequencies in the NB method degrade its performance when the selected features are not highly correlated with the text categories. Likewise, the main problem of AC is its lack of effective discovery and use of categorization rules, and its performance declines with large rule sets. This paper proposes a hybrid categorization method for Arabic text mining that combines the merits of a statistical classifier (NB) and a rule-based classifier (AC) in one framework and attempts to overcome their limitations. In the first stage, useful categorization rules are discovered using the AC approach, which ensures that the associated features are highly correlated with their categories. In the second stage, NB is applied at the back end of the discovery process: it takes the discovered rules, concatenates the associated features for each category, and classifies texts based on the statistical information of the associated features. The proposed method was evaluated on three Arabic text datasets with multiple categories, with and without feature selection methods. The experimental results showed that the hybrid method outperforms AC alone, with or without feature selection, and that it is better than NB only in a few cases, with some feature selection methods, when the selected feature subset was small.

KW - Associative Classification

KW - Feature Selection

KW - Hybrid Categorization Method

KW - Naive Bayes

KW - Text Categorization

UR - http://www.scopus.com/inward/record.url?scp=84921265534&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921265534&partnerID=8YFLogxK

M3 - Article

VL - 71

SP - 194

EP - 204

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 2

ER -