A comparative study of combined feature selection methods for Arabic text classification

Aisha Adel, Nazlia Omar, Adel Al-Shabi

Research output: Contribution to journal (Article)

8 Citations (Scopus)

Abstract

Text classification is an important task because of the huge volume of electronic documents. One of its main problems is the high dimensionality of the feature space. Researchers have proposed many algorithms to select relevant features from text. These algorithms have been studied extensively for English text, while studies for Arabic are still limited. This study investigates the performance of five widely used feature selection methods, namely Chi-square, Correlation, GSS Coefficient, Information Gain and Relief F. In addition, it introduces an approach that combines feature selection methods based on the average weight of the features. The experiments are conducted using Naïve Bayes and Support Vector Machine classifiers to classify a published Arabic corpus. The results show that, among the individual methods, Information Gain performs best. The results also show that the combination of multiple feature selection methods outperforms the best results obtained by the individual methods.
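The abstract describes the combination approach only as being "based on the average weight of the features". The following is a minimal sketch of one way such a combination could work, not the authors' implementation. It assumes that each method's scores are min-max normalized before averaging, and that a feature missing from one method counts as 0 for that method; neither detail is stated in the abstract. The function names, the terms and the score values are all hypothetical.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one method's scores to [0, 1] so that methods are comparable.

    Assumption: min-max scaling; the abstract does not name a scheme.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Every feature received the same score, so none stands out.
        return {term: 0.0 for term in scores}
    return {term: (s - lo) / (hi - lo) for term, s in scores.items()}


def average_weights(per_method: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the normalized weight of each feature across all methods."""
    normalized = [min_max_normalize(m) for m in per_method]
    terms = set().union(*normalized)
    # Assumption: a feature a method did not score contributes 0.0.
    return {
        t: sum(n.get(t, 0.0) for n in normalized) / len(normalized)
        for t in terms
    }


def top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Return the k features with the highest combined weight."""
    # Ties are broken alphabetically to keep the ranking deterministic.
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


# Hypothetical scores from two methods, for illustration only.
chi_square = {"football": 10.0, "election": 4.0, "today": 0.0}
info_gain = {"football": 0.2, "election": 0.5, "today": 0.1}

combined = average_weights([chi_square, info_gain])
selected = top_k(combined, 2)
print(selected)
```

Normalizing first keeps a method with a large raw score range, such as Chi-square, from dominating the average. The selected features would then be passed to a classifier such as Naïve Bayes or a Support Vector Machine.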

Original language: English
Pages (from-to): 2232-2239
Number of pages: 8
Journal: Journal of Computer Science
Volume: 10
Issue number: 11
DOI: 10.3844/jcssp.2014.2232.2239
Publication status: Published - 2014

Fingerprint

Feature extraction
Support vector machines
Classifiers
Experiments

Keywords

  • Arabic text classification
  • Combination method
  • Feature selection

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

A comparative study of combined feature selection methods for Arabic text classification. / Adel, Aisha; Omar, Nazlia; Al-Shabi, Adel.

In: Journal of Computer Science, Vol. 10, No. 11, 2014, p. 2232-2239.

@article{726975af3a2b4b2c913b4f7b6406c9a8,
title = "A comparative study of combined feature selection methods for Arabic text classification",
abstract = "Text classification is an important task because of the huge volume of electronic documents. One of its main problems is the high dimensionality of the feature space. Researchers have proposed many algorithms to select relevant features from text. These algorithms have been studied extensively for English text, while studies for Arabic are still limited. This study investigates the performance of five widely used feature selection methods, namely Chi-square, Correlation, GSS Coefficient, Information Gain and Relief F. In addition, it introduces an approach that combines feature selection methods based on the average weight of the features. The experiments are conducted using Na{\"i}ve Bayes and Support Vector Machine classifiers to classify a published Arabic corpus. The results show that, among the individual methods, Information Gain performs best. The results also show that the combination of multiple feature selection methods outperforms the best results obtained by the individual methods.",
keywords = "Arabic text classification, Combination method, Feature selection",
author = "Aisha Adel and Nazlia Omar and Adel Al-Shabi",
year = "2014",
doi = "10.3844/jcssp.2014.2232.2239",
language = "English",
volume = "10",
pages = "2232--2239",
journal = "Journal of Computer Science",
issn = "1549-3636",
publisher = "Science Publications",
number = "11",

}

TY  - JOUR
T1  - A comparative study of combined feature selection methods for Arabic text classification
AU  - Adel, Aisha
AU  - Omar, Nazlia
AU  - Al-Shabi, Adel
PY  - 2014
Y1  - 2014
AB  - Text classification is an important task because of the huge volume of electronic documents. One of its main problems is the high dimensionality of the feature space. Researchers have proposed many algorithms to select relevant features from text. These algorithms have been studied extensively for English text, while studies for Arabic are still limited. This study investigates the performance of five widely used feature selection methods, namely Chi-square, Correlation, GSS Coefficient, Information Gain and Relief F. In addition, it introduces an approach that combines feature selection methods based on the average weight of the features. The experiments are conducted using Naïve Bayes and Support Vector Machine classifiers to classify a published Arabic corpus. The results show that, among the individual methods, Information Gain performs best. The results also show that the combination of multiple feature selection methods outperforms the best results obtained by the individual methods.
KW  - Arabic text classification
KW  - Combination method
KW  - Feature selection
UR  - http://www.scopus.com/inward/record.url?scp=84919781345&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84919781345&partnerID=8YFLogxK
U2  - 10.3844/jcssp.2014.2232.2239
DO  - 10.3844/jcssp.2014.2232.2239
M3  - Article
AN  - SCOPUS:84919781345
VL  - 10
SP  - 2232
EP  - 2239
JO  - Journal of Computer Science
JF  - Journal of Computer Science
SN  - 1549-3636
IS  - 11
ER  - 