Feature selection method based on statistics of compound words for Arabic text classification

Aisha Adel, Nazlia Omar, Mohammed Albared, Adel Al-Shabi

Research output: Contribution to journal (Article)

Abstract

One of the main problems of text classification is the high dimensionality of the feature space. Feature selection methods are normally used to reduce the dimensionality of datasets in order to improve classification performance, reduce processing time, or both. To improve the performance of text classification, a feature selection algorithm is presented that reduces the high dimensionality of the feature space using terminology extracted from the statistics of compound words. The proposed method is evaluated as a standalone method and in combination with other feature selection methods (two-stage method). Its performance is compared to that of six well-known feature selection methods: Information Gain, Chi-Square, Gini Index, Support Vector Machine-Based, Principal Components Analysis and Symmetric Uncertainty. A wide range of comparative experiments was conducted on three standard Arabic datasets with three classification algorithms. The experimental results show the superiority of the proposed method in both cases, whether standalone or in a two-stage scenario: it outperforms traditional approaches in classification accuracy, with a 6-10% gain in macro-averaged F1.

Original language: English
Pages (from-to): 178-185
Number of pages: 8
Journal: International Arab Journal of Information Technology
Volume: 16
Issue number: 2
Publication status: Published - 1 Mar 2019

Fingerprint

Feature extraction
Statistics
Terminology
Principal component analysis
Support vector machines
Macros
Processing
Experiments

Keywords

  • Arabic text classification
  • Compound words
  • Feature selection method

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Feature selection method based on statistics of compound words for Arabic text classification. / Adel, Aisha; Omar, Nazlia; Albared, Mohammed; Al-Shabi, Adel.

In: International Arab Journal of Information Technology, Vol. 16, No. 2, 01.03.2019, p. 178-185.

@article{a5f268621f8e42e0b0c18061c452df6d,
title = "Feature selection method based on statistics of compound words for Arabic text classification",
abstract = "One of the main problems of text classification is the high dimensionality of the feature space. Feature selection methods are normally used to reduce the dimensionality of datasets to improve the performance of the classification, or to reduce the processing time, or both. To improve the performance of text classification, a feature selection algorithm is presented, based on terminology extracted from the statistics of compound words, to reduce the high dimensionality of the feature space. The proposed method is evaluated as a standalone method and in combination with other feature selection methods (two-stage method). The performance of the proposed algorithm is compared to the performance of six well-known feature selection methods including Information Gain, Chi-Square, Gini Index, Support Vector Machine-Based, Principal Components Analysis and Symmetric Uncertainty. A wide range of comparative experiments were conducted on three Arabic standard datasets and with three classification algorithms. The experimental results clearly show the superiority of the proposed method in both cases as a standalone or in a two-stage scenario. The results show that the proposed method behaves better than traditional approaches in terms of classification accuracy with a 6-10{\%} gain in the macro-average, F1.",
keywords = "Arabic text classification, Compound words, Feature selection method",
author = "Aisha Adel and Nazlia Omar and Mohammed Albared and Adel Al-Shabi",
year = "2019",
month = "3",
day = "1",
language = "English",
volume = "16",
pages = "178--185",
journal = "International Arab Journal of Information Technology",
issn = "1683-3198",
publisher = "Zarqa University",
number = "2",
}

TY - JOUR

T1 - Feature selection method based on statistics of compound words for Arabic text classification

AU - Adel, Aisha

AU - Omar, Nazlia

AU - Albared, Mohammed

AU - Al-Shabi, Adel

PY - 2019/3/1

Y1 - 2019/3/1

AB - One of the main problems of text classification is the high dimensionality of the feature space. Feature selection methods are normally used to reduce the dimensionality of datasets to improve the performance of the classification, or to reduce the processing time, or both. To improve the performance of text classification, a feature selection algorithm is presented, based on terminology extracted from the statistics of compound words, to reduce the high dimensionality of the feature space. The proposed method is evaluated as a standalone method and in combination with other feature selection methods (two-stage method). The performance of the proposed algorithm is compared to the performance of six well-known feature selection methods including Information Gain, Chi-Square, Gini Index, Support Vector Machine-Based, Principal Components Analysis and Symmetric Uncertainty. A wide range of comparative experiments were conducted on three Arabic standard datasets and with three classification algorithms. The experimental results clearly show the superiority of the proposed method in both cases as a standalone or in a two-stage scenario. The results show that the proposed method behaves better than traditional approaches in terms of classification accuracy with a 6-10% gain in the macro-average, F1.

KW - Arabic text classification

KW - Compound words

KW - Feature selection method

UR - http://www.scopus.com/inward/record.url?scp=85069916987&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069916987&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85069916987

VL - 16

SP - 178

EP - 185

JO - International Arab Journal of Information Technology

JF - International Arab Journal of Information Technology

SN - 1683-3198

IS - 2

ER -