Arabic text classification using K-nearest neighbour algorithm

Roiss Alhutaish, Nazlia Omar

Research output: Contribution to journal › Article

12 Citations (Scopus)

Abstract

Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English texts, with only a few researchers addressing Arabic texts. We have investigated the use of the K-Nearest Neighbour (K-NN) classifier with a new similarity measure, Inew, and with the cosine, Jaccard and Dice similarities, in order to enhance Arabic ATC. We represent the dataset in both un-stemmed and stemmed form, using TREC-2002 to remove prefixes and suffixes. For statistical text representation, Bag-Of-Words (BOW) and character-level 3-grams (3-Gram) were used. To reduce the dimensionality of the feature space, we used several feature selection methods. Experiments conducted with Arabic text showed that the K-NN classifier with the new similarity measure Inew (92.6% Macro-F1) performed better than the K-NN classifier with the cosine, Jaccard and Dice similarities. Chi-square feature selection with BOW representation gave the best performance among the feature selection methods tested with BOW and 3-Gram.
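The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy training data, and the choice of the extended (Tanimoto) form of Jaccard and the vector form of Dice are assumptions made for illustration. The Inew measure is not defined in the abstract, so it is left out, as are stemming and chi-square feature selection.

```python
from collections import Counter
import math


def bow(text):
    """Bag-of-words counts from whitespace-separated tokens."""
    return Counter(text.split())


def char_ngrams(text, n=3):
    """Character-level n-gram counts (n=3 gives 3-grams)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dot(a, b):
    """Dot product of two sparse count vectors."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())


def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    # Assumption: extended (Tanimoto) Jaccard for weighted vectors.
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


def dice(a, b):
    # Assumption: vector form of the Dice coefficient.
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0


def knn_predict(query, train, k=3, sim=cosine, featurize=bow):
    """Label `query` by majority vote among its k most similar training docs."""
    q = featurize(query)
    scored = sorted(
        ((sim(q, featurize(doc)), label) for doc, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus (English placeholders standing in for Arabic text).
TRAIN = [
    ("goal match team", "sport"),
    ("team player goal", "sport"),
    ("bank market stock", "economy"),
    ("stock price bank", "economy"),
]

if __name__ == "__main__":
    for sim in (cosine, jaccard, dice):
        print(sim.__name__, knn_predict("goal team win", TRAIN, k=3, sim=sim))
```

Passing `featurize=char_ngrams` switches the representation from BOW to character 3-grams without changing the classifier, which mirrors the comparison of representations the abstract describes.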

Original language: English
Pages (from-to): 190-195
Number of pages: 6
Journal: International Arab Journal of Information Technology
Volume: 12
Issue number: 2
Publication status: Published - 2015

Fingerprint

Feature extraction
Classifiers
Macros
Experiments

Keywords

  • ATC
  • Feature selection methods
  • K-NN
  • Similarity measures

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Arabic text classification using K-nearest neighbour algorithm. / Alhutaish, Roiss; Omar, Nazlia.

In: International Arab Journal of Information Technology, Vol. 12, No. 2, 2015, p. 190-195.

@article{d86d4344a9fb4f338fbe0d4b7388f3cd,
title = "Arabic text classification using K-nearest neighbour algorithm",
abstract = "Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English texts, with only a few researchers addressing Arabic texts. We have investigated the use of the K-Nearest Neighbour (K-NN) classifier with a new similarity measure, Inew, and with the cosine, Jaccard and Dice similarities, in order to enhance Arabic ATC. We represent the dataset in both un-stemmed and stemmed form, using TREC-2002 to remove prefixes and suffixes. For statistical text representation, Bag-Of-Words (BOW) and character-level 3-grams (3-Gram) were used. To reduce the dimensionality of the feature space, we used several feature selection methods. Experiments conducted with Arabic text showed that the K-NN classifier with the new similarity measure Inew (92.6{\%} Macro-F1) performed better than the K-NN classifier with the cosine, Jaccard and Dice similarities. Chi-square feature selection with BOW representation gave the best performance among the feature selection methods tested with BOW and 3-Gram.",
keywords = "ATC, Feature selection methods, K-NN, Similarity measures",
author = "Roiss Alhutaish and Nazlia Omar",
year = "2015",
language = "English",
volume = "12",
pages = "190--195",
journal = "International Arab Journal of Information Technology",
issn = "1683-3198",
publisher = "Zarqa University",
number = "2",

}

TY - JOUR

T1 - Arabic text classification using K-nearest neighbour algorithm

AU - Alhutaish, Roiss

AU - Omar, Nazlia

PY - 2015

Y1 - 2015

AB - Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English texts, with only a few researchers addressing Arabic texts. We have investigated the use of the K-Nearest Neighbour (K-NN) classifier with a new similarity measure, Inew, and with the cosine, Jaccard and Dice similarities, in order to enhance Arabic ATC. We represent the dataset in both un-stemmed and stemmed form, using TREC-2002 to remove prefixes and suffixes. For statistical text representation, Bag-Of-Words (BOW) and character-level 3-grams (3-Gram) were used. To reduce the dimensionality of the feature space, we used several feature selection methods. Experiments conducted with Arabic text showed that the K-NN classifier with the new similarity measure Inew (92.6% Macro-F1) performed better than the K-NN classifier with the cosine, Jaccard and Dice similarities. Chi-square feature selection with BOW representation gave the best performance among the feature selection methods tested with BOW and 3-Gram.

KW - ATC

KW - Feature selection methods

KW - K-NN

KW - Similarity measures

UR - http://www.scopus.com/inward/record.url?scp=84923832095&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84923832095&partnerID=8YFLogxK

M3 - Article

VL - 12

SP - 190

EP - 195

JO - International Arab Journal of Information Technology

JF - International Arab Journal of Information Technology

SN - 1683-3198

IS - 2

ER -