Distance measures and stemming impact on Arabic document clustering

Qusay Bsoul, Eiman Al-Shamari, Masnizah Mohd, Jaffar Atwan

Research output: Contribution to journal, Article

2 Citations (Scopus)

Abstract

Clustering Arabic documents is considered a vital aspect of obtaining optimal results from unsupervised learning. Its aim is to automatically group similar documents into a single cluster using similarity or distance measures. However, many similarity and distance measures are available, and their effectiveness in document clustering when combined with syntax-aware stemming is still unclear. Therefore, this study evaluates the impact of five similarity/distance measures (cosine similarity, the Jaccard coefficient, Pearson's correlation coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) combined with two stemming algorithms (morphology- and syntax-based lemmatization, and morphology-based Information Science Research Institute (ISRI) stemming) on clustering an Arabic text dataset. We aim to identify the best-performing similarity and distance measures and determine which measure is most suitable for Arabic document clustering. Our experimental method, which is based on syntactic structure and morphology, outperformed other stemming methods with any of the five similarity/distance measures for Arabic document clustering. The best-performing similarity and distance measures are cosine similarity and Euclidean distance, respectively.
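The five measures named in the abstract can be illustrated on plain term-weight vectors. The following is a minimal sketch, not the authors' implementation: the function names and toy vectors are hypothetical, the Jaccard coefficient is assumed to be the extended (Tanimoto) form for weighted vectors, and the averaged Kullback-Leibler divergence follows one common formulation from the document-clustering literature, which may differ from the definition used in the paper.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (1.0 = same direction)."""
    na, nb = math.sqrt(_dot(a, a)), math.sqrt(_dot(b, b))
    return _dot(a, b) / (na * nb) if na and nb else 0.0


def jaccard_coefficient(a, b):
    """Extended (Tanimoto) Jaccard coefficient for weighted vectors (assumed form)."""
    dot = _dot(a, b)
    denom = _dot(a, a) + _dot(b, b) - dot
    return dot / denom if denom else 0.0


def pearson_correlation(a, b):
    """Pearson's correlation coefficient between two term vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two term vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def averaged_kl_divergence(a, b):
    """Symmetric, weighted average of KL terms (assumed formulation).

    Both vectors are first normalised to term distributions. For each term,
    pi1 = p / (p + q), pi2 = q / (p + q), m = pi1 * p + pi2 * q, and the
    contribution is pi1 * p * log(p / m) + pi2 * q * log(q / m).
    """
    sum_a, sum_b = sum(a), sum(b)
    p = [x / sum_a for x in a]
    q = [y / sum_b for y in b]
    total = 0.0
    for pi, qi in zip(p, q):
        s = pi + qi
        if s == 0:
            continue  # term absent from both documents
        w1, w2 = pi / s, qi / s
        m = w1 * pi + w2 * qi
        if pi > 0:
            total += w1 * pi * math.log(pi / m)
        if qi > 0:
            total += w2 * qi * math.log(qi / m)
    return total


# Hypothetical term-frequency vectors for two stemmed documents.
doc_a = [3.0, 1.0, 0.0]
doc_b = [1.0, 1.0, 2.0]
for measure in (cosine_similarity, jaccard_coefficient, pearson_correlation,
                euclidean_distance, averaged_kl_divergence):
    print(measure.__name__, measure(doc_a, doc_b))
```

Note that the first three functions are similarities (larger means more alike) while the last two are distances (smaller means more alike), so a clustering routine has to treat the two groups with opposite orientation.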

Original language: English
Pages (from-to): 327-329
Number of pages: 3
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8870
Publication status: Published - 2014
Externally published: Yes

Fingerprint

Document Clustering
Distance Measure
Similarity Measure
Syntactics
Unsupervised learning
Information science
Euclidean Distance
Clustering
Pearson Correlation
Kullback-Leibler Divergence
Correlation coefficient
Evaluate
Coefficient
Syntax

Keywords

  • Arabic document clustering
  • Lemmatization stemming
  • Partitional clustering
  • Similarity/distance measures

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

@article{9303b1d08008472b8eb2f93b01280bcc,
title = "Distance measures and stemming impact on Arabic document clustering",
abstract = "Clustering Arabic documents is considered a vital aspect of obtaining optimal results from unsupervised learning. Its aim is to automatically group similar documents into a single cluster using similarity or distance measures. However, many similarity and distance measures are available, and their effectiveness in document clustering when combined with syntax-aware stemming is still unclear. Therefore, this study evaluates the impact of five similarity/distance measures (cosine similarity, the Jaccard coefficient, Pearson's correlation coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) combined with two stemming algorithms (morphology- and syntax-based lemmatization, and morphology-based Information Science Research Institute (ISRI) stemming) on clustering an Arabic text dataset. We aim to identify the best-performing similarity and distance measures and determine which measure is most suitable for Arabic document clustering. Our experimental method, which is based on syntactic structure and morphology, outperformed other stemming methods with any of the five similarity/distance measures for Arabic document clustering. The best-performing similarity and distance measures are cosine similarity and Euclidean distance, respectively.",
keywords = "Arabic document clustering, Lemmatization stemming, Partitional clustering, Similarity/distance measures",
author = "Qusay Bsoul and Eiman Al-Shamari and Masnizah Mohd and Jaffar Atwan",
year = "2014",
language = "English",
volume = "8870",
pages = "327--329",
journal = "Lecture Notes in Computer Science",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - Distance measures and stemming impact on Arabic document clustering

AU - Bsoul, Qusay

AU - Al-Shamari, Eiman

AU - Mohd, Masnizah

AU - Atwan, Jaffar

PY - 2014

Y1 - 2014

AB - Clustering Arabic documents is considered a vital aspect of obtaining optimal results from unsupervised learning. Its aim is to automatically group similar documents into a single cluster using similarity or distance measures. However, many similarity and distance measures are available, and their effectiveness in document clustering when combined with syntax-aware stemming is still unclear. Therefore, this study evaluates the impact of five similarity/distance measures (cosine similarity, the Jaccard coefficient, Pearson's correlation coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) combined with two stemming algorithms (morphology- and syntax-based lemmatization, and morphology-based Information Science Research Institute (ISRI) stemming) on clustering an Arabic text dataset. We aim to identify the best-performing similarity and distance measures and determine which measure is most suitable for Arabic document clustering. Our experimental method, which is based on syntactic structure and morphology, outperformed other stemming methods with any of the five similarity/distance measures for Arabic document clustering. The best-performing similarity and distance measures are cosine similarity and Euclidean distance, respectively.

KW - Arabic document clustering

KW - Lemmatization stemming

KW - Partitional clustering

KW - Similarity/distance measures

UR - http://www.scopus.com/inward/record.url?scp=84921415594&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921415594&partnerID=8YFLogxK

M3 - Article

VL - 8870

SP - 327

EP - 329

JO - Lecture Notes in Computer Science

JF - Lecture Notes in Computer Science

SN - 0302-9743

ER -