Effect of ISRI stemming on similarity measure for Arabic document clustering

Qusay Walid Bsoul, Masnizah Mohd

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

Arabic document clustering has become an increasingly important unsupervised learning task. This paper evaluates the impact of five similarity/distance measures (cosine similarity, Jaccard coefficient, Pearson correlation, Euclidean distance and averaged Kullback-Leibler divergence) on document clustering under two types of pre-processing for an Arabic dataset: morphology-based stemming with the Information Science Research Institute (ISRI) stemmer, which is equivalent to the root-based stemmer and light stemmer, and no stemming (without morphology). Stemming is a computational process that reduces words to their stems. For classification, it is categorised as a recall-enhancing or precision-enhancing component. The paper concludes that ISRI stemming gives better document clustering results than no stemming when the five similarity/distance measures are used.
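The abstract names the five measures but does not define them. As a minimal sketch, assuming documents are represented as equal-length, non-empty term-weight vectors, they could be computed as below. The function names are illustrative, the Jaccard coefficient uses the extended (Tanimoto) form for real-valued vectors, and the averaged Kullback-Leibler divergence follows one common weighted-average formulation from the text clustering literature; the paper's exact definitions may differ.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_coefficient(a, b):
    # Extended (Tanimoto) form for real-valued vectors (an assumption here).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def pearson_correlation(a, b):
    # Covariance of the two vectors divided by the product of their spreads.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def averaged_kl_divergence(a, b):
    # Assumed formulation: each vector is normalised to a term distribution,
    # then each term contributes pi1 * D(p || w) + pi2 * D(q || w), where
    # pi1 = p / (p + q), pi2 = q / (p + q) and w = pi1 * p + pi2 * q.
    total_a, total_b = sum(a), sum(b)
    p = [x / total_a for x in a]
    q = [y / total_b for y in b]
    total = 0.0
    for x, y in zip(p, q):
        if x + y == 0:
            continue  # term absent from both documents
        pi1, pi2 = x / (x + y), y / (x + y)
        w = pi1 * x + pi2 * y
        if x > 0:
            total += pi1 * x * math.log(x / w)
        if y > 0:
            total += pi2 * y * math.log(y / w)
    return total
```

Cosine, Jaccard and Pearson are similarities (larger means more alike), while Euclidean distance and the averaged divergence are distances (smaller means more alike), so a clustering routine would need to treat the two groups with opposite orientation.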

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 584-593
Number of pages: 10
Volume: 7097 LNCS
DOIs: https://doi.org/10.1007/978-3-642-25631-8_53
Publication status: Published - 2011
Event: 7th Asia Information Retrieval Societies Conference, AIRS 2011 - Dubai
Duration: 18 Dec 2011 - 20 Dec 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7097 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th Asia Information Retrieval Societies Conference, AIRS 2011
City: Dubai
Period: 18/12/11 - 20/12/11

Keywords

  • document clustering
  • partitional clustering
  • Similarity measures
  • Stemming

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Bsoul, Q. W., & Mohd, M. (2011). Effect of ISRI stemming on similarity measure for Arabic document clustering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7097 LNCS, pp. 584-593). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7097 LNCS). https://doi.org/10.1007/978-3-642-25631-8_53

@inproceedings{ad74a97d5cfc484eaa23c3ec006c24ee,
title = "Effect of ISRI stemming on similarity measure for Arabic document clustering",
abstract = "Arabic document clustering has become an increasingly important unsupervised learning task. This paper evaluates the impact of five similarity/distance measures (cosine similarity, Jaccard coefficient, Pearson correlation, Euclidean distance and averaged Kullback-Leibler divergence) on document clustering under two types of pre-processing for an Arabic dataset: morphology-based stemming with the Information Science Research Institute (ISRI) stemmer, which is equivalent to the root-based stemmer and light stemmer, and no stemming (without morphology). Stemming is a computational process that reduces words to their stems. For classification, it is categorised as a recall-enhancing or precision-enhancing component. The paper concludes that ISRI stemming gives better document clustering results than no stemming when the five similarity/distance measures are used.",
keywords = "document clustering, partitional clustering, Similarity measures, Stemming",
author = "Bsoul, {Qusay Walid} and Masnizah Mohd",
year = "2011",
doi = "10.1007/978-3-642-25631-8_53",
language = "English",
isbn = "9783642256301",
volume = "7097 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "584--593",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Effect of ISRI stemming on similarity measure for Arabic document clustering

AU - Bsoul, Qusay Walid

AU - Mohd, Masnizah

PY - 2011

Y1 - 2011

AB - Arabic document clustering has become an increasingly important unsupervised learning task. This paper evaluates the impact of five similarity/distance measures (cosine similarity, Jaccard coefficient, Pearson correlation, Euclidean distance and averaged Kullback-Leibler divergence) on document clustering under two types of pre-processing for an Arabic dataset: morphology-based stemming with the Information Science Research Institute (ISRI) stemmer, which is equivalent to the root-based stemmer and light stemmer, and no stemming (without morphology). Stemming is a computational process that reduces words to their stems. For classification, it is categorised as a recall-enhancing or precision-enhancing component. The paper concludes that ISRI stemming gives better document clustering results than no stemming when the five similarity/distance measures are used.

KW - document clustering

KW - partitional clustering

KW - Similarity measures

KW - Stemming

UR - http://www.scopus.com/inward/record.url?scp=84255178429&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84255178429&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-25631-8_53

DO - 10.1007/978-3-642-25631-8_53

M3 - Conference contribution

SN - 9783642256301

VL - 7097 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 584

EP - 593

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -