Comparative study of k-means and k-means++ clustering algorithms on crime domain

Bashar Aubaidan, Masnizah Mohd, Mohammed Albared, Fourth Author

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

This study presents an experimental comparison of two document clustering techniques, k-means and k-means++, applied to crime documents. A drawback of k-means is that the user must define the initial centroid points. This becomes more critical in document clustering because each center point is represented by a word, and calculating the distance between words is not a trivial task. To overcome this problem, k-means++ was introduced to find good initial center points. Since k-means++ has not been applied before to crime document clustering, this study compares k-means and k-means++ to investigate whether the initialization process in k-means++ helps to obtain better results than k-means. We propose using the k-means++ clustering algorithm to identify the best seeds for the initial cluster centers when clustering crime documents. The method includes a preprocessing phase that involves tokenization, stop-word removal and stemming. In addition, we evaluate the impact of two similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of the two clustering algorithms. Experimental results on several settings of the crime data set showed that, by identifying the best seeds for the initial cluster centers, k-means++ performs significantly better than k-means (at the 95% confidence level). These results demonstrate the accuracy of the k-means++ clustering algorithm in clustering crime documents.
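The seeding idea and the two similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the bag-of-words representation via `Counter`, and the use of `1 - similarity` as the distance are assumptions made here for illustration, and stop-word removal and stemming are omitted.

```python
import math
import random
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a: Counter, b: Counter) -> float:
    """Jaccard coefficient on the sets of distinct terms."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def kmeans_pp_seeds(docs, k, similarity=cosine_similarity, rng=None):
    """Pick k distinct document indices as initial centers, k-means++ style.

    Assumption: distance is taken as 1 - similarity. Each new seed is drawn
    with probability proportional to its squared distance to the nearest
    seed chosen so far.
    """
    if not 1 <= k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    rng = rng or random.Random()
    seeds = [rng.randrange(len(docs))]  # first seed: uniform at random
    while len(seeds) < k:
        weights = []
        for i, doc in enumerate(docs):
            if i in seeds:
                weights.append(0.0)  # never pick the same document twice
                continue
            nearest = min(max(0.0, 1.0 - similarity(doc, docs[s])) for s in seeds)
            weights.append(nearest ** 2)
        if sum(weights) == 0:
            # Every remaining document coincides with a seed: choose uniformly.
            remaining = [i for i in range(len(docs)) if i not in seeds]
            seeds.append(rng.choice(remaining))
        else:
            seeds.append(rng.choices(range(len(docs)), weights=weights, k=1)[0])
    return seeds


# Usage with three toy "documents" (whitespace tokenization only).
texts = ["car theft reported", "car theft reported", "online bank fraud"]
docs = [Counter(t.lower().split()) for t in texts]
seeds = kmeans_pp_seeds(docs, k=2, rng=random.Random(0))
```

Passing `similarity=jaccard_coefficient` switches the measure. Because far-apart documents receive higher selection weight, the initial centers tend to be spread across the collection rather than clustered together, which is the property the study sets out to evaluate.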

Original language: English
Pages (from-to): 1197-1206
Number of pages: 10
Journal: Journal of Computer Science
Volume: 10
Issue number: 7
DOIs: 10.3844/jcssp.2014.1197.1206
Publication status: Published - 2014

Fingerprint

Crime
Clustering algorithms
Seed

Keywords

  • Crime document clustering
  • K-means algorithm
  • K-Means++
  • Similarity/distance measures

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

Comparative study of k-means and k-means++ clustering algorithms on crime domain. / Aubaidan, Bashar; Mohd, Masnizah; Albared, Mohammed; Author, Fourth.

In: Journal of Computer Science, Vol. 10, No. 7, 2014, p. 1197-1206.

Aubaidan, Bashar ; Mohd, Masnizah ; Albared, Mohammed ; Author, Fourth. / Comparative study of k-means and k-means++ clustering algorithms on crime domain. In: Journal of Computer Science. 2014 ; Vol. 10, No. 7. pp. 1197-1206.
@article{7f6136cd4ad84223a269517fb80fa9cf,
title = "Comparative study of k-means and k-means++ clustering algorithms on crime domain",
abstract = "This study presents an experimental comparison of two document clustering techniques, k-means and k-means++, applied to crime documents. A drawback of k-means is that the user must define the initial centroid points. This becomes more critical in document clustering because each center point is represented by a word, and calculating the distance between words is not a trivial task. To overcome this problem, k-means++ was introduced to find good initial center points. Since k-means++ has not been applied before to crime document clustering, this study compares k-means and k-means++ to investigate whether the initialization process in k-means++ helps to obtain better results than k-means. We propose using the k-means++ clustering algorithm to identify the best seeds for the initial cluster centers when clustering crime documents. The method includes a preprocessing phase that involves tokenization, stop-word removal and stemming. In addition, we evaluate the impact of two similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of the two clustering algorithms. Experimental results on several settings of the crime data set showed that, by identifying the best seeds for the initial cluster centers, k-means++ performs significantly better than k-means (at the 95{\%} confidence level). These results demonstrate the accuracy of the k-means++ clustering algorithm in clustering crime documents.",
keywords = "Crime document clustering, K-means algorithm, K-Means++, Similarity/distance measures",
author = "Bashar Aubaidan and Masnizah Mohd and Mohammed Albared and Fourth Author",
year = "2014",
doi = "10.3844/jcssp.2014.1197.1206",
language = "English",
volume = "10",
pages = "1197--1206",
journal = "Journal of Computer Science",
issn = "1549-3636",
publisher = "Science Publications",
number = "7",

}

TY - JOUR

T1 - Comparative study of k-means and k-means++ clustering algorithms on crime domain

AU - Aubaidan, Bashar

AU - Mohd, Masnizah

AU - Albared, Mohammed

AU - Author, Fourth

PY - 2014

Y1 - 2014

N2 - This study presents an experimental comparison of two document clustering techniques, k-means and k-means++, applied to crime documents. A drawback of k-means is that the user must define the initial centroid points. This becomes more critical in document clustering because each center point is represented by a word, and calculating the distance between words is not a trivial task. To overcome this problem, k-means++ was introduced to find good initial center points. Since k-means++ has not been applied before to crime document clustering, this study compares k-means and k-means++ to investigate whether the initialization process in k-means++ helps to obtain better results than k-means. We propose using the k-means++ clustering algorithm to identify the best seeds for the initial cluster centers when clustering crime documents. The method includes a preprocessing phase that involves tokenization, stop-word removal and stemming. In addition, we evaluate the impact of two similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of the two clustering algorithms. Experimental results on several settings of the crime data set showed that, by identifying the best seeds for the initial cluster centers, k-means++ performs significantly better than k-means (at the 95% confidence level). These results demonstrate the accuracy of the k-means++ clustering algorithm in clustering crime documents.

KW - Crime document clustering

KW - K-means algorithm

KW - K-Means++

KW - Similarity/distance measures

UR - http://www.scopus.com/inward/record.url?scp=84894630837&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84894630837&partnerID=8YFLogxK

U2 - 10.3844/jcssp.2014.1197.1206

DO - 10.3844/jcssp.2014.1197.1206

M3 - Article

AN - SCOPUS:84894630837

VL - 10

SP - 1197

EP - 1206

JO - Journal of Computer Science

JF - Journal of Computer Science

SN - 1549-3636

IS - 7

ER -