An efficient algorithm for data cleansing

Saleh Rehiel Alenazi, Kamsuriah Ahmad

Research output: Contribution to journal › Article

Abstract

Data collected from different sources may contain redundant and duplicate records. Such data need to be cleaned before they can be used for further processing, which requires a detection process that finds any duplication in the datasets. Two strategies are used to identify duplicates: windowing and blocking. The aims of this paper are to review, analyze and compare algorithms in order to find the most efficient one in terms of higher accuracy and fewer comparisons. A comparison was made of the five most popular algorithms: DYSNI, PSNM, Dedup, InnWin and DCS++. Two benchmark datasets, Restaurant and Cora, were used for the experiment. The results reveal that the DYSNI algorithm gives high accuracy on both datasets with respect to the number of comparisons. It is hoped that the results of this study provide a useful review and comparison of the existing algorithms for producing high-quality data, and serve as guidance for implementing better data storage systems.
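The two strategies named in the abstract can be illustrated with a minimal sketch. This is not the paper's code and does not implement any of the five compared algorithms; it only shows, under simple assumptions, how windowing (sorting by a key and comparing records inside a sliding window) and blocking (comparing records only within groups that share a key) reduce the number of candidate comparisons relative to comparing every pair. The function names, the sorting and blocking keys, and the sample strings are all hypothetical and chosen for illustration.

```python
from itertools import combinations


def all_pairs(records):
    """Naive baseline: every record is compared with every other record."""
    return list(combinations(range(len(records)), 2))


def windowed_pairs(records, key, window):
    """Windowing: sort by a key, then pair each record only with the
    next (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.append((min(i, j), max(i, j)))
    return pairs


def blocked_pairs(records, key):
    """Blocking: group records by a key and pair records only inside a group."""
    blocks = {}
    for i, record in enumerate(records):
        blocks.setdefault(key(record), []).append(i)
    return [pair for ids in blocks.values() for pair in combinations(ids, 2)]


# Made-up sample strings for illustration only (not taken from any dataset).
records = [
    "arnie mortons",
    "arnie morton's of chicago",
    "art's deli",
    "arts delicatessen",
    "cafe bizou",
    "campanile",
]

naive = all_pairs(records)
windowed = windowed_pairs(records, key=str.lower, window=3)
blocked = blocked_pairs(records, key=lambda r: r[0])  # assumed key: first letter
```

On these six sample strings the naive baseline yields 15 candidate pairs, the window of size 3 yields 9, and first-letter blocking yields 7. Each candidate pair would still have to be passed to a similarity function to decide whether it is a duplicate, and accuracy depends on whether true duplicates end up inside the same window or block.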

Original language: English
Pages (from-to): 6183-6191
Number of pages: 9
Journal: Journal of Theoretical and Applied Information Technology
Volume: 95
Issue number: 22
Publication status: Published - 30 Nov 2017

Keywords

  • Accuracy
  • Data cleansing
  • Data quality
  • Efficiency
  • Record deduplication deduction algorithm
  • Windowing-based

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

An efficient algorithm for data cleansing. / Alenazi, Saleh Rehiel; Ahmad, Kamsuriah.

In: Journal of Theoretical and Applied Information Technology, Vol. 95, No. 22, 30.11.2017, p. 6183-6191.

@article{53e182c806fc4c8fa30fac78733c9b45,
title = "An efficient algorithm for data cleansing",
abstract = "Data collected from different sources may contain redundant and duplicate records. Such data need to be cleaned before they can be used for further processing, which requires a detection process that finds any duplication in the datasets. Two strategies are used to identify duplicates: windowing and blocking. The aims of this paper are to review, analyze and compare algorithms in order to find the most efficient one in terms of higher accuracy and fewer comparisons. A comparison was made of the five most popular algorithms: DYSNI, PSNM, Dedup, InnWin and DCS++. Two benchmark datasets, Restaurant and Cora, were used for the experiment. The results reveal that the DYSNI algorithm gives high accuracy on both datasets with respect to the number of comparisons. It is hoped that the results of this study provide a useful review and comparison of the existing algorithms for producing high-quality data, and serve as guidance for implementing better data storage systems.",
keywords = "Accuracy, Data cleansing, Data quality, Efficiency, Record deduplication deduction algorithm, Windowing-based",
author = "Alenazi, Saleh Rehiel and Ahmad, Kamsuriah",
year = "2017",
month = "11",
day = "30",
language = "English",
volume = "95",
pages = "6183--6191",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "22",
}

TY - JOUR

T1 - An efficient algorithm for data cleansing

AU - Alenazi, Saleh Rehiel

AU - Ahmad, Kamsuriah

PY - 2017/11/30

Y1 - 2017/11/30

AB - Data collected from different sources may contain redundant and duplicate records. Such data need to be cleaned before they can be used for further processing, which requires a detection process that finds any duplication in the datasets. Two strategies are used to identify duplicates: windowing and blocking. The aims of this paper are to review, analyze and compare algorithms in order to find the most efficient one in terms of higher accuracy and fewer comparisons. A comparison was made of the five most popular algorithms: DYSNI, PSNM, Dedup, InnWin and DCS++. Two benchmark datasets, Restaurant and Cora, were used for the experiment. The results reveal that the DYSNI algorithm gives high accuracy on both datasets with respect to the number of comparisons. It is hoped that the results of this study provide a useful review and comparison of the existing algorithms for producing high-quality data, and serve as guidance for implementing better data storage systems.

KW - Accuracy

KW - Data cleansing

KW - Data quality

KW - Efficiency

KW - Record deduplication deduction algorithm

KW - Windowing-based

UR - http://www.scopus.com/inward/record.url?scp=85036497304&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85036497304&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85036497304

VL - 95

SP - 6183

EP - 6191

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 22

ER -