Building a new taxonomy for data discretization techniques

Azuraliza Abu Bakar, Zulaiha Ali Othman, Nor Liyana Mohd Shuib

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

19 Citations (Scopus)

Abstract

Data preprocessing is an important step in data mining. It is used to resolve various types of problems in a large dataset in order to produce quality data. It consists of four steps: data cleaning, integration, reduction and transformation. The literature shows that each preprocessing step comprises various techniques. To produce quality data, a data miner must choose the most appropriate technique at every step of data preprocessing. In this study, we focus on data reduction, and in particular on data discretization, as one important data preprocessing step. Data reduction involves narrowing the data distribution by mapping the range of continuous data onto a smaller set of values or categories. Data discretization plays a major role in reducing the number of intervals over an attribute's values. Finding an appropriate number of discrete values will improve the performance of data mining models, particularly in terms of classification accuracy. This paper proposes a four-level taxonomy of data discretization: hierarchical and non-hierarchical; splitting, merging and combination; supervised and unsupervised combinations; and binning, statistic, entropy and other related techniques. The taxonomy is based on a detailed review of previous discretization techniques. More than fifty techniques are investigated, and the structure of the discretization approach is outlined. Guidelines on how to use the proposed taxonomy are also discussed.
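As a rough illustration of two of the categories named in the abstract, the sketch below contrasts an unsupervised binning method (equal-width intervals) with a supervised, entropy-based splitting step. It is not taken from the paper: the function names, the toy `ages` and `buys` data, and the choice of a single binary cut are all assumptions made for the example.

```python
import math
from collections import Counter


def equal_width_bins(values, k):
    """Unsupervised binning: map each value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_split(values, labels):
    """Supervised splitting: pick the cut point that minimises the
    size-weighted class entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


# Hypothetical toy data: a continuous attribute and a class label.
ages = [18, 22, 25, 31, 40, 47, 52, 60]
buys = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

bins = equal_width_bins(ages, 3)
cut = best_entropy_split(ages, buys)
print(bins)  # [0, 0, 0, 0, 1, 2, 2, 2]
print(cut)   # 35.5
```

The equal-width result ignores the class labels entirely, whereas the entropy-based cut at 35.5 separates the two classes exactly. A hierarchical splitting method would typically apply such a cut recursively to each resulting interval until some stopping rule is met.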

Original language: English
Title of host publication: 2009 2nd Conference on Data Mining and Optimization, DMO 2009
Pages: 132-140
Number of pages: 9
DOIs: https://doi.org/10.1109/DMO.2009.5341896
Publication status: Published - 2009
Event: 2009 2nd Conference on Data Mining and Optimization, DMO 2009 - Bangi, Selangor
Duration: 27 Oct 2009 to 28 Oct 2009

Keywords

  • Data discretization
  • Data mining
  • Data preprocessing

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Software

Cite this

Abu Bakar, A., Ali Othman, Z., & Shuib, N. L. M. (2009). Building a new taxonomy for data discretization techniques. In 2009 2nd Conference on Data Mining and Optimization, DMO 2009 (pp. 132-140). [5341896] https://doi.org/10.1109/DMO.2009.5341896

@inproceedings{48f474716eca4489928b74d8a3614661,
title = "Building a new taxonomy for data discretization techniques",
keywords = "Data discretization, Data mining, Data preprocessing",
author = "{Abu Bakar}, Azuraliza and {Ali Othman}, Zulaiha and Shuib, {Nor Liyana Mohd}",
year = "2009",
doi = "10.1109/DMO.2009.5341896",
language = "English",
isbn = "9781424449446",
pages = "132--140",
booktitle = "2009 2nd Conference on Data Mining and Optimization, DMO 2009",

}

TY - GEN

T1 - Building a new taxonomy for data discretization techniques

AU - Abu Bakar, Azuraliza

AU - Ali Othman, Zulaiha

AU - Shuib, Nor Liyana Mohd

PY - 2009

Y1 - 2009

KW - Data discretization

KW - Data mining

KW - Data preprocessing

UR - http://www.scopus.com/inward/record.url?scp=72449142147&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=72449142147&partnerID=8YFLogxK

U2 - 10.1109/DMO.2009.5341896

DO - 10.1109/DMO.2009.5341896

M3 - Conference contribution

AN - SCOPUS:72449142147

SN - 9781424449446

SP - 132

EP - 140

BT - 2009 2nd Conference on Data Mining and Optimization, DMO 2009

ER -