An intelligent data pre-processing of complex datasets

Research output: Contribution to journal › Article

10 Citations (Scopus)

Abstract

Pre-processing plays a vital role in classification tasks, particularly when complex features are involved, and it demands an intelligent method. In bioinformatics, where datasets are characterised by complex features, pre-processing is unavoidable. In this paper, we propose a framework that integrates the filter and wrapper approaches to select discriminatory features from protein sequences prior to classification. Several state-of-the-art multivariate filters were explored in the first phase to remove unwanted features that contributed to noise, while particle swarm optimisation (PSO) with a support vector machine (SVM) was adopted in the wrapper phase to produce the optimal feature subset. Several PSO variants were investigated in the wrapper phase to identify the most suitable variant for the problem domain. The results of both phases were analysed in terms of classification accuracy, number of selected features, modelling time and area under the curve, on the main dataset and on five benchmark machine learning datasets of similar complexity. The proposed framework reliably achieved higher classification accuracy than both the filter phase alone and the full feature set, despite using fewer features.
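The two-phase idea in the abstract (a filter pass that discards noisy features, then a PSO-driven wrapper that searches the remaining subset) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the synthetic data, the class-mean-difference score (a simple univariate stand-in for the multivariate filters), the nearest-centroid classifier (a dependency-free stand-in for the SVM), the sigmoid-based binary PSO update, and all function names and parameter values.

```python
import math
import random


def make_data(n=120, n_informative=3, n_noise=12, seed=0):
    """Synthetic two-class data; only the first n_informative columns carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(2.0 * label, 1.0) for _ in range(n_informative)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y


def filter_scores(X, y):
    """Filter-phase stand-in: absolute difference of the two class means per feature."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, c in zip(X, y) if c == 0]
        b = [row[j] for row, c in zip(X, y) if c == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return scores


def holdout_accuracy(X, y, mask, train, test):
    """Nearest-centroid accuracy on the masked features (assumed stand-in for an SVM)."""
    cols = [j for j, on in enumerate(mask) if on]
    if not cols:
        return 0.0
    centroids = {}
    for c in (0, 1):
        rows = [X[i] for i in train if y[i] == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    hits = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(cols, mu))
                for c, mu in centroids.items()}
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(test)


def binary_pso(fitness, dim, n_particles=10, n_iter=20, seed=0):
    """Binary PSO: real-valued velocities, bits resampled through a sigmoid."""
    rng = random.Random(seed)
    pos = [[rng.random() < 0.5 for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration weights (illustrative values)
    for _ in range(n_iter):
        for k in range(n_particles):
            for j in range(dim):
                v = (w * vel[k][j]
                     + c1 * rng.random() * (pbest[k][j] - pos[k][j])
                     + c2 * rng.random() * (gbest[j] - pos[k][j]))
                vel[k][j] = max(-4.0, min(4.0, v))  # clamp to keep the sigmoid responsive
                pos[k][j] = rng.random() < 1.0 / (1.0 + math.exp(-vel[k][j]))
            f = fitness(pos[k])
            if f > pbest_fit[k]:
                pbest[k], pbest_fit[k] = pos[k][:], f
            if f > gbest_fit:
                gbest, gbest_fit = pos[k][:], f
    return gbest, gbest_fit


def select_features(X, y, keep=8, seed=0):
    """Phase 1: keep the top-scoring features. Phase 2: PSO wrapper over that subset."""
    scores = filter_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])
    Xf = [[row[j] for j in kept] for row in X]
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.7 * len(idx))
    train, test = idx[:cut], idx[cut:]

    def fitness(mask):
        # Accuracy with a small penalty on subset size, so smaller subsets win ties.
        return holdout_accuracy(Xf, y, mask, train, test) - 0.01 * sum(mask) / len(mask)

    mask, fit = binary_pso(fitness, len(kept), seed=seed)
    chosen = [kept[j] for j, on in enumerate(mask) if on]
    return kept, chosen, fit


if __name__ == "__main__":
    X, y = make_data()
    kept, chosen, fit = select_features(X, y)
    print("after filter:", kept)
    print("after wrapper:", chosen, "fitness:", round(fit, 3))
```

The fitness function trades accuracy against subset size, which mirrors the abstract's two evaluation criteria of classification accuracy and number of selected features. A single hold-out split is used here only to keep the sketch short; cross-validation inside the wrapper would be the more usual choice.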

Original language: English
Pages (from-to): 305-325
Number of pages: 21
Journal: Intelligent Data Analysis
Volume: 16
Issue number: 2
DOIs: 10.3233/IDA-2012-0525
Publication status: Published - 2012

Fingerprint

Data Preprocessing
Wrapper
Particle swarm optimization (PSO)
Particle Swarm Optimization
Processing
Filter
Preprocessing
Feature Modeling
Bioinformatics
Support vector machines
Learning systems
Protein Sequence
Support Vector Machine
Proteins
Machine Learning
Benchmark
Curve

Keywords

  • Classification
  • data mining
  • feature selection
  • machine learning
  • optimisation
  • particle swarm optimisation

ASJC Scopus subject areas

  • Artificial Intelligence
  • Theoretical Computer Science
  • Computer Vision and Pattern Recognition

Cite this

An intelligent data pre-processing of complex datasets. / Abdul-Rahman, Shuzlina; Abu Bakar, Azuraliza; Mohamed Hussein, Zeti Azura.

In: Intelligent Data Analysis, Vol. 16, No. 2, 2012, p. 305-325.

@article{483c6811941e40d2811c2558193ebd60,
title = "An intelligent data pre-processing of complex datasets",
abstract = "Pre-processing plays a vital role in classification tasks, particularly when complex features are involved, and it demands an intelligent method. In bioinformatics, where datasets are characterised by complex features, pre-processing is unavoidable. In this paper, we propose a framework that integrates the filter and wrapper approaches to select discriminatory features from protein sequences prior to classification. Several state-of-the-art multivariate filters were explored in the first phase to remove unwanted features that contributed to noise, while particle swarm optimisation (PSO) with a support vector machine (SVM) was adopted in the wrapper phase to produce the optimal feature subset. Several PSO variants were investigated in the wrapper phase to identify the most suitable variant for the problem domain. The results of both phases were analysed in terms of classification accuracy, number of selected features, modelling time and area under the curve, on the main dataset and on five benchmark machine learning datasets of similar complexity. The proposed framework reliably achieved higher classification accuracy than both the filter phase alone and the full feature set, despite using fewer features.",
keywords = "Classification, data mining, feature selection, machine learning, optimisation, particle swarm optimisation",
author = "Abdul-Rahman, Shuzlina and {Abu Bakar}, Azuraliza and {Mohamed Hussein}, {Zeti Azura}",
year = "2012",
doi = "10.3233/IDA-2012-0525",
language = "English",
volume = "16",
pages = "305--325",
journal = "Intelligent Data Analysis",
issn = "1088-467X",
publisher = "IOS Press",
number = "2",

}

TY - JOUR

T1 - An intelligent data pre-processing of complex datasets

AU - Abdul-Rahman, Shuzlina

AU - Abu Bakar, Azuraliza

AU - Mohamed Hussein, Zeti Azura

PY - 2012

Y1 - 2012

N2 - Pre-processing plays a vital role in classification tasks, particularly when complex features are involved, and it demands an intelligent method. In bioinformatics, where datasets are characterised by complex features, pre-processing is unavoidable. In this paper, we propose a framework that integrates the filter and wrapper approaches to select discriminatory features from protein sequences prior to classification. Several state-of-the-art multivariate filters were explored in the first phase to remove unwanted features that contributed to noise, while particle swarm optimisation (PSO) with a support vector machine (SVM) was adopted in the wrapper phase to produce the optimal feature subset. Several PSO variants were investigated in the wrapper phase to identify the most suitable variant for the problem domain. The results of both phases were analysed in terms of classification accuracy, number of selected features, modelling time and area under the curve, on the main dataset and on five benchmark machine learning datasets of similar complexity. The proposed framework reliably achieved higher classification accuracy than both the filter phase alone and the full feature set, despite using fewer features.

AB - Pre-processing plays a vital role in classification tasks, particularly when complex features are involved, and it demands an intelligent method. In bioinformatics, where datasets are characterised by complex features, pre-processing is unavoidable. In this paper, we propose a framework that integrates the filter and wrapper approaches to select discriminatory features from protein sequences prior to classification. Several state-of-the-art multivariate filters were explored in the first phase to remove unwanted features that contributed to noise, while particle swarm optimisation (PSO) with a support vector machine (SVM) was adopted in the wrapper phase to produce the optimal feature subset. Several PSO variants were investigated in the wrapper phase to identify the most suitable variant for the problem domain. The results of both phases were analysed in terms of classification accuracy, number of selected features, modelling time and area under the curve, on the main dataset and on five benchmark machine learning datasets of similar complexity. The proposed framework reliably achieved higher classification accuracy than both the filter phase alone and the full feature set, despite using fewer features.

KW - Classification

KW - data mining

KW - feature selection

KW - machine learning

KW - optimisation

KW - particle swarm optimisation

UR - http://www.scopus.com/inward/record.url?scp=84858216878&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84858216878&partnerID=8YFLogxK

U2 - 10.3233/IDA-2012-0525

DO - 10.3233/IDA-2012-0525

M3 - Article

AN - SCOPUS:84858216878

VL - 16

SP - 305

EP - 325

JO - Intelligent Data Analysis

JF - Intelligent Data Analysis

SN - 1088-467X

IS - 2

ER -