New approach for imbalanced biological dataset classification

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

This paper presents a new ensemble classifier for the class imbalance problem, with an emphasis on two-class (binary) classification. The method combines the SMOTE (Synthetic Minority Over-sampling Technique), Rotation Forest, and AdaBoostM1 algorithms. SMOTE was used to over-sample the minority class at 100%, 200%, 300%, 400%, and 500% of its initial size, and attribute selection was applied to keep the classifier from over-fitting. The ensemble is intended to address the classification of imbalanced biological datasets by lowering the prediction error and raising prediction performance: the Rotation Forest algorithm was used to produce an ensemble classifier with a lower prediction error, while the AdaBoostM1 algorithm was used to enhance the classifier's performance. All tests were carried out on the training datasets using the Java-based WEKA (Waikato Environment for Knowledge Analysis) and Orange Canvas data mining systems. The performance of three types of classifiers on imbalanced biomedical datasets was assessed, and the paper examines how well the new method produces an accurate overall classifier and lowers its overall error rate. Tests were carried out on three real imbalanced biomedical datasets obtained from the KEEL dataset repository. These imbalanced datasets were divided into ten categories according to their imbalance ratios (IR), which ranged from 1.86 to 41.40. The results, assessed with various evaluation metrics, indicated that the proposed combination of three methods was effective. In practical terms, using SMOTE-RotBoost to classify biological datasets results in a low mean absolute error as well as high accuracy and precision. The Kappa coefficient values were close to 1, indicating close agreement between the predicted and actual classes, while the false negative rates, which were close to 0, showed the reliability of the measurements. SMOTE-RotBoost also yields a larger area under the ROC curve (AUC) than the other classifiers, a measure that is vital in the assessment of diagnostic tests.
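To make the over-sampling percentages and two of the reported metrics concrete, here is a minimal sketch in plain Python. It is not the authors' WEKA or Orange setup: the function names (`imbalance_ratio`, `smote`, `cohen_kappa`), the default `k`, the seed, and the toy data are all assumptions chosen for illustration, and the SMOTE routine is a simplified version that assumes numeric features and at least two minority samples.

```python
import math
import random


def imbalance_ratio(labels, minority):
    """IR = majority count / minority count (binary case)."""
    n_min = sum(1 for y in labels if y == minority)
    return (len(labels) - n_min) / n_min


def smote(minority, percent, k=5, seed=0):
    """Return len(minority) * percent / 100 synthetic minority points.

    Simplified SMOTE: each synthetic point is a random interpolation between
    a minority sample and one of its k nearest minority neighbours.
    `percent` is assumed to be a multiple of 100, as in the abstract.
    """
    rng = random.Random(seed)
    per_sample = percent // 100  # e.g. 300% -> 3 synthetic points per sample
    synthetic = []
    for i, x in enumerate(minority):
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): 12 majority samples, 4 minority samples.
labels = [0] * 12 + [1] * 4
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]

new_points = smote(minority_points, percent=200, k=3)
print(imbalance_ratio(labels, minority=1))  # 3.0
print(len(new_points))                      # 8
```

With 200% over-sampling the four minority samples gain eight synthetic neighbours, so the toy set goes from an IR of 3.0 to 12 / 12 = 1.0. In the pipeline the abstract describes, the over-sampled data would then be passed to the Rotation Forest and AdaBoostM1 stages, which this sketch does not attempt to reproduce.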

Original language: English
Pages (from-to): 40-57
Number of pages: 18
Journal: Journal of Theoretical and Applied Information Technology
Volume: 72
Issue number: 1
Publication status: Published - 2015

Fingerprint

Oversampling
Classifiers
Ensemble Classifier
Prediction Error
Sampling
Error Rate
Diagnostic Tests
Binary Classification
Performance Prediction
Repository
Java
Data Mining
High Accuracy
Sample Size
Attribute
Metric
Curve
Output

Keywords

  • Bagging
  • Boosting
  • Random Subspace
  • Rotation Forest
  • SMOTE

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

New approach for imbalanced biological dataset classification. / Fattahi, Seyyedali; Othman, Zalinda; Ali Othman, Zulaiha.

In: Journal of Theoretical and Applied Information Technology, Vol. 72, No. 1, 2015, p. 40-57.

@article{f6b842d2d0c1429eaf49896e0f0cad67,
title = "New approach for imbalanced biological dataset classification",
keywords = "Bagging, Boosting, Random Subspace, Rotation Forest, SMOTE",
author = "Seyyedali Fattahi and Zalinda Othman and {Ali Othman}, Zulaiha",
year = "2015",
language = "English",
volume = "72",
pages = "40--57",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "1",

}

TY - JOUR

T1 - New approach for imbalanced biological dataset classification

AU - Fattahi, Seyyedali

AU - Othman, Zalinda

AU - Ali Othman, Zulaiha

PY - 2015

Y1 - 2015

KW - Bagging

KW - Boosting

KW - Random Subspace

KW - Rotation Forest

KW - SMOTE

UR - http://www.scopus.com/inward/record.url?scp=84922710270&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84922710270&partnerID=8YFLogxK

M3 - Article

VL - 72

SP - 40

EP - 57

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 1

ER -