Feature ranking for enhancing boosting-based multi-label text categorization

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Boosting algorithms have proved effective for multi-label learning. As ensemble learning algorithms, boosting algorithms build classifiers by composing a set of weak hypotheses. The high computational cost of boosting algorithms in learning from large volumes of data, such as text categorization datasets, is a real challenge. Most boosting algorithms, such as AdaBoost.MH, iteratively examine all training features to generate the weak hypotheses, which increases the learning time. RFBoost was introduced to manage this problem based on a rank-and-filter strategy in which it first ranks the training features and then, in each learning iteration, filters and uses only a subset of the highest-ranked features to construct the weak hypotheses. This step accelerates learning compared to AdaBoost.MH, as the weak hypotheses produced in each iteration are reduced to a very small number. As feature ranking is the core idea of RFBoost, this paper presents and investigates seven feature ranking methods (information gain, chi-square, GSS-coefficient, mutual information, odds ratio, F1 score, and accuracy) in order to improve RFBoost's performance. Moreover, an accelerated version of RFBoost, called RFBoost1, is also introduced. Rather than filtering a subset of the highest-ranked features, RFBoost1 selects only one feature, based on its weight, to build a new weak hypothesis. Experimental results on four benchmark datasets for multi-label text categorization (Reuters-21578, 20-Newsgroups, OHSUMED, and TMC2007) demonstrate that among the methods evaluated for feature ranking, mutual information yields the best performance for RFBoost. In addition, the results show that RFBoost statistically outperforms both RFBoost1 and AdaBoost.MH on all datasets. Finally, RFBoost1 proved more efficient than AdaBoost.MH, making it a better alternative for addressing classification problems in real-life applications and expert systems.
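The rank-and-filter idea described above can be sketched as follows. This is a minimal, single-label simplification for illustration only, not the authors' implementation: it estimates mutual information between binary term presence and a document label from co-occurrence counts, ranks the vocabulary by that score, and keeps only the top-k features (the subset from which weak hypotheses would then be built). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """Estimate I(term presence; label) from co-occurrence counts.

    docs:   list of sets of terms (bag-of-words presence)
    labels: one label per document (single-label simplification)
    """
    n = len(docs)
    joint = Counter((term in doc, y) for doc, y in zip(docs, labels))
    px = Counter(term in doc for doc in docs)   # marginal of term presence
    py = Counter(labels)                        # marginal of labels
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with counts cancelled into one ratio
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

def rank_and_filter(docs, labels, vocabulary, top_k):
    """Rank features by mutual information and keep only the top_k."""
    ranked = sorted(vocabulary,
                    key=lambda t: mutual_information(docs, labels, t),
                    reverse=True)
    return ranked[:top_k]

# Toy corpus: two sports documents, two finance documents.
docs = [{"goal", "match"}, {"goal", "league"},
        {"stock", "market"}, {"market", "trade"}]
labels = ["sport", "sport", "finance", "finance"]
vocab = {"goal", "match", "league", "stock", "market", "trade"}

print(rank_and_filter(docs, labels, vocab, 2))
```

On this toy data, "goal" and "market" each perfectly predict their class and receive the highest scores, so only they survive the filter. In RFBoost the surviving subset is what each boosting iteration searches when constructing a weak hypothesis, rather than the full vocabulary.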

Original language: English
Pages (from-to): 531-543
Number of pages: 13
Journal: Expert Systems with Applications
Volume: 113
DOI: 10.1016/j.eswa.2018.07.024
Publication status: Published - 15 Dec 2018

Keywords

  • Boosting
  • Feature ranking
  • Multi-label learning
  • RFBoost
  • Text categorization

ASJC Scopus subject areas

  • Engineering(all)
  • Computer Science Applications
  • Artificial Intelligence

Cite this

Feature ranking for enhancing boosting-based multi-label text categorization. / Al-Salemi, Bassam; Ayob, Masri; Mohd Noah, Shahrul Azman.

In: Expert Systems with Applications, Vol. 113, 15.12.2018, p. 531-543.

Research output: Contribution to journal › Article

@article{659b99660284487d862b1a69a9c9e828,
title = "Feature ranking for enhancing boosting-based multi-label text categorization",
abstract = "Boosting algorithms have proved effective for multi-label learning. As ensemble learning algorithms, boosting algorithms build classifiers by composing a set of weak hypotheses. The high computational cost of boosting algorithms in learning from large volumes of data, such as text categorization datasets, is a real challenge. Most boosting algorithms, such as AdaBoost.MH, iteratively examine all training features to generate the weak hypotheses, which increases the learning time. RFBoost was introduced to manage this problem based on a rank-and-filter strategy in which it first ranks the training features and then, in each learning iteration, filters and uses only a subset of the highest-ranked features to construct the weak hypotheses. This step accelerates learning compared to AdaBoost.MH, as the weak hypotheses produced in each iteration are reduced to a very small number. As feature ranking is the core idea of RFBoost, this paper presents and investigates seven feature ranking methods (information gain, chi-square, GSS-coefficient, mutual information, odds ratio, F1 score, and accuracy) in order to improve RFBoost's performance. Moreover, an accelerated version of RFBoost, called RFBoost1, is also introduced. Rather than filtering a subset of the highest-ranked features, RFBoost1 selects only one feature, based on its weight, to build a new weak hypothesis. Experimental results on four benchmark datasets for multi-label text categorization (Reuters-21578, 20-Newsgroups, OHSUMED, and TMC2007) demonstrate that among the methods evaluated for feature ranking, mutual information yields the best performance for RFBoost. In addition, the results show that RFBoost statistically outperforms both RFBoost1 and AdaBoost.MH on all datasets. Finally, RFBoost1 proved more efficient than AdaBoost.MH, making it a better alternative for addressing classification problems in real-life applications and expert systems.",
keywords = "Boosting, Feature ranking, Multi-label learning, RFBoost, Text categorization",
author = "Bassam Al-Salemi and Masri Ayob and {Mohd Noah}, {Shahrul Azman}",
year = "2018",
month = "12",
day = "15",
doi = "10.1016/j.eswa.2018.07.024",
language = "English",
volume = "113",
pages = "531--543",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier Limited",

}

TY - JOUR

T1 - Feature ranking for enhancing boosting-based multi-label text categorization

AU - Al-Salemi, Bassam

AU - Ayob, Masri

AU - Mohd Noah, Shahrul Azman

PY - 2018/12/15

Y1 - 2018/12/15

N2 - Boosting algorithms have proved effective for multi-label learning. As ensemble learning algorithms, boosting algorithms build classifiers by composing a set of weak hypotheses. The high computational cost of boosting algorithms in learning from large volumes of data, such as text categorization datasets, is a real challenge. Most boosting algorithms, such as AdaBoost.MH, iteratively examine all training features to generate the weak hypotheses, which increases the learning time. RFBoost was introduced to manage this problem based on a rank-and-filter strategy in which it first ranks the training features and then, in each learning iteration, filters and uses only a subset of the highest-ranked features to construct the weak hypotheses. This step accelerates learning compared to AdaBoost.MH, as the weak hypotheses produced in each iteration are reduced to a very small number. As feature ranking is the core idea of RFBoost, this paper presents and investigates seven feature ranking methods (information gain, chi-square, GSS-coefficient, mutual information, odds ratio, F1 score, and accuracy) in order to improve RFBoost's performance. Moreover, an accelerated version of RFBoost, called RFBoost1, is also introduced. Rather than filtering a subset of the highest-ranked features, RFBoost1 selects only one feature, based on its weight, to build a new weak hypothesis. Experimental results on four benchmark datasets for multi-label text categorization (Reuters-21578, 20-Newsgroups, OHSUMED, and TMC2007) demonstrate that among the methods evaluated for feature ranking, mutual information yields the best performance for RFBoost. In addition, the results show that RFBoost statistically outperforms both RFBoost1 and AdaBoost.MH on all datasets. Finally, RFBoost1 proved more efficient than AdaBoost.MH, making it a better alternative for addressing classification problems in real-life applications and expert systems.

AB - Boosting algorithms have proved effective for multi-label learning. As ensemble learning algorithms, boosting algorithms build classifiers by composing a set of weak hypotheses. The high computational cost of boosting algorithms in learning from large volumes of data, such as text categorization datasets, is a real challenge. Most boosting algorithms, such as AdaBoost.MH, iteratively examine all training features to generate the weak hypotheses, which increases the learning time. RFBoost was introduced to manage this problem based on a rank-and-filter strategy in which it first ranks the training features and then, in each learning iteration, filters and uses only a subset of the highest-ranked features to construct the weak hypotheses. This step accelerates learning compared to AdaBoost.MH, as the weak hypotheses produced in each iteration are reduced to a very small number. As feature ranking is the core idea of RFBoost, this paper presents and investigates seven feature ranking methods (information gain, chi-square, GSS-coefficient, mutual information, odds ratio, F1 score, and accuracy) in order to improve RFBoost's performance. Moreover, an accelerated version of RFBoost, called RFBoost1, is also introduced. Rather than filtering a subset of the highest-ranked features, RFBoost1 selects only one feature, based on its weight, to build a new weak hypothesis. Experimental results on four benchmark datasets for multi-label text categorization (Reuters-21578, 20-Newsgroups, OHSUMED, and TMC2007) demonstrate that among the methods evaluated for feature ranking, mutual information yields the best performance for RFBoost. In addition, the results show that RFBoost statistically outperforms both RFBoost1 and AdaBoost.MH on all datasets. Finally, RFBoost1 proved more efficient than AdaBoost.MH, making it a better alternative for addressing classification problems in real-life applications and expert systems.

KW - Boosting

KW - Feature ranking

KW - Multi-label learning

KW - RFBoost

KW - Text categorization

UR - http://www.scopus.com/inward/record.url?scp=85050146494&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050146494&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2018.07.024

DO - 10.1016/j.eswa.2018.07.024

M3 - Article

VL - 113

SP - 531

EP - 543

JO - Expert Systems with Applications

JF - Expert Systems with Applications

SN - 0957-4174

ER -