Boosting algorithms with topic modeling for multi-label text categorization: A comparative empirical study

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

Boosting algorithms have received significant attention over the past several years and are considered state-of-the-art classifiers for multi-label classification tasks. The disadvantage of using boosting algorithms for text categorization (TC) is the vast number of features generated by the traditional Bag-of-Words (BOW) text representation, which dramatically increases the computational complexity. This paper considers an alternative text representation method that uses topic modeling to enhance and accelerate multi-label boosting algorithms. An extensive empirical comparison of eight multi-label boosting algorithms using topic-based and BOW representation methods was undertaken. For the evaluation, three well-known multi-label TC datasets were used. Furthermore, to justify the boosting algorithms' performance, three well-known instance-based multi-label algorithms were included in the evaluation. To keep the evaluations credible, all algorithms were evaluated using their native software tools, apart from data formats and user settings. The experimental results demonstrated that the topic-based representation significantly accelerated all algorithms and slightly enhanced the classification performance, especially for near-balanced and balanced datasets. For the imbalanced dataset, BOW representation led to the best performance. The MP-Boost algorithm is the most efficient and effective algorithm for imbalanced datasets using BOW representation. For topic-based representation, AdaBoost.MH with the meta base learners Hamming Tree (AdaMH-Tree) and Product (AdaMH-Product) achieved the best performance; however, with respect to computational time, these algorithms are the slowest overall. Moreover, the results indicated that topic-based representation is more significant for instance-based algorithms; nevertheless, boosting algorithms such as MP-Boost, AdaMH-Tree and AdaMH-Product notably outperform them.
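
The contrast the abstract draws between BOW and topic-based representations can be illustrated with a small sketch. It is not the authors' pipeline: the toy corpus, the hand-written `WORD_TOPICS` table and the averaging rule in `topic_vector` are all assumptions made for illustration, standing in for a topic model that would normally be learned from data. The point is only that a BOW vector has one dimension per vocabulary word, while a topic-based vector has a fixed, much smaller number of dimensions.

```python
from collections import Counter

# Hypothetical toy corpus; not one of the datasets used in the paper.
DOCS = [
    "stocks fall as markets react to interest rates",
    "team wins match after late goal",
    "central bank raises interest rates again",
]

# Hypothetical word-to-topic weights standing in for a trained topic model.
# Each word maps to weights over K = 2 topics: (finance, sport).
WORD_TOPICS = {
    "stocks": (0.9, 0.1), "markets": (0.9, 0.1), "interest": (0.8, 0.2),
    "rates": (0.8, 0.2), "bank": (0.9, 0.1), "team": (0.1, 0.9),
    "wins": (0.2, 0.8), "match": (0.1, 0.9), "goal": (0.1, 0.9),
}
NUM_TOPICS = 2


def build_vocabulary(docs):
    """Return the sorted list of every distinct token in the corpus."""
    return sorted({tok for doc in docs for tok in doc.split()})


def bow_vector(doc, vocab):
    """Term-count vector with one dimension per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def topic_vector(doc):
    """Average the topic weights of the known words in the document.

    This is a crude stand-in for topic inference: the result always has
    NUM_TOPICS dimensions, regardless of vocabulary size.
    """
    totals = [0.0] * NUM_TOPICS
    known = 0
    for tok in doc.split():
        weights = WORD_TOPICS.get(tok)
        if weights is None:
            continue  # words missing from the table carry no topic signal
        known += 1
        for k, weight in enumerate(weights):
            totals[k] += weight
    if known == 0:
        return [1.0 / NUM_TOPICS] * NUM_TOPICS  # uniform fallback
    return [total / known for total in totals]


vocab = build_vocabulary(DOCS)
bow = [bow_vector(doc, vocab) for doc in DOCS]
topics = [topic_vector(doc) for doc in DOCS]

print("BOW dimensions:  ", len(bow[0]))
print("Topic dimensions:", len(topics[0]))
```

On a real corpus the vocabulary typically runs to tens of thousands of words, so a learner that scans every feature in each boosting round has far less work to do on the topic-based vectors. That is the kind of speed-up the abstract reports.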

Original language: English
Pages (from-to): 732-746
Number of pages: 15
Journal: Journal of Information Science
Volume: 41
Issue number: 5
DOI: 10.1177/0165551515590079
Publication status: Published - 24 Oct 2015

Fingerprint

Labels
performance
evaluation
Adaptive boosting
Computational complexity
Classifiers

Keywords

  • AdaBoost.MH
  • boosting
  • multi-label classification
  • text categorization
  • text representation
  • topic modeling

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

@article{b0751eba6275431fad2c62426c8ba0db,
title = "Boosting algorithms with topic modeling for multi-label text categorization: A comparative empirical study",
abstract = "Boosting algorithms have received significant attention over the past several years and are considered to be the state-of-the-art classifiers for multi-label classification tasks. The disadvantage of using boosting algorithms for text categorization (TC) is the vast number of features that are generated using the traditional Bag-of-Words (BOW) text representation, which dramatically increases the computational complexity. In this paper, an alternative text representation method using topic modeling for enhancing and accelerating multi-label boosting algorithms is concerned. An extensive empirical experimental comparison of eight multi-label boosting algorithms using topic-based and BOW representation methods was undertaken. For the evaluation, three well-known multi-label TC datasets were used. Furthermore, to justify boosting algorithms performance, three well-known instance-based multi-label algorithms were involved in the evaluation. For completely credible evaluations, all algorithms were evaluated using their native software tools, except for data formats and user settings. The experimental results demonstrated that the topic-based representation significantly accelerated all algorithms and slightly enhanced the classification performance, especially for near-balanced and balanced datasets. For the imbalanced dataset, BOW representation led to the best performance. The MP-Boost algorithm is the most efficient and effective algorithm for imbalanced datasets using BOW representation. For topic-based representation, AdaBoost.MH with meta base learners, Hamming Tree (AdaMH-Tree) and Product (AdaMH-Product) achieved the best performance; however, with respect to the computational time, these algorithms are the slowest overall. 
Moreover, the results indicated that topic-based representation is more significant for instance-based algorithms; nevertheless, boosting algorithms, such as MP-Boost, AdaMH-Tree and AdaMH-Product, notably exceed their performance.",
keywords = "AdaBoost.MH, boosting, multi-label classification, text categorization, text representation, topic modeling",
author = "Bassam Al-Salemi and {Ab Aziz}, {Mohd Juzaiddin} and {Mohd Noah}, {Shahrul Azman}",
year = "2015",
month = "10",
day = "24",
doi = "10.1177/0165551515590079",
language = "English",
volume = "41",
pages = "732--746",
journal = "Journal of Information Science",
issn = "0165-5515",
publisher = "SAGE Publications Ltd",
number = "5",
}

TY - JOUR

T1 - Boosting algorithms with topic modeling for multi-label text categorization

T2 - A comparative empirical study

AU - Al-Salemi, Bassam

AU - Ab Aziz, Mohd Juzaiddin

AU - Mohd Noah, Shahrul Azman

PY - 2015/10/24

Y1 - 2015/10/24

AB - Boosting algorithms have received significant attention over the past several years and are considered to be the state-of-the-art classifiers for multi-label classification tasks. The disadvantage of using boosting algorithms for text categorization (TC) is the vast number of features that are generated using the traditional Bag-of-Words (BOW) text representation, which dramatically increases the computational complexity. In this paper, an alternative text representation method using topic modeling for enhancing and accelerating multi-label boosting algorithms is concerned. An extensive empirical experimental comparison of eight multi-label boosting algorithms using topic-based and BOW representation methods was undertaken. For the evaluation, three well-known multi-label TC datasets were used. Furthermore, to justify boosting algorithms performance, three well-known instance-based multi-label algorithms were involved in the evaluation. For completely credible evaluations, all algorithms were evaluated using their native software tools, except for data formats and user settings. The experimental results demonstrated that the topic-based representation significantly accelerated all algorithms and slightly enhanced the classification performance, especially for near-balanced and balanced datasets. For the imbalanced dataset, BOW representation led to the best performance. The MP-Boost algorithm is the most efficient and effective algorithm for imbalanced datasets using BOW representation. For topic-based representation, AdaBoost.MH with meta base learners, Hamming Tree (AdaMH-Tree) and Product (AdaMH-Product) achieved the best performance; however, with respect to the computational time, these algorithms are the slowest overall. Moreover, the results indicated that topic-based representation is more significant for instance-based algorithms; nevertheless, boosting algorithms, such as MP-Boost, AdaMH-Tree and AdaMH-Product, notably exceed their performance.

KW - AdaBoost.MH

KW - boosting

KW - multi-label classification

KW - text categorization

KW - text representation

KW - topic modeling

UR - http://www.scopus.com/inward/record.url?scp=84942085806&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84942085806&partnerID=8YFLogxK

U2 - 10.1177/0165551515590079

DO - 10.1177/0165551515590079

M3 - Article

AN - SCOPUS:84942085806

VL - 41

SP - 732

EP - 746

JO - Journal of Information Science

JF - Journal of Information Science

SN - 0165-5515

IS - 5

ER -