LDA-AdaBoost.MH: Accelerated AdaBoost.MH based on latent Dirichlet allocation for text categorization

Research output: Contribution to journal (Article)

16 Citations (Scopus)

Abstract

AdaBoost.MH is a boosting algorithm that is considered one of the most accurate algorithms for multilabel classification. It works by iteratively building a committee of weak hypotheses in the form of decision stumps. To build each weak hypothesis, AdaBoost.MH takes the full set of extracted features in every iteration and examines them one by one to check their ability to characterize the appropriate category. Using a bag-of-words text representation therefore dramatically increases the computational time of AdaBoost.MH learning, especially for large-scale datasets. In this paper we demonstrate how to improve the efficiency and effectiveness of AdaBoost.MH by using latent topics rather than words. A well-known probabilistic topic modelling method, latent Dirichlet allocation (LDA), is used to estimate the latent topics in the corpus, which then serve as features for AdaBoost.MH. To evaluate LDA-AdaBoost.MH, four datasets were used: Reuters-21578-ModApte, WebKB, 20-Newsgroups and a collection of Arabic news. The experimental results confirmed that representing the texts as a small number of latent topics, rather than a large number of words, significantly decreased the computational time of AdaBoost.MH learning and improved its performance for text categorization.
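To make the cost argument concrete, the following is a minimal Python sketch of a discrete AdaBoost.MH with decision stumps. It is an illustration under stated assumptions, not the authors' implementation: the function names `train_adaboost_mh` and `predict` are hypothetical, the stump form (one threshold on one feature plus a +1/-1 vote per label) is one common formulation, and the input vectors are assumed to be per-document topic proportions that an LDA model has already produced (LDA itself is not shown). The loop over `j` is the step the abstract describes: every feature is examined in every round, so the work per round grows with the number of features.

```python
import math


def train_adaboost_mh(X, Y, rounds=10):
    """Sketch of discrete AdaBoost.MH with decision stumps (hypothetical helper).

    X: list of feature vectors (assumed here to be LDA topic proportions).
    Y: list of label vectors with entries +1 (has label) or -1 (does not).
    Returns a committee of (alpha, feature, threshold, votes) tuples.
    """
    n, d, k = len(X), len(X[0]), len(Y[0])
    # One weight per (example, label) pair, initially uniform.
    D = [[1.0 / (n * k)] * k for _ in range(n)]
    committee = []
    for _ in range(rounds):
        best = None
        for j in range(d):  # every feature is examined in every round
            values = sorted({x[j] for x in X})
            for lo, hi in zip(values, values[1:]):
                t = (lo + hi) / 2.0
                # Weighted agreement between the cut "x[j] > t" and each label.
                edges = [0.0] * k
                for i in range(n):
                    phi = 1.0 if X[i][j] > t else -1.0
                    for l in range(k):
                        edges[l] += D[i][l] * Y[i][l] * phi
                r = sum(abs(e) for e in edges)
                if best is None or r > best[0]:
                    votes = [1.0 if e >= 0 else -1.0 for e in edges]
                    best = (r, j, t, votes)
        if best is None:  # no feature takes two distinct values
            break
        r, j, t, votes = best
        r = min(r, 1.0 - 1e-10)  # keep alpha finite when a stump is perfect
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
        committee.append((alpha, j, t, votes))
        # Raise the weight of misclassified pairs, lower the rest, renormalize.
        total = 0.0
        for i in range(n):
            phi = 1.0 if X[i][j] > t else -1.0
            for l in range(k):
                D[i][l] *= math.exp(-alpha * Y[i][l] * votes[l] * phi)
                total += D[i][l]
        for i in range(n):
            for l in range(k):
                D[i][l] /= total
    return committee


def predict(committee, x, k):
    """Sum the weighted stump votes per label and threshold at zero."""
    scores = [0.0] * k
    for alpha, j, t, votes in committee:
        phi = 1.0 if x[j] > t else -1.0
        for l in range(k):
            scores[l] += alpha * votes[l] * phi
    return [1 if s > 0 else -1 for s in scores]


# Toy data: four documents described by two topic proportions, two labels.
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
Y = [[1, -1], [1, -1], [-1, 1], [-1, 1]]
committee = train_adaboost_mh(X, Y, rounds=3)
print(predict(committee, [0.85, 0.15], k=2))
```

With a bag-of-words representation `d` would be the vocabulary size, often tens of thousands of features, whereas a topic representation keeps `d` at the chosen number of topics, which is the source of the speed-up the abstract reports.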

Original language: English
Pages (from-to): 27-40
Number of pages: 14
Journal: Journal of Information Science
Volume: 41
Issue number: 1
DOI: 10.1177/0165551514551496
Publication status: Published - 13 Feb 2015

Fingerprint

Adaptive boosting
internet community
learning
news
efficiency
ability
performance
time

Keywords

  • AdaBoost.MH
  • Boosting
  • Latent Dirichlet allocation
  • Text categorization
  • Topic modeling

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

@article{87409f443e2542f5b54c4352c999fff0,
title = "{LDA-AdaBoost.MH}: Accelerated {AdaBoost.MH} based on latent {Dirichlet} allocation for text categorization",
abstract = "AdaBoost.MH is a boosting algorithm that is considered to be one of the most accurate algorithms for multilabel classification. It works by iteratively building a committee of weak hypotheses of decision stumps. To build the weak hypotheses, in each iteration, AdaBoost.MH obtains the whole extracted features and examines them one by one to check their ability to characterize the appropriate category. Using Bag-Of-Words for text representation dramatically increases the computational time of AdaBoost.MH learning, especially for large-scale datasets. In this paper we demonstrate how to improve the efficiency and effectiveness of AdaBoost.MH using latent topics, rather than words. A well-known probabilistic topic modelling method, Latent Dirichlet Allocation, is used to estimate the latent topics in the corpus as features for AdaBoost.MH. To evaluate LDA-AdaBoost.MH, the following four datasets have been used: Reuters-21578-ModApte, WebKB, 20-Newsgroups and a collection of Arabic news. The experimental results confirmed that representing the texts as a small number of latent topics, rather than a large number of words, significantly decreased the computational time of AdaBoost.MH learning and improved its performance for text categorization.",
keywords = "AdaBoost.MH, Boosting, Latent Dirichlet allocation, Text categorization, Topic modeling",
author = "Bassam Al-Salemi and {Ab Aziz}, {Mohd Juzaiddin} and {Mohd Noah}, {Shahrul Azman}",
year = "2015",
month = "2",
day = "13",
doi = "10.1177/0165551514551496",
language = "English",
volume = "41",
pages = "27--40",
journal = "Journal of Information Science",
issn = "0165-5515",
publisher = "SAGE Publications Ltd",
number = "1",

}

TY - JOUR

T1 - LDA-AdaBoost.MH

T2 - Accelerated AdaBoost.MH based on latent Dirichlet allocation for text categorization

AU - Al-Salemi, Bassam

AU - Ab Aziz, Mohd Juzaiddin

AU - Mohd Noah, Shahrul Azman

PY - 2015/2/13

Y1 - 2015/2/13

AB - AdaBoost.MH is a boosting algorithm that is considered to be one of the most accurate algorithms for multilabel classification. It works by iteratively building a committee of weak hypotheses of decision stumps. To build the weak hypotheses, in each iteration, AdaBoost.MH obtains the whole extracted features and examines them one by one to check their ability to characterize the appropriate category. Using Bag-Of-Words for text representation dramatically increases the computational time of AdaBoost.MH learning, especially for large-scale datasets. In this paper we demonstrate how to improve the efficiency and effectiveness of AdaBoost.MH using latent topics, rather than words. A well-known probabilistic topic modelling method, Latent Dirichlet Allocation, is used to estimate the latent topics in the corpus as features for AdaBoost.MH. To evaluate LDA-AdaBoost.MH, the following four datasets have been used: Reuters-21578-ModApte, WebKB, 20-Newsgroups and a collection of Arabic news. The experimental results confirmed that representing the texts as a small number of latent topics, rather than a large number of words, significantly decreased the computational time of AdaBoost.MH learning and improved its performance for text categorization.

KW - AdaBoost.MH

KW - Boosting

KW - Latent Dirichlet allocation

KW - Text categorization

KW - Topic modeling

UR - http://www.scopus.com/inward/record.url?scp=84920861875&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84920861875&partnerID=8YFLogxK

U2 - 10.1177/0165551514551496

DO - 10.1177/0165551514551496

M3 - Article

AN - SCOPUS:84920861875

VL - 41

SP - 27

EP - 40

JO - Journal of Information Science

JF - Journal of Information Science

SN - 0165-5515

IS - 1

ER -