BoWT

A hybrid text representation model for improving text categorization based on Adaboost.MH

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Text representation is the fundamental task in a text categorization system. The BAG-OF-WORDS (BOW) is a typical model that represents texts as vectors of single words. Although it is a simple representation model, BOW has been criticized for disregarding the relationships between words. Alternatively, the Latent Dirichlet Allocation (LDA) topic model has been proposed to represent texts as a BAG-OF-TOPICS (BOT). In LDA, the words in the corpus are statistically grouped into a small number of themes called “latent topics”, which capture the semantic relationships between the words. Representing the documents with BOT therefore dramatically reduces the training time and also improves the classification performance. However, BOT has been proven to be ineffective for imbalanced datasets. Accordingly, this paper presents BOWT, a hybrid text representation model that combines BOW and BOT. In BOWT, the highly weighted BOW features are merged with the BOT features to produce a new feature space. The proposed BOWT representation model is evaluated for multi-label text categorization with the well-known boosting algorithm ADABOOST.MH. The experimental results on four benchmarks demonstrate that the BOWT representation model notably outperforms both BOW and BOT and dramatically improves the classification performance of ADABOOST.MH for text categorization.
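The merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how BOW features are weighted or how many are kept, so the sketch assumes TF-IDF weights and a top-k cut-off by total weight. The topic proportions are passed in as hand-written placeholder values rather than inferred by LDA, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            (counts[t] / len(toks)) * math.log(n_docs / df[t]) if toks else 0.0
            for t in vocab
        ])
    return vocab, rows


def top_k_terms(vocab, rows, k):
    """Column indices of the k terms with the highest total TF-IDF weight.

    Assumption: "high weighted" is read here as highest summed TF-IDF.
    """
    totals = [sum(row[j] for row in rows) for j in range(len(vocab))]
    ranked = sorted(range(len(vocab)), key=lambda j: (-totals[j], vocab[j]))
    return sorted(ranked[:k])


def bowt_features(docs, topic_props, k):
    """Concatenate the top-k BOW columns with per-document topic proportions."""
    vocab, rows = tfidf_matrix(docs)
    keep = top_k_terms(vocab, rows, k)
    n_topics = len(topic_props[0])
    names = [vocab[j] for j in keep] + [f"topic_{t}" for t in range(n_topics)]
    merged = [
        [row[j] for j in keep] + list(props)
        for row, props in zip(rows, topic_props)
    ]
    return names, merged


docs = [
    "stock market falls",
    "market rally lifts stock prices",
    "team wins football match",
]
# Placeholder topic proportions (assumed, not produced by a real LDA run).
topic_props = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
names, merged = bowt_features(docs, topic_props, k=3)
```

Each row of `merged` holds three word-level weights followed by two topic proportions, so a classifier trained on it sees both kinds of feature in a single vector.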

Original language: English
Title of host publication: Multi-disciplinary Trends in Artificial Intelligence - 10th International Workshop, MIWAI 2016, Proceedings
Publisher: Springer Verlag
Pages: 3-11
Number of pages: 9
Volume: 10053 LNAI
ISBN (Print): 9783319493961
DOIs: https://doi.org/10.1007/978-3-319-49397-8_1
Publication status: Published - 2016
Event: 10th Multi-Disciplinary International Workshop on Artificial Intelligence, MIWAI 2016 - Chiang Mai, Thailand
Duration: 7 Dec 2016 to 9 Dec 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10053 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 10th Multi-Disciplinary International Workshop on Artificial Intelligence, MIWAI 2016
Country: Thailand
City: Chiang Mai
Period: 7/12/16 to 9/12/16

Keywords

  • Adaboost.MH
  • BoWT
  • Text categorization
  • Text representation
  • Topic modeling

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Al-Salemi, B., Ab Aziz, M. J., & Mohd Noah, S. A. (2016). BoWT: A hybrid text representation model for improving text categorization based on Adaboost.MH. In Multi-disciplinary Trends in Artificial Intelligence - 10th International Workshop, MIWAI 2016, Proceedings (Vol. 10053 LNAI, pp. 3-11). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10053 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-319-49397-8_1

@inproceedings{4f91662d73aa49109822d8c4289ce740,
title = "BoWT: A hybrid text representation model for improving text categorization based on Adaboost.MH",
keywords = "Adaboost.MH, BoWT, Text categorization, Text representation, Topic modeling",
author = "Bassam Al-Salemi and {Ab Aziz}, {Mohd Juzaiddin} and {Mohd Noah}, {Shahrul Azman}",
year = "2016",
doi = "10.1007/978-3-319-49397-8_1",
language = "English",
isbn = "9783319493961",
volume = "10053 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "3--11",
booktitle = "Multi-disciplinary Trends in Artificial Intelligence - 10th International Workshop, MIWAI 2016, Proceedings",
address = "Germany",

}

TY - GEN

T1 - BoWT

T2 - A hybrid text representation model for improving text categorization based on Adaboost.MH

AU - Al-Salemi, Bassam

AU - Ab Aziz, Mohd Juzaiddin

AU - Mohd Noah, Shahrul Azman

PY - 2016

Y1 - 2016

KW - Adaboost.MH

KW - BoWT

KW - Text categorization

KW - Text representation

KW - Topic modeling

UR - http://www.scopus.com/inward/record.url?scp=85007188858&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85007188858&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-49397-8_1

DO - 10.1007/978-3-319-49397-8_1

M3 - Conference contribution

SN - 9783319493961

VL - 10053 LNAI

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 3

EP - 11

BT - Multi-disciplinary Trends in Artificial Intelligence - 10th International Workshop, MIWAI 2016, Proceedings

PB - Springer Verlag

ER -