Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms

Bassam Al-Salemi, Masri Ayob, Graham Kendall, Shahrul Azman Mohd Noah

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Multi-label text categorization is the problem of assigning each document to a subset of categories by means of multi-label learning algorithms. Unlike English and most other languages, Arabic lacks benchmark datasets, which prevents the evaluation of multi-label learning algorithms for Arabic text categorization. As a result, only a few recent studies have dealt with multi-label Arabic text categorization, and they used non-benchmark, inaccessible datasets. This work therefore aims to promote multi-label Arabic text categorization by (a) introducing “RTAnews”, a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks, which is publicly available in several formats compatible with existing multi-label learning tools such as MEKA and Mulan; and (b) conducting an extensive comparison of most of the well-known multi-label learning algorithms on RTAnews, in order to establish baseline results and show the effectiveness of these algorithms for Arabic text categorization. The evaluation involves four transformation-based algorithms (Binary Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison and Label Powerset) with three base learners (Support Vector Machine, k-Nearest-Neighbors and Random Forest), and four adaptation-based algorithms (Multi-label kNN, Instance-Based Learning by Logistic Regression Multi-label, Binary Relevance kNN and RFBoost). The reported baseline results show that RFBoost and Label Powerset with Support Vector Machine as base learner outperformed the other compared algorithms. The results also show that adaptation-based algorithms are faster than transformation-based algorithms.
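
As a rough illustration of the two simplest transformation-based strategies named in the abstract, the Python sketch below shows how Binary Relevance and Label Powerset turn a multi-label target into single-label problems. The four toy label sets and the function names are invented for illustration and are not taken from RTAnews, MEKA or Mulan; no classifier is trained, only the target transformation is shown.

```python
# Toy multi-label targets: each document carries a set of category labels.
# These label sets are hypothetical and are NOT drawn from RTAnews.
label_sets = [
    {"politics", "economy"},
    {"sports"},
    {"politics"},
    {"economy", "politics"},
]


def binary_relevance_targets(label_sets):
    """Binary Relevance: build one independent 0/1 target vector per label."""
    all_labels = sorted(set().union(*label_sets))
    return {
        label: [int(label in labels) for labels in label_sets]
        for label in all_labels
    }


def label_powerset_targets(label_sets):
    """Label Powerset: treat each distinct label subset as one multiclass target."""
    return [tuple(sorted(labels)) for labels in label_sets]


br = binary_relevance_targets(label_sets)
lp = label_powerset_targets(label_sets)

for label, target in br.items():
    print(label, target)
print(lp)
```

Binary Relevance yields one binary problem per label, whereas Label Powerset yields a single multiclass problem with one class per distinct label subset, so documents 1 and 4 above share a class.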

Original language: English
Pages (from-to): 212-227
Number of pages: 16
Journal: Information Processing and Management
Volume: 56
Issue number: 1
DOI: 10.1016/j.ipm.2018.09.008
Publication status: Published - 1 Jan 2019

Keywords

  • Arabic text categorization
  • Multi-label benchmark
  • Multi-label learning
  • RTAnews

ASJC Scopus subject areas

  • Information Systems
  • Media Technology
  • Computer Science Applications
  • Management Science and Operations Research
  • Library and Information Sciences

Cite this

Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms. / Al-Salemi, Bassam; Ayob, Masri; Kendall, Graham; Mohd Noah, Shahrul Azman.

In: Information Processing and Management, Vol. 56, No. 1, 01.01.2019, p. 212-227.

@article{1776f9c187cf40cdb0d35158ae389ed9,
title = "Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms",
abstract = "Multi-label text categorization refers to the problem of assigning each document to a subset of categories by means of multi-label learning algorithms. Unlike English and most other languages, the unavailability of Arabic benchmark datasets prevents evaluating multi-label learning algorithms for Arabic text categorization. As a result, only a few recent studies have dealt with multi-label Arabic text categorization on non-benchmark and inaccessible datasets. Therefore, this work aims to promote multi-label Arabic text categorization through (a) introducing “RTAnews”, a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks. The benchmark is publicly available in several formats compatible with the existing multi-label learning tools, such as MEKA and Mulan. (b) Conducting an extensive comparison of most of the well-known multi-label learning algorithms for Arabic text categorization in order to have baseline results and show the effectiveness of these algorithms for Arabic text categorization on RTAnews. The evaluation involves four multi-label transformation-based algorithms: Binary Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison and Label Powerset, with three base learners (Support Vector Machine, k-Nearest-Neighbors and Random Forest); and four adaptation-based algorithms (Multi-label kNN, Instance-Based Learning by Logistic Regression Multi-label, Binary Relevance kNN and RFBoost). The reported baseline results show that both RFBoost and Label Powerset with Support Vector Machine as base learner outperformed other compared algorithms. Results also demonstrated that adaptation-based algorithms are faster than transformation-based algorithms.",
keywords = "Arabic text categorization, Multi-label benchmark, Multi-label learning, RTAnews",
author = "Bassam Al-Salemi and Masri Ayob and Graham Kendall and {Mohd Noah}, {Shahrul Azman}",
year = "2019",
month = "1",
day = "1",
doi = "10.1016/j.ipm.2018.09.008",
language = "English",
volume = "56",
pages = "212--227",
journal = "Information Processing and Management",
issn = "0306-4573",
publisher = "Elsevier Limited",
number = "1",
}

TY - JOUR
T1 - Multi-label Arabic text categorization
T2 - A benchmark and baseline comparison of multi-label learning algorithms
AU - Al-Salemi, Bassam
AU - Ayob, Masri
AU - Kendall, Graham
AU - Mohd Noah, Shahrul Azman
PY - 2019/1/1
Y1 - 2019/1/1
AB - Multi-label text categorization refers to the problem of assigning each document to a subset of categories by means of multi-label learning algorithms. Unlike English and most other languages, the unavailability of Arabic benchmark datasets prevents evaluating multi-label learning algorithms for Arabic text categorization. As a result, only a few recent studies have dealt with multi-label Arabic text categorization on non-benchmark and inaccessible datasets. Therefore, this work aims to promote multi-label Arabic text categorization through (a) introducing “RTAnews”, a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks. The benchmark is publicly available in several formats compatible with the existing multi-label learning tools, such as MEKA and Mulan. (b) Conducting an extensive comparison of most of the well-known multi-label learning algorithms for Arabic text categorization in order to have baseline results and show the effectiveness of these algorithms for Arabic text categorization on RTAnews. The evaluation involves four multi-label transformation-based algorithms: Binary Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison and Label Powerset, with three base learners (Support Vector Machine, k-Nearest-Neighbors and Random Forest); and four adaptation-based algorithms (Multi-label kNN, Instance-Based Learning by Logistic Regression Multi-label, Binary Relevance kNN and RFBoost). The reported baseline results show that both RFBoost and Label Powerset with Support Vector Machine as base learner outperformed other compared algorithms. Results also demonstrated that adaptation-based algorithms are faster than transformation-based algorithms.

KW - Arabic text categorization
KW - Multi-label benchmark
KW - Multi-label learning
KW - RTAnews
UR - http://www.scopus.com/inward/record.url?scp=85055113600&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055113600&partnerID=8YFLogxK
U2 - 10.1016/j.ipm.2018.09.008
DO - 10.1016/j.ipm.2018.09.008
M3 - Article
AN - SCOPUS:85055113600
VL - 56
SP - 212
EP - 227
JO - Information Processing and Management
JF - Information Processing and Management
SN - 0306-4573
IS - 1
ER -