Comparative analysis of ML POS on Arabic tweets

Mustafa Abdulkareem, Sabrina Tiun

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Social media text such as tweets is one of the challenges of natural language processing. In contrast to highly edited genres (standard language), for which traditional NLP tools were developed, conversational text contains many syntactic patterns and non-standard lexical items. These result from dialectal variation, topic diversity, orthography, unintended errors, conversational errors and creative language use. Because Twitter text is characterized by idiosyncratic style, noise and linguistic errors, it is difficult to part-of-speech (POS) tag. The aim of this paper is to design and implement part-of-speech tagging models for Arabic tweets by investigating several machine learning models, such as K-Nearest Neighbour, Naïve Bayes and Decision Tree. The paper introduces a novel Arabic Twitter corpus and assesses various state-of-the-art POS taggers that were retrained on this corpus. A state-of-the-art accuracy of 87.97% is achieved when tagging Twitter text.

Original language: English
Pages (from-to): 403-411
Number of pages: 9
Journal: Journal of Theoretical and Applied Information Technology
Volume: 95
Issue number: 2
Publication status: Published - 31 Jan 2017
Externally published: Yes

Fingerprint

Comparative Analysis
Tagging
Social Media
Syntactics
Bayes
Decision trees
Linguistics
Decision tree
Natural Language
Learning systems
Nearest Neighbor
Machine Learning
Model
Processing
Text
Language
Corpus
Speech

Keywords

  • Arabic part of speech tagging
  • Arabic tweets classification
  • Feature extraction

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Comparative analysis of ML POS on Arabic tweets. / Abdulkareem, Mustafa; Tiun, Sabrina.

In: Journal of Theoretical and Applied Information Technology, Vol. 95, No. 2, 31.01.2017, p. 403-411.

@article{9cc6191b0b3946d9a0ed6961a3284713,
title = "Comparative analysis of ML POS on Arabic tweets",
abstract = "Social media text such as tweets is one of the challenges of natural language processing. In contrast to highly edited genres (standard language), for which traditional NLP tools were developed, conversational text contains many syntactic patterns and non-standard lexical items. These result from dialectal variation, topic diversity, orthography, unintended errors, conversational errors and creative language use. Because Twitter text is characterized by idiosyncratic style, noise and linguistic errors, it is difficult to part-of-speech (POS) tag. The aim of this paper is to design and implement part-of-speech tagging models for Arabic tweets by investigating several machine learning models, such as K-Nearest Neighbour, Na{\"i}ve Bayes and Decision Tree. The paper introduces a novel Arabic Twitter corpus and assesses various state-of-the-art POS taggers that were retrained on this corpus. A state-of-the-art accuracy of 87.97{\%} is achieved when tagging Twitter text.",
keywords = "Arabic part of speech tagging, Arabic tweets classification, Feature extraction",
author = "Mustafa Abdulkareem and Sabrina Tiun",
year = "2017",
month = "1",
day = "31",
language = "English",
volume = "95",
pages = "403--411",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "2",

}

TY - JOUR

T1 - Comparative analysis of ML POS on Arabic tweets

AU - Abdulkareem, Mustafa

AU - Tiun, Sabrina

PY - 2017/1/31

Y1 - 2017/1/31

AB - Social media text such as tweets is one of the challenges of natural language processing. In contrast to highly edited genres (standard language), for which traditional NLP tools were developed, conversational text contains many syntactic patterns and non-standard lexical items. These result from dialectal variation, topic diversity, orthography, unintended errors, conversational errors and creative language use. Because Twitter text is characterized by idiosyncratic style, noise and linguistic errors, it is difficult to part-of-speech (POS) tag. The aim of this paper is to design and implement part-of-speech tagging models for Arabic tweets by investigating several machine learning models, such as K-Nearest Neighbour, Naïve Bayes and Decision Tree. The paper introduces a novel Arabic Twitter corpus and assesses various state-of-the-art POS taggers that were retrained on this corpus. A state-of-the-art accuracy of 87.97% is achieved when tagging Twitter text.

KW - Arabic part of speech tagging

KW - Arabic tweets classification

KW - Feature extraction

UR - http://www.scopus.com/inward/record.url?scp=85011698040&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85011698040&partnerID=8YFLogxK

M3 - Article

VL - 95

SP - 403

EP - 411

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 2

ER -