Part-of-speech tagger for Malay social media texts

Siti Noor Allia Noor Ariffin, Sabrina Tiun

Research output: Contribution to journal › Article

Abstract

Processing the meaning of words in social media texts, such as tweets, is challenging in natural language processing. Malay tweets are no exception: they exhibit distinct linguistic phenomena, such as the use of dialects from each state in Malaysia, the borrowing of foreign-language terms into Malay, mixed languages, abbreviations, spelling errors and mistakes in sentence structure. Tagging the word class of tweets is an arduous task because tweets are characterised by their distinctive style, linguistic sounds and errors. Existing work on Malay part-of-speech (POS) tagging is based only on standard Malay and formal texts and is thus unsuitable for tagging tweets. A POS tagging model for non-standardised Malay tweets must therefore be developed. This study aims to design and implement a non-standardised Malay POS model for tweets and to assess it on the basis of word tagging accuracy on test data of unnormalised and normalised tweet texts. A solution that adopts a probabilistic POS tagger called QTAG is proposed. Results show that the Malay QTAG achieves best average POS tagging accuracies of 90% and 88.8% for the normalised and unnormalised test datasets, respectively.
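The evaluation measure named in the abstract, word tagging accuracy, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `tagging_accuracy`, the example tokens and the tag labels (`PRON`, `VERB`, `NOUN`) are assumptions chosen for illustration, and the abstract does not specify the tagset used in the study.

```python
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def tagging_accuracy(gold: List[TaggedSentence],
                     predicted: List[TaggedSentence]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentences must align token by token")
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


# Hypothetical data: tokens and tag labels are illustrative only.
gold = [[("saya", "PRON"), ("suka", "VERB"), ("makan", "VERB"), ("nasi", "NOUN")]]
pred = [[("saya", "PRON"), ("suka", "VERB"), ("makan", "NOUN"), ("nasi", "NOUN")]]

accuracy = tagging_accuracy(gold, pred)
print(f"{accuracy:.1%}")  # prints 75.0% (3 of 4 tags match)
```

Scoring the same function separately on a normalised and an unnormalised test set would give two figures comparable in kind to the two accuracies reported above.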

Original language: English
Pages (from-to): 124-142
Number of pages: 19
Journal: GEMA Online Journal of Language Studies
Volume: 18
Issue number: 4
DOI: 10.17576/gema-2018-1804-09
Publication status: Published - 1 Nov 2018

Fingerprint

social media
language
linguistics
dialect
foreign language
Malaysia
Part of Speech
Tagging
Tag
Part-of-speech Tagging

Keywords

  • Informal Malay text
  • Malay POS tagger
  • Malay tweet
  • Part-of-speech
  • QTAG

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Literature and Literary Theory

Cite this

Part-of-speech tagger for Malay social media texts. / Ariffin, Siti Noor Allia Noor; Tiun, Sabrina.

In: GEMA Online Journal of Language Studies, Vol. 18, No. 4, 01.11.2018, p. 124-142.

@article{668184426ba845638fe39dc66415b98f,
title = "Part-of-speech tagger for {Malay} social media texts",
keywords = "Informal Malay text, Malay POS tagger, Malay tweet, Part-of-speech, QTAG",
author = "Ariffin, {Siti Noor Allia Noor} and Sabrina Tiun",
year = "2018",
month = "11",
day = "1",
doi = "10.17576/gema-2018-1804-09",
language = "English",
volume = "18",
pages = "124--142",
journal = "GEMA Online Journal of Language Studies",
issn = "1675-8021",
publisher = "Universiti Kebangsaan Malaysia",
number = "4",

}

TY - JOUR

T1 - Part-of-speech tagger for Malay social media texts

AU - Ariffin, Siti Noor Allia Noor

AU - Tiun, Sabrina

PY - 2018/11/1

Y1 - 2018/11/1

KW - Informal Malay text

KW - Malay POS tagger

KW - Malay tweet

KW - Part-of-speech

KW - QTAG

UR - http://www.scopus.com/inward/record.url?scp=85057754576&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85057754576&partnerID=8YFLogxK

U2 - 10.17576/gema-2018-1804-09

DO - 10.17576/gema-2018-1804-09

M3 - Article

VL - 18

SP - 124

EP - 142

JO - GEMA Online Journal of Language Studies

JF - GEMA Online Journal of Language Studies

SN - 1675-8021

IS - 4

ER -