Arabic nested noun compound extraction based on linguistic features and statistical measures

Nazlia Omar, Qasem Al-Tashi

Research output: Contribution to journalArticle

Abstract

The extraction of Arabic nested noun compound is significant for several research areas such as sentiment analysis, text summarization, word categorization, grammar checker, and machine translation. Much research has studied the extraction of Arabic noun compound using linguistic approaches, statistical methods, or a hybrid of both. A wide range of the existing approaches concentrate on the extraction of the bi-gram or tri-gram noun compound. Nonetheless, extracting a 4-gram or 5-gram nested noun compound is a challenging task due to the morphological, orthographic, syntactic and semantic variations. Many features have an important effect on the efficiency of extracting a noun compound such as unit-hood, contextual information, and term-hood. Hence, there is a need to improve the effectiveness of the Arabic nested noun compound extraction. Thus, this paper proposes a hybrid linguistic approach and a statistical method with a view to enhance the extraction of the Arabic nested noun compound. A number of pre-processing phases are presented, including transformation, tokenization, and normalisation. The linguistic approaches that have been used in this study consist of a part-of-speech tagging and the named entities pattern, whereas the proposed statistical methods that have been used in this study consist of the NC-value, NTC-value, NLC-value, and the combination of these association measures. The proposed methods have demonstrated that the combined association measures have outperformed the NLC-value, NTC-value, and NC-value in terms of nested noun compound extraction by achieving 90%, 88%, 87%, and 81% for bigram, trigram, 4-gram, and 5-gram, respectively.

Original languageEnglish
Pages (from-to)93-107
Number of pages15
JournalGEMA Online Journal of Language Studies
Volume18
Issue number2
DOIs
Publication statusPublished - 1 May 2018

Fingerprint

linguistics
statistical method
Values
text analysis
normalization
Linguistic Features
Noun Compounds
grammar
semantics
efficiency
Statistical Methods

Keywords

  • Arabic multi-word expressions
  • Association measures
  • Nested noun compound
  • Noun compound
  • POS tagging

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Literature and Literary Theory

Cite this

Arabic nested noun compound extraction based on linguistic features and statistical measures. / Omar, Nazlia; Al-Tashi, Qasem.

In: GEMA Online Journal of Language Studies, Vol. 18, No. 2, 01.05.2018, p. 93-107.

Research output: Contribution to journalArticle

@article{34ccd92151e04b6dab68b8b9e9795e0c,
title = "Arabic nested noun compound extraction based on linguistic features and statistical measures",
abstract = "The extraction of Arabic nested noun compound is significant for several research areas such as sentiment analysis, text summarization, word categorization, grammar checker, and machine translation. Much research has studied the extraction of Arabic noun compound using linguistic approaches, statistical methods, or a hybrid of both. A wide range of the existing approaches concentrate on the extraction of the bi-gram or tri-gram noun compound. Nonetheless, extracting a 4-gram or 5-gram nested noun compound is a challenging task due to the morphological, orthographic, syntactic and semantic variations. Many features have an important effect on the efficiency of extracting a noun compound such as unit-hood, contextual information, and term-hood. Hence, there is a need to improve the effectiveness of the Arabic nested noun compound extraction. Thus, this paper proposes a hybrid linguistic approach and a statistical method with a view to enhance the extraction of the Arabic nested noun compound. A number of pre-processing phases are presented, including transformation, tokenization, and normalisation. The linguistic approaches that have been used in this study consist of a part-of-speech tagging and the named entities pattern, whereas the proposed statistical methods that have been used in this study consist of the NC-value, NTC-value, NLC-value, and the combination of these association measures. The proposed methods have demonstrated that the combined association measures have outperformed the NLC-value, NTC-value, and NC-value in terms of nested noun compound extraction by achieving 90{\%}, 88{\%}, 87{\%}, and 81{\%} for bigram, trigram, 4-gram, and 5-gram, respectively.",
keywords = "Arabic multi-word expressions, Association measures, Nested noun compound, Noun compound, POS tagging",
author = "Nazlia Omar and Qasem Al-Tashi",
year = "2018",
month = "5",
day = "1",
doi = "10.17576/gema-2018-1802-07",
language = "English",
volume = "18",
pages = "93--107",
journal = "GEMA Online Journal of Language Studies",
issn = "1675-8021",
publisher = "Universiti Kebangsaan Malaysia",
number = "2",

}

TY - JOUR

T1 - Arabic nested noun compound extraction based on linguistic features and statistical measures

AU - Omar, Nazlia

AU - Al-Tashi, Qasem

PY - 2018/5/1

Y1 - 2018/5/1

N2 - The extraction of Arabic nested noun compound is significant for several research areas such as sentiment analysis, text summarization, word categorization, grammar checker, and machine translation. Much research has studied the extraction of Arabic noun compound using linguistic approaches, statistical methods, or a hybrid of both. A wide range of the existing approaches concentrate on the extraction of the bi-gram or tri-gram noun compound. Nonetheless, extracting a 4-gram or 5-gram nested noun compound is a challenging task due to the morphological, orthographic, syntactic and semantic variations. Many features have an important effect on the efficiency of extracting a noun compound such as unit-hood, contextual information, and term-hood. Hence, there is a need to improve the effectiveness of the Arabic nested noun compound extraction. Thus, this paper proposes a hybrid linguistic approach and a statistical method with a view to enhance the extraction of the Arabic nested noun compound. A number of pre-processing phases are presented, including transformation, tokenization, and normalisation. The linguistic approaches that have been used in this study consist of a part-of-speech tagging and the named entities pattern, whereas the proposed statistical methods that have been used in this study consist of the NC-value, NTC-value, NLC-value, and the combination of these association measures. The proposed methods have demonstrated that the combined association measures have outperformed the NLC-value, NTC-value, and NC-value in terms of nested noun compound extraction by achieving 90%, 88%, 87%, and 81% for bigram, trigram, 4-gram, and 5-gram, respectively.

AB - The extraction of Arabic nested noun compound is significant for several research areas such as sentiment analysis, text summarization, word categorization, grammar checker, and machine translation. Much research has studied the extraction of Arabic noun compound using linguistic approaches, statistical methods, or a hybrid of both. A wide range of the existing approaches concentrate on the extraction of the bi-gram or tri-gram noun compound. Nonetheless, extracting a 4-gram or 5-gram nested noun compound is a challenging task due to the morphological, orthographic, syntactic and semantic variations. Many features have an important effect on the efficiency of extracting a noun compound such as unit-hood, contextual information, and term-hood. Hence, there is a need to improve the effectiveness of the Arabic nested noun compound extraction. Thus, this paper proposes a hybrid linguistic approach and a statistical method with a view to enhance the extraction of the Arabic nested noun compound. A number of pre-processing phases are presented, including transformation, tokenization, and normalisation. The linguistic approaches that have been used in this study consist of a part-of-speech tagging and the named entities pattern, whereas the proposed statistical methods that have been used in this study consist of the NC-value, NTC-value, NLC-value, and the combination of these association measures. The proposed methods have demonstrated that the combined association measures have outperformed the NLC-value, NTC-value, and NC-value in terms of nested noun compound extraction by achieving 90%, 88%, 87%, and 81% for bigram, trigram, 4-gram, and 5-gram, respectively.

KW - Arabic multi-word expressions

KW - Association measures

KW - Nested noun compound

KW - Noun compound

KW - POS tagging

UR - http://www.scopus.com/inward/record.url?scp=85047951005&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85047951005&partnerID=8YFLogxK

U2 - 10.17576/gema-2018-1802-07

DO - 10.17576/gema-2018-1802-07

M3 - Article

AN - SCOPUS:85047951005

VL - 18

SP - 93

EP - 107

JO - GEMA Online Journal of Language Studies

JF - GEMA Online Journal of Language Studies

SN - 1675-8021

IS - 2

ER -