Extraction of Arabic nested noun compounds based on a hybrid method of linguistic approach and statistical methods

Maryam Al-Mashhadani, Nazlia Omar

Research output: Contribution to journal, Article

7 Citations (Scopus)

Abstract

Arabic noun compounds are phrases that consist of two or more nouns. The key characteristic of noun compounds lies in their frequent occurrence within the text. Therefore, extracting these noun compounds is essential for several domains of research such as Information Retrieval, Sentiment Analysis and Question Answering. Many methods have been proposed for extracting Arabic noun compounds using linguistic approaches, statistical measures or a combination of both. Most of the existing methods have concentrated on the extraction of bi-gram or tri-gram noun compounds. However, extracting 4-gram or 5-gram noun compounds or nested noun compounds is challenging due to the difficulty of selecting an appropriate method with effective results. Multiple features, such as contextual information, unit-hood and term-hood, have a significant impact on the effectiveness of noun compound extraction. Thus, there is still room for improvement in the effectiveness of nested noun compound extraction. Therefore, this study proposes a combination of a linguistic approach and statistical measures to enhance the extraction of nested noun compounds. Several preprocessing steps are involved, including transformation, normalization, tokenization, and stemming. The linguistic approach used in this study is Part-of-Speech tagging. In addition, a new linguistic pattern for named entities has been applied, using a list of Arabic named entities, to enhance the linguistic approach to noun compound recognition. The proposed statistical measures consist of NC-value, NTC-value and NLC-value. The experimental results demonstrate that NLC-value outperforms NTC-value and NC-value in nested noun compound extraction, achieving 83%, 76%, 72% and 65% for bi-gram, tri-gram, 4-gram and 5-gram compounds, respectively.

Original language: English
Pages (from-to): 408-416
Number of pages: 9
Journal: Journal of Theoretical and Applied Information Technology
Volume: 76
Issue number: 3
Publication status: Published - 30 Jun 2015

Fingerprint

Hybrid Method
Linguistics
Statistical method
Sentiment Analysis
Question Answering
Tagging
Information retrieval
Normalization
Preprocessing
Unit
Experimental Results
Term

Keywords

  • Multi-word expressions
  • Nested noun compound
  • Noun compound
  • POS tagging
  • Statistical method

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

@article{7c24689a197945a48ff39ded4c01787b,
title = "Extraction of {A}rabic nested noun compounds based on a hybrid method of linguistic approach and statistical methods",
abstract = "Arabic noun compounds are phrases that consist of two or more nouns. The key characteristic of noun compounds lies in their frequent occurrence within the text. Therefore, extracting these noun compounds is essential for several domains of research such as Information Retrieval, Sentiment Analysis and Question Answering. Many methods have been proposed for extracting Arabic noun compounds using linguistic approaches, statistical measures or a combination of both. Most of the existing methods have concentrated on the extraction of bi-gram or tri-gram noun compounds. However, extracting 4-gram or 5-gram noun compounds or nested noun compounds is challenging due to the difficulty of selecting an appropriate method with effective results. Multiple features, such as contextual information, unit-hood and term-hood, have a significant impact on the effectiveness of noun compound extraction. Thus, there is still room for improvement in the effectiveness of nested noun compound extraction. Therefore, this study proposes a combination of a linguistic approach and statistical measures to enhance the extraction of nested noun compounds. Several preprocessing steps are involved, including transformation, normalization, tokenization, and stemming. The linguistic approach used in this study is Part-of-Speech tagging. In addition, a new linguistic pattern for named entities has been applied, using a list of Arabic named entities, to enhance the linguistic approach to noun compound recognition. The proposed statistical measures consist of NC-value, NTC-value and NLC-value. The experimental results demonstrate that NLC-value outperforms NTC-value and NC-value in nested noun compound extraction, achieving 83{\%}, 76{\%}, 72{\%} and 65{\%} for bi-gram, tri-gram, 4-gram and 5-gram compounds, respectively.",
keywords = "Multi-word expressions, Nested noun compound, Noun compound, POS tagging, Statistical method",
author = "Maryam Al-Mashhadani and Nazlia Omar",
year = "2015",
month = "6",
day = "30",
language = "English",
volume = "76",
pages = "408--416",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "3",

}

TY - JOUR

T1 - Extraction of Arabic nested noun compounds based on a hybrid method of linguistic approach and statistical methods

AU - Al-Mashhadani, Maryam

AU - Omar, Nazlia

PY - 2015/6/30

Y1 - 2015/6/30

N2 - Arabic noun compounds are phrases that consist of two or more nouns. The key characteristic of noun compounds lies in their frequent occurrence within the text. Therefore, extracting these noun compounds is essential for several domains of research such as Information Retrieval, Sentiment Analysis and Question Answering. Many methods have been proposed for extracting Arabic noun compounds using linguistic approaches, statistical measures or a combination of both. Most of the existing methods have concentrated on the extraction of bi-gram or tri-gram noun compounds. However, extracting 4-gram or 5-gram noun compounds or nested noun compounds is challenging due to the difficulty of selecting an appropriate method with effective results. Multiple features, such as contextual information, unit-hood and term-hood, have a significant impact on the effectiveness of noun compound extraction. Thus, there is still room for improvement in the effectiveness of nested noun compound extraction. Therefore, this study proposes a combination of a linguistic approach and statistical measures to enhance the extraction of nested noun compounds. Several preprocessing steps are involved, including transformation, normalization, tokenization, and stemming. The linguistic approach used in this study is Part-of-Speech tagging. In addition, a new linguistic pattern for named entities has been applied, using a list of Arabic named entities, to enhance the linguistic approach to noun compound recognition. The proposed statistical measures consist of NC-value, NTC-value and NLC-value. The experimental results demonstrate that NLC-value outperforms NTC-value and NC-value in nested noun compound extraction, achieving 83%, 76%, 72% and 65% for bi-gram, tri-gram, 4-gram and 5-gram compounds, respectively.

KW - Multi-word expressions

KW - Nested noun compound

KW - Noun compound

KW - POS tagging

KW - Statistical method

UR - http://www.scopus.com/inward/record.url?scp=84934764170&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84934764170&partnerID=8YFLogxK

M3 - Article

VL - 76

SP - 408

EP - 416

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 3

ER -