Automatic extraction of Malay compound nouns using a hybrid of statistical and machine learning methods

Muneer A S Hazaa, Nazlia Omar, Fadl Mutaher Ba-Alwi, Mohammed Albared

Research output: Contribution to journal, Article

8 Citations (Scopus)

Abstract

Identifying compound nouns is important for a wide range of natural language processing applications, such as machine translation and information retrieval. Extracting compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present empirical results for sixteen statistical association measures applied to the extraction of Malay N+N compound nouns. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard dataset intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The dataset contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference dataset. The experimental results demonstrate that a group of association measures (t-test, Piatetsky-Shapiro (PS), C-value, FGM, and the rank combination method) outperforms the other association measures for N+N collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. The evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results are quite satisfactory in terms of precision, recall, and F-score.
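
As a rough illustration of two of the association measures named in the abstract, the sketch below scores N+N candidates with the textbook t-score and Piatetsky-Shapiro formulas and then merges the two rankings by mean rank. The abstract does not give the paper's exact formulas or its rank combination rule, so the mean-rank scheme, the function names, and the toy frequency counts are all assumptions made for illustration only.

```python
from math import sqrt


def t_score(f_xy, f_x, f_y, n):
    """Textbook t-score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / sqrt(f_xy)


def piatetsky_shapiro(f_xy, f_x, f_y, n):
    """Piatetsky-Shapiro: P(xy) - P(x) * P(y)."""
    return f_xy / n - (f_x / n) * (f_y / n)


def rank_positions(scores):
    """Map each candidate to its 1-based rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}


def combine_by_mean_rank(*score_tables):
    """Order candidates by their average rank across all score tables.

    This is one simple way to combine measures (an assumption here),
    not necessarily the combination used in the paper.
    """
    ranks = [rank_positions(table) for table in score_tables]
    candidates = list(score_tables[0])
    mean_rank = {c: sum(r[c] for r in ranks) / len(ranks) for c in candidates}
    return sorted(candidates, key=mean_rank.get)


# Hypothetical counts: (bigram frequency, first-noun frequency, second-noun frequency).
N = 10_000  # hypothetical total number of bigrams in the corpus
counts = {
    ("kereta", "api"): (30, 60, 50),
    ("air", "mata"): (20, 200, 40),
    ("orang", "rumah"): (2, 300, 200),
}

t = {c: t_score(*v, N) for c, v in counts.items()}
ps = {c: piatetsky_shapiro(*v, N) for c, v in counts.items()}
ranking = combine_by_mean_rank(t, ps)

for cand in ranking:
    print(" ".join(cand), round(t[cand], 3), round(ps[cand], 5))
```

Candidates near the top of the combined ranking would be proposed as compound nouns. In the classification setting described in the abstract, the per-measure scores would instead serve as features for a trained classifier.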

Original language: English
Pages (from-to): 925-935
Number of pages: 11
Journal: International Journal of Electrical and Computer Engineering
Volume: 6
Issue number: 3
DOI: 10.11591/ijece.v6i3.9663
Publication status: Published - 1 Jun 2016


Keywords

  • Association Measures
  • Classification Algorithms
  • Compound Nouns
  • Malay Language

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Computer Science (all)

Cite this

Automatic extraction of Malay compound nouns using a hybrid of statistical and machine learning methods. / Hazaa, Muneer A S; Omar, Nazlia; Ba-Alwi, Fadl Mutaher; Albared, Mohammed.

In: International Journal of Electrical and Computer Engineering, Vol. 6, No. 3, 01.06.2016, p. 925-935.


@article{f96f7f3e88f14a76ae3391f6fb992cdd,
title = "Automatic extraction of Malay compound nouns using a hybrid of statistical and machine learning methods",
abstract = "Identifying compound nouns is important for a wide range of natural language processing applications, such as machine translation and information retrieval. Extracting compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present empirical results for sixteen statistical association measures applied to the extraction of Malay N+N compound nouns. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard dataset intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The dataset contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference dataset. The experimental results demonstrate that a group of association measures (t-test, Piatetsky-Shapiro (PS), C-value, FGM, and the rank combination method) outperforms the other association measures for N+N collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. The evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results are quite satisfactory in terms of precision, recall, and F-score.",
keywords = "Association Measures, Classification Algorithms, Compound Nouns, Malay Language",
author = "Hazaa, {Muneer A S} and Nazlia Omar and Ba-Alwi, {Fadl Mutaher} and Mohammed Albared",
year = "2016",
month = "6",
day = "1",
doi = "10.11591/ijece.v6i3.9663",
language = "English",
volume = "6",
pages = "925--935",
journal = "International Journal of Electrical and Computer Engineering",
issn = "2088-8708",
publisher = "Institute of Advanced Engineering and Science (IAES)",
number = "3",
}

TY - JOUR
T1 - Automatic extraction of Malay compound nouns using a hybrid of statistical and machine learning methods
AU - Hazaa, Muneer A S
AU - Omar, Nazlia
AU - Ba-Alwi, Fadl Mutaher
AU - Albared, Mohammed
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Identifying compound nouns is important for a wide range of natural language processing applications, such as machine translation and information retrieval. Extracting compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present empirical results for sixteen statistical association measures applied to the extraction of Malay N+N compound nouns. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard dataset intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The dataset contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference dataset. The experimental results demonstrate that a group of association measures (t-test, Piatetsky-Shapiro (PS), C-value, FGM, and the rank combination method) outperforms the other association measures for N+N collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. The evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results are quite satisfactory in terms of precision, recall, and F-score.
KW - Association Measures
KW - Classification Algorithms
KW - Compound Nouns
KW - Malay Language
UR - http://www.scopus.com/inward/record.url?scp=84979205194&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84979205194&partnerID=8YFLogxK
U2 - 10.11591/ijece.v6i3.9663
DO - 10.11591/ijece.v6i3.9663
M3 - Article
AN - SCOPUS:84979205194
VL - 6
SP - 925
EP - 935
JO - International Journal of Electrical and Computer Engineering
JF - International Journal of Electrical and Computer Engineering
SN - 2088-8708
IS - 3
ER -