The effectiveness of bottom up technique with probabilistic approach for a Malay parser

Muhammad Azhar Fairuzz Hiloh, Mohd Juzaiddin Ab Aziz, Lailatul Qadri Zakaria

Research output: Contribution to journal › Article

Abstract

Parsing is the process of analyzing the input string of a sentence to determine its syntactic structure according to the rules of a grammar. This task is performed by a parser, which produces a parse tree as output. However, a problem occurs when the parsing process produces two or more parse trees and the parser is unable to settle on a single precise parse tree. This limitation is caused by ambiguity in the structure of sentences. Ambiguity occurs when a word can be classified under more than one syntactic category and its usage affects the semantics of the sentence. Thus, the parser needs an approach that resolves the ambiguity and selects the most appropriate parse tree to represent a sentence. Like other languages, Malay, the national language of Malaysia, is not exempt from the ambiguity problem. However, because its grammar is context-free, the probabilistic context-free grammar approach can be used to support the parser in determining a more accurate parse tree. This study focuses on the development of a statistical parser using a bottom-up technique for the Malay language. The training data, in the form of simple Malay sentences, were collected from various sources. Based on this training data, a statistical lexical corpus of Malay consisting of vocabulary, grammar rules and their probabilities was developed. Bottom-up parsing is implemented with the Cocke–Younger–Kasami (CYK) algorithm. The parser’s performance is evaluated on how effectively it overcomes ambiguity by suggesting a more precise parse tree. In conclusion, the Malay Language Parser can help users identify the appropriate parse tree and resolve ambiguity issues in Malay.
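
The approach described in the abstract can be illustrated with a minimal sketch of probabilistic CYK parsing. This is not the authors' implementation: the grammar, the vocabulary, the rule probabilities, and the function name `cyk_best_parse` are all assumptions made up for illustration, and the grammar is assumed to be in Chomsky normal form. The toy sentence "saya lihat orang dengan teropong" has two possible trees (the final phrase can attach to the verb phrase or to the noun phrase), and the sketch keeps only the most probable one.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form. Every rule and probability here is
# invented for illustration; none is taken from the paper's corpus.
BINARY = {  # (left child, right child) -> [(parent, probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 0.6)],
    ("VP", "PP"): [("VP", 0.4)],
    ("NP", "PP"): [("NP", 0.2)],
    ("P", "NP"): [("PP", 1.0)],
}
LEXICAL = {  # word -> [(category, probability)]
    "saya": [("NP", 0.3)],
    "orang": [("NP", 0.3)],
    "teropong": [("NP", 0.2)],
    "lihat": [("V", 1.0)],
    "dengan": [("P", 1.0)],
}


def cyk_best_parse(words, start="S"):
    """Return (probability, tree) for the most probable parse, or None."""
    n = len(words)
    # best[(i, j)][A] = (probability, tree) of the best A covering words[i:j]
    best = defaultdict(dict)

    # Bottom layer: assign lexical categories to single words.
    for i, word in enumerate(words):
        for cat, p in LEXICAL.get(word, []):
            best[(i, i + 1)][cat] = (p, (cat, word))

    # Build longer spans from shorter ones (the bottom-up step).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for b, (pb, tb) in best[(i, k)].items():
                    for c, (pc, tc) in best[(k, j)].items():
                        for a, p in BINARY.get((b, c), []):
                            prob = p * pb * pc
                            # Keep only the most probable tree per category.
                            if prob > best[(i, j)].get(a, (0.0, None))[0]:
                                best[(i, j)][a] = (prob, (a, tb, tc))

    return best[(0, n)].get(start)


if __name__ == "__main__":
    print(cyk_best_parse("saya lihat orang dengan teropong".split()))
```

With these assumed probabilities, the verb-phrase attachment wins because its rule (0.4) outweighs the noun-phrase attachment rule (0.2) once the rest of the derivation is multiplied in. Changing the probabilities changes which tree is chosen, which is how a probabilistic grammar estimated from training data would resolve ambiguity.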

Original language: English
Pages (from-to): 124-133
Number of pages: 10
Journal: GEMA Online Journal of Language Studies
Volume: 18
Issue number: 2
DOI: 10.17576/gema-2018-1802-09
Publication status: Published - 1 May 2018

Keywords

  • Ambiguity
  • CYK
  • Malay language
  • Parsing technique
  • Probabilistic context-free grammar

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Literature and Literary Theory

Cite this

The effectiveness of bottom up technique with probabilistic approach for a Malay parser. / Fairuzz Hiloh, Muhammad Azhar; Ab Aziz, Mohd Juzaiddin; Zakaria, Lailatul Qadri.

In: GEMA Online Journal of Language Studies, Vol. 18, No. 2, 01.05.2018, p. 124-133.

@article{3a91b8bb1eaa49fd8a43c8b2ddbc8988,
title = "The effectiveness of bottom up technique with probabilistic approach for a {Malay} parser",
keywords = "Ambiguity, CYK, Malay language, Parsing technique, Probabilistic context-free grammar",
author = "{Fairuzz Hiloh}, {Muhammad Azhar} and {Ab Aziz}, {Mohd Juzaiddin} and Zakaria, {Lailatul Qadri}",
year = "2018",
month = "5",
day = "1",
doi = "10.17576/gema-2018-1802-09",
language = "English",
volume = "18",
pages = "124--133",
journal = "GEMA Online Journal of Language Studies",
issn = "1675-8021",
publisher = "Universiti Kebangsaan Malaysia",
number = "2",

}

TY - JOUR

T1 - The effectiveness of bottom up technique with probabilistic approach for a Malay parser

AU - Fairuzz Hiloh, Muhammad Azhar

AU - Ab Aziz, Mohd Juzaiddin

AU - Zakaria, Lailatul Qadri

PY - 2018/5/1

Y1 - 2018/5/1

KW - Ambiguity

KW - CYK

KW - Malay language

KW - Parsing technique

KW - Probabilistic context-free grammar

UR - http://www.scopus.com/inward/record.url?scp=85047903574&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85047903574&partnerID=8YFLogxK

U2 - 10.17576/gema-2018-1802-09

DO - 10.17576/gema-2018-1802-09

M3 - Article

AN - SCOPUS:85047903574

VL - 18

SP - 124

EP - 133

JO - GEMA Online Journal of Language Studies

JF - GEMA Online Journal of Language Studies

SN - 1675-8021

IS - 2

ER -