Word sense disambiguation based on Yarowsky approach in English Quranic information retrieval system

Omar Jamal Mohamed, Sabrina Tiun

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Word sense disambiguation (WSD) is the process of resolving the ambiguity of a word by identifying its exact sense. In natural languages, many words can carry multiple meanings depending on the context, and WSD aims to identify the most accurate sense in such cases. In particular, translating from one language to another can introduce ambiguity among the translated words. The Quran, the holy book of approximately a billion Muslims, was originally written in Arabic. When it is translated into English, researchers have observed several semantic issues. These issues stem from ambiguous words, such as those translated into ‘day and night’ and ‘judgment day’. Such ambiguity has to be eliminated by determining the exact sense of the translated word. Several research efforts have attempted to disambiguate the senses of the translated Quran. However, identifying an appropriate WSD method for the translated Quran remains a challenging task because of the complexity of Arabic morphology. Hence, this study proposes an adaptation of the Yarowsky algorithm as a WSD method for Quranic translation. In addition, it develops an IR prototype based on the proposed adaptation in order to evaluate the method in terms of retrieval effectiveness. The dataset used in this study is a collection of Quranic content. Several pre-processing tasks are performed to eliminate irrelevant data such as stop-words, numbers and punctuation. Subsequently, two lists of senses, together with their contexts, are created for each ambiguous word so that the Yarowsky algorithm can train on this example set. The Yarowsky algorithm then constructs a decision list that gives the sense label of each word.
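The pre-processing and decision-list steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the smoothing constant `alpha`, the function names and the four toy sentences are all assumptions made for illustration (the sentences are invented, not Quranic quotations). It also covers only the supervised decision-list step; the iterative bootstrapping of the full Yarowsky algorithm is omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the study's actual list is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "for", "will", "be", "by"}

def preprocess(text):
    """Lowercase, drop punctuation and numbers, then remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def train_decision_list(examples, target, alpha=0.1):
    """Rank context words by smoothed absolute log-likelihood ratio.

    examples: (sentence, sense) pairs covering exactly two senses.
    Returns (score, word, sense) rules, strongest evidence first.
    """
    senses = sorted({sense for _, sense in examples})
    if len(senses) != 2:
        raise ValueError("this sketch handles exactly two senses")
    counts = defaultdict(Counter)
    for sentence, sense in examples:
        for word in set(preprocess(sentence)):
            if word != target:
                counts[word][sense] += 1
    rules = []
    for word, c in counts.items():
        a = c[senses[0]] + alpha  # smoothed count under the first sense
        b = c[senses[1]] + alpha  # smoothed count under the second sense
        llr = math.log(a / b)
        if llr != 0:
            rules.append((abs(llr), word, senses[0] if llr > 0 else senses[1]))
    # Sort by descending score, then alphabetically for a stable order.
    rules.sort(key=lambda r: (-r[0], r[1]))
    return rules

def disambiguate(sentence, rules, default=None):
    """Return the sense of the first (strongest) rule whose word occurs in the sentence."""
    context = set(preprocess(sentence))
    for _, word, sense in rules:
        if word in context:
            return sense
    return default

# Invented toy training sentences for the ambiguous word "day".
examples = [
    ("The day of judgment will come to all people", "judgment_day"),
    ("On that day every soul is judged", "judgment_day"),
    ("He made the night and the day", "day_and_night"),
    ("The sun rises in the day and rests at night", "day_and_night"),
]
rules = train_decision_list(examples, target="day")
print(disambiguate("What happens on the day of judgment", rules))
print(disambiguate("The night follows the day", rules))
```

Only the single strongest matching rule decides the sense, which is the defining property of a decision list; with so few examples per sense, most rules tie in score, which mirrors the data-sparsity problem the abstract reports.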
The evaluation uses three IR metrics: Precision, Recall and F-measure. The experimental results show an F-measure of 77%. This result appears weak compared to the results of the Yarowsky algorithm applied in the open domain, owing to the lack of examples that can be extracted from the Quran for both senses. Nevertheless, the result appears competitive for WSD of Quranic translation. Finally, it can be concluded that WSD has a significant impact on the IR system by improving retrieval effectiveness.
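For reference, the three metrics named above can be computed as in the following short Python sketch. The function name and the document identifiers are hypothetical and the numbers are not taken from the study; the F-measure here is the standard balanced harmonic mean of precision and recall, which is assumed to be the variant the abstract refers to.

```python
def precision_recall_f(retrieved, relevant):
    """Compute set-based precision, recall and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical query result: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```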

Original language: English
Pages (from-to): 163-171
Number of pages: 9
Journal: Journal of Theoretical and Applied Information Technology
Volume: 82
Issue number: 1
Publication status: Published - 10 Dec 2015

Keywords

  • Information retrieval
  • Natural language processing
  • Quran
  • Word sense disambiguation
  • Yarowsky algorithm

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Word sense disambiguation based on Yarowsky approach in English Quranic information retrieval system. / Mohamed, Omar Jamal; Tiun, Sabrina.

In: Journal of Theoretical and Applied Information Technology, Vol. 82, No. 1, 10.12.2015, p. 163-171.

@article{77eba16516424145bca4995021deef58,
title = "Word sense disambiguation based on {Yarowsky} approach in {English} {Quranic} information retrieval system",
keywords = "Information retrieval, Natural language processing, Quran, Word sense disambiguation, Yarowsky algorithm",
author = "Mohamed, {Omar Jamal} and Sabrina Tiun",
year = "2015",
month = "12",
day = "10",
language = "English",
volume = "82",
pages = "163--171",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "1",

}

TY - JOUR

T1 - Word sense disambiguation based on Yarowsky approach in English Quranic information retrieval system

AU - Mohamed, Omar Jamal

AU - Tiun, Sabrina

PY - 2015/12/10

Y1 - 2015/12/10

KW - Information retrieval

KW - Natural language processing

KW - Quran

KW - Word sense disambiguation

KW - Yarowsky algorithm

UR - http://www.scopus.com/inward/record.url?scp=84949505454&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84949505454&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84949505454

VL - 82

SP - 163

EP - 171

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 1

ER -