Building sense tagged corpus using wikipedia for supervised word sense disambiguation

Abdulgabbar Saif, Nazlia Omar, Ummi Zakiah Zainodin, Mohd Juzaiddin Ab Aziz

Research output: Contribution to journal › Conference article

1 Citation (Scopus)

Abstract

Building sense-tagged data is a main challenge for supervised techniques, which have achieved promising results in word sense disambiguation. Manually building sense-tagged data is a labor-intensive and time-consuming task because each ambiguous word has to be labeled by linguistic experts in the contexts collected for it. Therefore, this paper proposes a knowledge-based method for building an Arabic sense-tagged corpus from Wikipedia. The method starts by mapping Arabic WordNet to Wikipedia, selecting the Wikipedia article that corresponds to each sense in WordNet. In this mapping step, a cross-lingual method is used to measure the similarity between the features of a Wikipedia article and those of a WordNet sense. Then, the incoming links of Wikipedia articles are exploited to extract instances for each sense of every ambiguous word in WordNet. To handle articles with few instances in Wikipedia, a multiword-based technique is proposed to increase the number of instances per concept. Experimental results show that the cross-lingual method outperforms a monolingual method based on Arabic features only. The resulting sense-tagged corpus covers 50 ambiguous words, yielding 148 senses with 30,961 instances.
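The two-step pipeline the abstract describes (map each WordNet sense to its best-matching Wikipedia article by feature similarity, then harvest tagged instances from that article's incoming links) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data is English rather than Arabic, the similarity here is plain bag-of-words cosine rather than the paper's cross-lingual measure, and all names (`wordnet_senses`, `incoming_links`, etc.) are illustrative.

```python
from collections import Counter
from math import sqrt

def features(text: str) -> Counter:
    """Toy bag-of-words feature vector (stand-in for the paper's features)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Step 1: map each WordNet sense (gloss) to the most similar Wikipedia article.
wordnet_senses = {
    "bank#1": "sloping land beside a body of water",
    "bank#2": "financial institution that accepts deposits",
}
wiki_articles = {
    "Bank (geography)": "a bank is the land alongside a river or lake",
    "Bank": "a bank is a financial institution that accepts deposits and makes loans",
}
mapping = {
    sense: max(wiki_articles,
               key=lambda a: cosine(features(gloss), features(wiki_articles[a])))
    for sense, gloss in wordnet_senses.items()
}

# Step 2: collect instances from articles linking to the mapped article;
# each sentence containing the ambiguous word is tagged with that sense.
incoming_links = {
    "Bank (geography)": ["the river bank was covered in reeds"],
    "Bank": ["she deposited her salary at the bank"],
}
corpus = [(sentence, sense)
          for sense, article in mapping.items()
          for sentence in incoming_links[article]]
```

With this toy data, `mapping` pairs `bank#1` with "Bank (geography)" and `bank#2` with "Bank", and `corpus` holds one sense-tagged sentence per sense; the paper's multiword-based technique would then top up senses whose articles yield too few instances.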

Original language: English
Pages (from-to): 403-412
Number of pages: 10
Journal: Procedia Computer Science
Volume: 123
DOIs: 10.1016/j.procs.2018.01.062
Publication status: Published - 1 Jan 2018
Event: 8th Annual International Conference on Biologically Inspired Cognitive Architectures, BICA 2017 - Moscow, Russian Federation
Duration: 1 Aug 2017 - 6 Aug 2017


Keywords

  • Alignment
  • Arabic WordNet
  • Similarity Measures
  • Text Processing
  • Wikipedia
  • Word Sense Disambiguation

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Building sense tagged corpus using wikipedia for supervised word sense disambiguation. / Saif, Abdulgabbar; Omar, Nazlia; Zainodin, Ummi Zakiah; Ab Aziz, Mohd Juzaiddin.

In: Procedia Computer Science, Vol. 123, 01.01.2018, p. 403-412.

Research output: Contribution to journal › Conference article

@article{70c48afc80714bc88b6e5cd0cb470bed,
title = "Building sense tagged corpus using wikipedia for supervised word sense disambiguation",
abstract = "Building sense-tagged data is a main challenge for supervised techniques, which have achieved promising results in word sense disambiguation. Manually building sense-tagged data is a labor-intensive and time-consuming task because each ambiguous word has to be labeled by linguistic experts in the contexts collected for it. Therefore, this paper proposes a knowledge-based method for building an Arabic sense-tagged corpus from Wikipedia. The method starts by mapping Arabic WordNet to Wikipedia, selecting the Wikipedia article that corresponds to each sense in WordNet. In this mapping step, a cross-lingual method is used to measure the similarity between the features of a Wikipedia article and those of a WordNet sense. Then, the incoming links of Wikipedia articles are exploited to extract instances for each sense of every ambiguous word in WordNet. To handle articles with few instances in Wikipedia, a multiword-based technique is proposed to increase the number of instances per concept. Experimental results show that the cross-lingual method outperforms a monolingual method based on Arabic features only. The resulting sense-tagged corpus covers 50 ambiguous words, yielding 148 senses with 30,961 instances.",
keywords = "Alignment, Arabic WordNet, Similarity Measures, Text Processing, Wikipedia, Word Sense Disambiguation",
author = "Abdulgabbar Saif and Nazlia Omar and Zainodin, {Ummi Zakiah} and {Ab Aziz}, {Mohd Juzaiddin}",
year = "2018",
month = "1",
day = "1",
doi = "10.1016/j.procs.2018.01.062",
language = "English",
volume = "123",
pages = "403--412",
journal = "Procedia Computer Science",
issn = "1877-0509",
publisher = "Elsevier BV",

}

TY - JOUR

T1 - Building sense tagged corpus using wikipedia for supervised word sense disambiguation

AU - Saif, Abdulgabbar

AU - Omar, Nazlia

AU - Zainodin, Ummi Zakiah

AU - Ab Aziz, Mohd Juzaiddin

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Building sense-tagged data is a main challenge for supervised techniques, which have achieved promising results in word sense disambiguation. Manually building sense-tagged data is a labor-intensive and time-consuming task because each ambiguous word has to be labeled by linguistic experts in the contexts collected for it. Therefore, this paper proposes a knowledge-based method for building an Arabic sense-tagged corpus from Wikipedia. The method starts by mapping Arabic WordNet to Wikipedia, selecting the Wikipedia article that corresponds to each sense in WordNet. In this mapping step, a cross-lingual method is used to measure the similarity between the features of a Wikipedia article and those of a WordNet sense. Then, the incoming links of Wikipedia articles are exploited to extract instances for each sense of every ambiguous word in WordNet. To handle articles with few instances in Wikipedia, a multiword-based technique is proposed to increase the number of instances per concept. Experimental results show that the cross-lingual method outperforms a monolingual method based on Arabic features only. The resulting sense-tagged corpus covers 50 ambiguous words, yielding 148 senses with 30,961 instances.

AB - Building sense-tagged data is a main challenge for supervised techniques, which have achieved promising results in word sense disambiguation. Manually building sense-tagged data is a labor-intensive and time-consuming task because each ambiguous word has to be labeled by linguistic experts in the contexts collected for it. Therefore, this paper proposes a knowledge-based method for building an Arabic sense-tagged corpus from Wikipedia. The method starts by mapping Arabic WordNet to Wikipedia, selecting the Wikipedia article that corresponds to each sense in WordNet. In this mapping step, a cross-lingual method is used to measure the similarity between the features of a Wikipedia article and those of a WordNet sense. Then, the incoming links of Wikipedia articles are exploited to extract instances for each sense of every ambiguous word in WordNet. To handle articles with few instances in Wikipedia, a multiword-based technique is proposed to increase the number of instances per concept. Experimental results show that the cross-lingual method outperforms a monolingual method based on Arabic features only. The resulting sense-tagged corpus covers 50 ambiguous words, yielding 148 senses with 30,961 instances.

KW - Alignment

KW - Arabic WordNet

KW - Similarity Measures

KW - Text Processing

KW - Wikipedia

KW - Word Sense Disambiguation

UR - http://www.scopus.com/inward/record.url?scp=85045644130&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045644130&partnerID=8YFLogxK

U2 - 10.1016/j.procs.2018.01.062

DO - 10.1016/j.procs.2018.01.062

M3 - Conference article

AN - SCOPUS:85045644130

VL - 123

SP - 403

EP - 412

JO - Procedia Computer Science

JF - Procedia Computer Science

SN - 1877-0509

ER -