Semantic concept model using Wikipedia semantic features

Abdulgabbar Saif, Nazlia Omar, Mohd Juzaiddin Ab Aziz, Ummi Zakiah Zainodin, Naomie Salim

Research output: Contribution to journal > Article

2 Citations (Scopus)

Abstract

Wikipedia has become a high-coverage knowledge source that is used in many research areas, such as natural language processing, text mining and information retrieval. Several methods have been introduced for extracting explicit or implicit relations from Wikipedia to represent the semantics of concepts/words. However, the main challenge in semantic representation is how to incorporate different types of semantic relations so as to capture more semantic evidence of the associations between concepts. In this article, we propose a semantic concept model that incorporates different types of semantic features extracted from Wikipedia. For each concept that corresponds to an article, four semantic features are introduced: template links, categories, salient concepts and topics. The proposed model is based on probability distributions defined over these semantic features of a Wikipedia concept. Template links and categories are document-level features, extracted directly from the structured information included in the article. Salient concepts and topics, on the other hand, are corpus-level features, extracted to capture implicit relations among concepts. For the salient concepts feature, a distributional-based method is applied to the hypertext corpus to extract this feature for each Wikipedia concept; the probability product kernel is then used to improve the weight of each concept in this feature. For the topic feature, Labelled latent Dirichlet allocation is adapted to the supervised multi-label setting of Wikipedia to train the probabilistic model of this feature. Finally, linear interpolation is used to incorporate these semantic features into the probabilistic model, which estimates the semantic relation probability of a specific concept over Wikipedia articles.
The proposed model is evaluated on 12 benchmark datasets in three natural language processing tasks: measuring the semantic relatedness of concepts/words in general and in the biomedical domain, measuring semantic textual relatedness, and measuring the semantic compositionality of noun compounds. The model is also compared with five methods that depend on separate semantic features in Wikipedia. Experimental results show that the proposed model achieves promising results in the three tasks and outperforms the baseline methods on most of the evaluation datasets. This implies that incorporating explicit and implicit semantic features is useful for representing the semantics of concepts in Wikipedia.
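
As a rough illustration of the final combination step described in the abstract, the following minimal sketch linearly interpolates four per-feature probability distributions over Wikipedia articles, and also shows the standard form of the probability product kernel between two discrete distributions. It is not the authors' implementation: the function names, the toy distributions and the equal interpolation weights are all assumptions made for the example, and the abstract does not specify how the weights are chosen or exactly how the kernel is applied to re-weight salient concepts.

```python
from typing import Dict, Mapping

# A distribution maps an article title to a probability (hypothetical representation).
Distribution = Mapping[str, float]


def interpolate(features: Mapping[str, Distribution],
                weights: Mapping[str, float]) -> Dict[str, float]:
    """Linearly interpolate per-feature distributions over articles.

    combined(a) = sum over features f of weights[f] * features[f](a)
    """
    if set(features) != set(weights):
        raise ValueError("each feature needs exactly one weight")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    combined: Dict[str, float] = {}
    for name, dist in features.items():
        for article, prob in dist.items():
            combined[article] = combined.get(article, 0.0) + weights[name] * prob
    return combined


def probability_product_kernel(p: Distribution, q: Distribution,
                               rho: float = 0.5) -> float:
    """Probability product kernel: sum over x of (p(x) * q(x)) ** rho.

    With rho = 0.5 this is the Bhattacharyya kernel. Articles missing from
    either distribution contribute zero, so only shared keys are summed.
    """
    shared = p.keys() & q.keys()
    return sum((p[k] * q[k]) ** rho for k in shared)


# Toy data (assumed for illustration; articles "A", "B", "C" are placeholders).
toy_features = {
    "template_links": {"A": 0.5, "B": 0.5},
    "categories": {"A": 1.0},
    "salient_concepts": {"B": 0.25, "C": 0.75},
    "topics": {"C": 1.0},
}
# Equal weights are an assumption; the abstract does not give the values used.
toy_weights = {name: 0.25 for name in toy_features}

combined = interpolate(toy_features, toy_weights)
```

Because each input is a probability distribution and the weights sum to one, the interpolated result is again a probability distribution over articles.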

Original language: English
Journal: Journal of Information Science
DOI: 10.1177/0165551517706231
Publication status: Accepted/In press - 1 Apr 2017

Fingerprint

Wikipedia
Semantics
Text processing
Hypertext
Language
Information retrieval

Keywords

  • Semantic relatedness
  • semantic representation
  • topic models
  • Wikipedia features

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

Semantic concept model using Wikipedia semantic features. / Saif, Abdulgabbar; Omar, Nazlia; Ab Aziz, Mohd Juzaiddin; Zainodin, Ummi Zakiah; Salim, Naomie.

In: Journal of Information Science, 01.04.2017.

@article{574c81ddd25243cc9fbb2967f8adea76,
title = "Semantic concept model using Wikipedia semantic features",
keywords = "Semantic relatedness, semantic representation, topic models, Wikipedia features",
author = "Abdulgabbar Saif and Nazlia Omar and {Ab Aziz}, {Mohd Juzaiddin} and Zainodin, {Ummi Zakiah} and Naomie Salim",
year = "2017",
month = "4",
day = "1",
doi = "10.1177/0165551517706231",
language = "English",
journal = "Journal of Information Science",
issn = "0165-5515",
publisher = "SAGE Publications Ltd",

}

TY - JOUR

T1 - Semantic concept model using Wikipedia semantic features

AU - Saif, Abdulgabbar

AU - Omar, Nazlia

AU - Ab Aziz, Mohd Juzaiddin

AU - Zainodin, Ummi Zakiah

AU - Salim, Naomie

PY - 2017/4/1

Y1 - 2017/4/1

KW - Semantic relatedness

KW - semantic representation

KW - topic models

KW - Wikipedia features

UR - http://www.scopus.com/inward/record.url?scp=85042370865&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85042370865&partnerID=8YFLogxK

U2 - 10.1177/0165551517706231

DO - 10.1177/0165551517706231

M3 - Article

JO - Journal of Information Science

JF - Journal of Information Science

SN - 0165-5515

ER -