New Information Content Glossary Relatedness (ICGR) approach for Short Text Similarity (STS) tasks

Ali Muftah BenOmran, Mohd Juzaiddin Ab Aziz

Research output: Contribution to journal > Article

Abstract

Measures of semantic word relatedness that draw on complementary Wikipedia-based and WordNet-based methods take two forms, combined and integrative, both of which aim to increase the semantic space between related words. However, each form has its own issues, arising from its components and from the strategy used to combine or integrate corpus-based and knowledge-based methods. In the integrative strategy, a large corpus such as Wikipedia is used to extract a set of words related to a particular concept, and this set serves as the basis for searching the WordNet space. The drawback of this strategy is its fixed scaling parameter, which fits only an implemented dataset whose scores are close to human scores. Other corpus-based methods use an experimentally determined cut-off threshold to reduce the semantic space and narrow the search to a more accurate semantic space. Such methods consider only the frequency of bigrams and ignore the frequency of individual terms. Knowledge-based methods that use gloss overlap have a similar limitation: they lose many valuable relatedness features that would support a more accurate measurement. This paper therefore proposes a new Information Content Glossary Relatedness (ICGR) approach in two steps. First, an Extended-PMI based on a cut-off density threshold extracts a Robust Relatedness Vector set (RVS) from a large Wikipedia dataset. Second, a Semantic Structural Information (SSI) method uses the RVS as a fulcrum to identify the most related gloss in WordNet for each gloss and to select the top 5 glosses related to each RVS. The results showed that the proposed approach outperformed the state-of-the-art methods: the Extended-PMI achieved a Spearman's correlation of 0.89 with human scores, and the ICGR approach achieved a Spearman's correlation of 0.8 with human scores.
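
As a rough illustration of the corpus-based step, the sketch below computes pointwise mutual information (PMI) from both pair counts and individual term counts, then keeps only the terms above a cut-off. The abstract does not give the Extended-PMI formula or the density threshold, so this uses standard document-level PMI with a hand-picked cutoff. The names `pmi_table` and `related_terms` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Return {(w1, w2): PMI} for every word pair that co-occurs in a document.

    Standard PMI, not the paper's Extended-PMI (assumption for illustration).
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms
    for tokens in docs:
        vocab = sorted(set(tokens))  # sorted so each pair has one key order
        term_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (w1, w2), joint in pair_df.items():
        p_joint = joint / n_docs
        p_w1 = term_df[w1] / n_docs
        p_w2 = term_df[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return scores


def related_terms(target, scores, cutoff):
    """Terms whose PMI with `target` is at least `cutoff`, strongest first."""
    hits = []
    for (w1, w2), score in scores.items():
        if score >= cutoff and target in (w1, w2):
            hits.append((w2 if w1 == target else w1, score))
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Toy corpus standing in for Wikipedia text (hypothetical data).
docs = [
    ["car", "engine", "road"],
    ["car", "engine", "fuel"],
    ["bank", "money", "loan"],
    ["car", "road", "bank"],
]
scores = pmi_table(docs)
print(related_terms("car", scores, cutoff=0.2))
```

Because the individual term probabilities appear in the denominator, a pair that co-occurs only as often as chance would predict scores near zero and falls below the cutoff, which is the behaviour that a bigram-frequency-only measure lacks.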

Original language: English
Pages (from-to): 769-784
Number of pages: 16
Journal: Journal of Computer Science
Volume: 15
Issue number: 6
DOI: 10.3844/jcssp.2019.769.784
Publication status: Published - 1 Jan 2019

Fingerprint

Glossaries
Semantics

Keywords

  • Gloss
  • PMI
  • Structural information
  • STS
  • Wikipedia
  • WordNet

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

New Information Content Glossary Relatedness (ICGR) approach for Short Text Similarity (STS) tasks. / BenOmran, Ali Muftah; Ab Aziz, Mohd Juzaiddin.

In: Journal of Computer Science, Vol. 15, No. 6, 01.01.2019, p. 769-784.

@article{84b16eaf7e7841009155db55a82cb8d6,
title = "New Information Content Glossary Relatedness (ICGR) approach for Short Text Similarity (STS) tasks",
abstract = "Measures of semantic word relatedness that draw on complementary Wikipedia-based and WordNet-based methods take two forms, combined and integrative, both of which aim to increase the semantic space between related words. However, each form has its own issues, arising from its components and from the strategy used to combine or integrate corpus-based and knowledge-based methods. In the integrative strategy, a large corpus such as Wikipedia is used to extract a set of words related to a particular concept, and this set serves as the basis for searching the WordNet space. The drawback of this strategy is its fixed scaling parameter, which fits only an implemented dataset whose scores are close to human scores. Other corpus-based methods use an experimentally determined cut-off threshold to reduce the semantic space and narrow the search to a more accurate semantic space. Such methods consider only the frequency of bigrams and ignore the frequency of individual terms. Knowledge-based methods that use gloss overlap have a similar limitation: they lose many valuable relatedness features that would support a more accurate measurement. This paper therefore proposes a new Information Content Glossary Relatedness (ICGR) approach in two steps. First, an Extended-PMI based on a cut-off density threshold extracts a Robust Relatedness Vector set (RVS) from a large Wikipedia dataset. Second, a Semantic Structural Information (SSI) method uses the RVS as a fulcrum to identify the most related gloss in WordNet for each gloss and to select the top 5 glosses related to each RVS. The results showed that the proposed approach outperformed the state-of-the-art methods: the Extended-PMI achieved a Spearman's correlation of 0.89 with human scores, and the ICGR approach achieved a Spearman's correlation of 0.8 with human scores.",
keywords = "Gloss, PMI, Structural information, STS, Wikipedia, WordNet",
author = "BenOmran, {Ali Muftah} and {Ab Aziz}, {Mohd Juzaiddin}",
year = "2019",
month = "1",
day = "1",
doi = "10.3844/jcssp.2019.769.784",
language = "English",
volume = "15",
pages = "769--784",
journal = "Journal of Computer Science",
issn = "1549-3636",
publisher = "Science Publications",
number = "6",

}

TY - JOUR

T1 - New Information Content Glossary Relatedness (ICGR) approach for Short Text Similarity (STS) tasks

AU - BenOmran, Ali Muftah

AU - Ab Aziz, Mohd Juzaiddin

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Measures of semantic word relatedness that draw on complementary Wikipedia-based and WordNet-based methods take two forms, combined and integrative, both of which aim to increase the semantic space between related words. However, each form has its own issues, arising from its components and from the strategy used to combine or integrate corpus-based and knowledge-based methods. In the integrative strategy, a large corpus such as Wikipedia is used to extract a set of words related to a particular concept, and this set serves as the basis for searching the WordNet space. The drawback of this strategy is its fixed scaling parameter, which fits only an implemented dataset whose scores are close to human scores. Other corpus-based methods use an experimentally determined cut-off threshold to reduce the semantic space and narrow the search to a more accurate semantic space. Such methods consider only the frequency of bigrams and ignore the frequency of individual terms. Knowledge-based methods that use gloss overlap have a similar limitation: they lose many valuable relatedness features that would support a more accurate measurement. This paper therefore proposes a new Information Content Glossary Relatedness (ICGR) approach in two steps. First, an Extended-PMI based on a cut-off density threshold extracts a Robust Relatedness Vector set (RVS) from a large Wikipedia dataset. Second, a Semantic Structural Information (SSI) method uses the RVS as a fulcrum to identify the most related gloss in WordNet for each gloss and to select the top 5 glosses related to each RVS. The results showed that the proposed approach outperformed the state-of-the-art methods: the Extended-PMI achieved a Spearman's correlation of 0.89 with human scores, and the ICGR approach achieved a Spearman's correlation of 0.8 with human scores.

KW - Gloss

KW - PMI

KW - Structural information

KW - STS

KW - Wikipedia

KW - WordNet

UR - http://www.scopus.com/inward/record.url?scp=85071369751&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85071369751&partnerID=8YFLogxK

U2 - 10.3844/jcssp.2019.769.784

DO - 10.3844/jcssp.2019.769.784

M3 - Article

AN - SCOPUS:85071369751

VL - 15

SP - 769

EP - 784

JO - Journal of Computer Science

JF - Journal of Computer Science

SN - 1549-3636

IS - 6

ER -