Reducing explicit semantic representation vectors using Latent Dirichlet Allocation

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

Explicit Semantic Analysis (ESA) is a knowledge-based method that builds the semantic representation of words from the textual descriptions of concepts in a given knowledge source. Due to its simplicity and success, ESA has received wide attention from researchers in computational linguistics and information retrieval. However, the representation vectors formed by the ESA method are generally excessive, high-dimensional, and may contain many redundant concepts. In this paper, we introduce a reduced semantic representation method that constructs the semantic interpretation of words as vectors over latent topics derived from the original ESA representation vectors. To model the latent topics, Latent Dirichlet Allocation (LDA) is adapted to the ESA vectors so that topics are extracted as probability distributions over concepts rather than over words, as in the traditional model. The proposed method is applied to two knowledge sources widely used in computational semantic analysis: WordNet and Wikipedia. For evaluation, we use the proposed method in two natural language processing tasks: measuring the semantic relatedness between words/texts and text clustering. The experimental results indicate that the proposed method overcomes the limitations of the representation of the ESA method.

Original language: English
Journal: Knowledge-Based Systems
DOIs: 10.1016/j.knosys.2016.03.002
Publication status: Accepted/In press - 18 Feb 2015

Keywords

  • Explicit Semantic Analysis
  • Knowledge-based method
  • Semantic representation
  • Topic modeling

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Management Information Systems
  • Information Systems and Management

Cite this

@article{a5fca1af055248c8b4e2af7cf7bd2ce8,
title = "Reducing explicit semantic representation vectors using Latent Dirichlet Allocation",
abstract = "Explicit Semantic Analysis (ESA) is a knowledge-based method that builds the semantic representation of words from the textual descriptions of concepts in a given knowledge source. Due to its simplicity and success, ESA has received wide attention from researchers in computational linguistics and information retrieval. However, the representation vectors formed by the ESA method are generally excessive, high-dimensional, and may contain many redundant concepts. In this paper, we introduce a reduced semantic representation method that constructs the semantic interpretation of words as vectors over latent topics derived from the original ESA representation vectors. To model the latent topics, Latent Dirichlet Allocation (LDA) is adapted to the ESA vectors so that topics are extracted as probability distributions over concepts rather than over words, as in the traditional model. The proposed method is applied to two knowledge sources widely used in computational semantic analysis: WordNet and Wikipedia. For evaluation, we use the proposed method in two natural language processing tasks: measuring the semantic relatedness between words/texts and text clustering. The experimental results indicate that the proposed method overcomes the limitations of the representation of the ESA method.",
keywords = "Explicit Semantic Analysis, Knowledge-based method, Semantic representation, Topic modeling",
author = "Abdulgabbar Saif and {Ab Aziz}, {Mohd Juzaiddin} and Nazlia Omar",
year = "2015",
month = "2",
day = "18",
doi = "10.1016/j.knosys.2016.03.002",
language = "English",
journal = "Knowledge-Based Systems",
issn = "0950-7051",
publisher = "Elsevier",

}

TY - JOUR

T1 - Reducing explicit semantic representation vectors using Latent Dirichlet Allocation

AU - Saif, Abdulgabbar

AU - Ab Aziz, Mohd Juzaiddin

AU - Omar, Nazlia

PY - 2015/2/18

Y1 - 2015/2/18

N2 - Explicit Semantic Analysis (ESA) is a knowledge-based method that builds the semantic representation of words from the textual descriptions of concepts in a given knowledge source. Due to its simplicity and success, ESA has received wide attention from researchers in computational linguistics and information retrieval. However, the representation vectors formed by the ESA method are generally excessive, high-dimensional, and may contain many redundant concepts. In this paper, we introduce a reduced semantic representation method that constructs the semantic interpretation of words as vectors over latent topics derived from the original ESA representation vectors. To model the latent topics, Latent Dirichlet Allocation (LDA) is adapted to the ESA vectors so that topics are extracted as probability distributions over concepts rather than over words, as in the traditional model. The proposed method is applied to two knowledge sources widely used in computational semantic analysis: WordNet and Wikipedia. For evaluation, we use the proposed method in two natural language processing tasks: measuring the semantic relatedness between words/texts and text clustering. The experimental results indicate that the proposed method overcomes the limitations of the representation of the ESA method.

AB - Explicit Semantic Analysis (ESA) is a knowledge-based method that builds the semantic representation of words from the textual descriptions of concepts in a given knowledge source. Due to its simplicity and success, ESA has received wide attention from researchers in computational linguistics and information retrieval. However, the representation vectors formed by the ESA method are generally excessive, high-dimensional, and may contain many redundant concepts. In this paper, we introduce a reduced semantic representation method that constructs the semantic interpretation of words as vectors over latent topics derived from the original ESA representation vectors. To model the latent topics, Latent Dirichlet Allocation (LDA) is adapted to the ESA vectors so that topics are extracted as probability distributions over concepts rather than over words, as in the traditional model. The proposed method is applied to two knowledge sources widely used in computational semantic analysis: WordNet and Wikipedia. For evaluation, we use the proposed method in two natural language processing tasks: measuring the semantic relatedness between words/texts and text clustering. The experimental results indicate that the proposed method overcomes the limitations of the representation of the ESA method.

KW - Explicit Semantic Analysis

KW - Knowledge-based method

KW - Semantic representation

KW - Topic modeling

UR - http://www.scopus.com/inward/record.url?scp=84977992467&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84977992467&partnerID=8YFLogxK

U2 - 10.1016/j.knosys.2016.03.002

DO - 10.1016/j.knosys.2016.03.002

M3 - Article

AN - SCOPUS:84977992467

JO - Knowledge-Based Systems

JF - Knowledge-Based Systems

SN - 0950-7051

ER -