Building compact entity embeddings using Wikidata

Mohamed Lubani, Shahrul Azman Mohd Noah

Research output: Contribution to journal › Article

Abstract

Representing natural language sentences has always been a challenge in statistical language modeling. Atomic discrete representations of words make it difficult to represent semantically related sentences. Other sentence components such as phrases and named entities should be recognized and given representations as units instead of as individual words. Different entity senses should be assigned different representations even when they share identical words. In this paper, we focus on building vector representations (embeddings) of named entities from their contexts to facilitate the task of ontology population, where named entities need to be recognized and disambiguated in natural language text. Given a list of target named entities, Wikidata is used to compensate for the lack of a labeled corpus by building the contexts of all target named entities as well as all their senses. Description text and semantic relations with other named entities are considered when building the contexts from Wikidata. To avoid noisy and uninformative features in the embeddings generated from artificially built contexts, we propose a method to build compact entity representations that sharpens entity embeddings by removing irrelevant features and emphasizing the most detailed ones. An extended version of the Continuous Bag-of-Words (CBOW) model is used to build joint vector representations of words and named entities from the Wikidata contexts. Each entity context is then represented by a subset of elements that maximizes the chances of keeping the most descriptive features of the target entity. The final entity representations are built by compressing the embeddings of the chosen subset using a deep stacked autoencoder model. Cosine similarity and t-SNE visualization are used to evaluate the final entity vectors. Results show that semantically related entities are clustered near each other in the vector space, and entities that appear in similar contexts are assigned similar compact vector representations.
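The core idea — entities whose Wikidata-derived contexts overlap end up with similar vectors — can be illustrated with a toy sketch. This is not the authors' pipeline (the extended CBOW model and the stacked autoencoder are replaced here by plain bag-of-words counts, and the items, descriptions, and relations below are invented for illustration), but it shows how description text and relation labels form an artificial context per entity sense, which is then compared with cosine similarity:

```python
import math
from collections import Counter

def build_context(item):
    """Concatenate an item's description with the labels of its related
    items, mimicking how description text and semantic relations from
    Wikidata are turned into an artificial context for an entity."""
    words = item["description"].lower().split()
    for prop, target in item["relations"]:
        words += prop.lower().split() + target.lower().split()
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Invented mini "Wikidata" items covering two senses of "Jaguar".
items = {
    "jaguar_(animal)": {
        "description": "species of large cat native to the Americas",
        "relations": [("instance of", "taxon"), ("parent taxon", "Panthera")],
    },
    "jaguar_(car_maker)": {
        "description": "British manufacturer of luxury cars",
        "relations": [("instance of", "automobile manufacturer"),
                      ("country", "United Kingdom")],
    },
    "cat": {
        "description": "domesticated species of small cat",
        "relations": [("instance of", "taxon"), ("parent taxon", "Felis")],
    },
}

ctx = {name: build_context(item) for name, item in items.items()}
# The two animal senses share context features (species, cat, taxon, ...),
# so they score higher with each other than with the car manufacturer.
print(cosine(ctx["jaguar_(animal)"], ctx["cat"])
      > cosine(ctx["jaguar_(animal)"], ctx["jaguar_(car_maker)"]))  # → True
```

In the paper's actual method the sparse counts are replaced by dense CBOW-style embeddings compressed through a stacked autoencoder, but the evaluation principle — cosine similarity over context-derived vectors — is the same.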

Original language: English
Pages (from-to): 1437-1445
Number of pages: 9
Journal: International Journal on Advanced Science, Engineering and Information Technology
Volume: 8
Issue number: 4-2
Publication status: Published - 1 Jan 2018

Keywords

  • Entity embeddings
  • Entity vector representations
  • Named entity disambiguation

ASJC Scopus subject areas

  • Computer Science(all)
  • Agricultural and Biological Sciences(all)
  • Engineering(all)

Cite this

Building compact entity embeddings using Wikidata. / Lubani, Mohamed; Mohd Noah, Shahrul Azman.

In: International Journal on Advanced Science, Engineering and Information Technology, Vol. 8, No. 4-2, 01.01.2018, p. 1437-1445. ISSN 2088-5334. Publisher: INSIGHT - Indonesian Society for Knowledge and Human Development.

Research output: Contribution to journal › Article
