A comparative study of word representation methods with conditional random fields and maximum entropy markov for bio-named entity recognition

Maan Tareq Abd, Masnizah Mohd

Research output: Contribution to journalArticle

Abstract

Bio-Named Entity Recognition (Bio-NER) is the process of identifying and semantically classifying biomedical technical terms and named entities in Biomedicine literature. Therefore, it is a major task in biomedical knowledge acquisition. Meanwhile, Natural Language Processing (NLP) plays an important role in Bio-NER in the biomedical domain. The first and most essential biomedical literature mining task incorporates biomedical entity recognition such as protein, gene, and chemicals. The most recent Bio-NER methods rely on predefined traditional features, which attempt to capture the specific surface properties of entity types. However, these empirically predefined feature sets differ between entity types and are manually constructed and complicated, which means developing them is costly. In this paper, we systematically present a comparative evaluation study of three methods, which are: the traditional feature representation method, the continuous bag-of-words (CBOW) model, and a new prototypical representation method with two popular sequence-labeling approaches (Conditional Random Fields (CRFs) and Maximum Entropy Markov Models (MEMM)). We evaluated these models with two major Bio-NER tasks, which involve the JNLPBA and GENETAG corpora. This paper examined the prototypical word representation method and found that Word2Vec can be successfully used for Bio-NER. Our results show that the new prototypical representation method improved the performance of the two machine learning models with different datasets. Also, the new prototypical representation method performed better than the traditional feature representation method and CBOW model for both datasets. Finally, our experiment proved that the CRF classifier with the new prototypical representation method achieved the best results when 90% data was used as training data, yielding overall F-measure values of 0.79% and 0.85% for the JNLPBA corpus and GENETAG corpus, respectively. In comparison, the results achieved using the ME classifier yielded overall F-measure values of 0.76% and 0.78% for the JNLPBA corpus and GENETAG corpus, respectively.

Original languageEnglish
Pages (from-to)15-30
Number of pages16
JournalMalaysian Journal of Computer Science
Volume31
Issue number5
DOIs
Publication statusPublished - 1 Jan 2018

Fingerprint

Entropy
Classifiers
Knowledge acquisition
Labeling
Surface properties
Learning systems
Genes
Proteins
Processing
Experiments

Keywords

  • Biomedical named entity
  • Data representation methods
  • Prototypical representation
  • Word2Vec

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

@article{74109dbd619345578e990a1f34a7dfe1,
title = "A comparative study of word representation methods with conditional random fields and maximum entropy markov for bio-named entity recognition",
abstract = "Bio-Named Entity Recognition (Bio-NER) is the process of identifying and semantically classifying biomedical technical terms and named entities in Biomedicine literature. Therefore, it is a major task in biomedical knowledge acquisition. Meanwhile, Natural Language Processing (NLP) plays an important role in Bio-NER in the biomedical domain. The first and most essential biomedical literature mining task incorporates biomedical entity recognition such as protein, gene, and chemicals. The most recent Bio-NER methods rely on predefined traditional features, which attempt to capture the specific surface properties of entity types. However, these empirically predefined feature sets differ between entity types and are manually constructed and complicated, which means developing them is costly. In this paper, we systematically present a comparative evaluation study of three methods, which are: the traditional feature representation method, the continuous bag-of-words (CBOW) model, and a new prototypical representation method with two popular sequence-labeling approaches (Conditional Random Fields (CRFs) and Maximum Entropy Markov Models (MEMM)). We evaluated these models with two major Bio-NER tasks, which involve the JNLPBA and GENETAG corpora. This paper examined the prototypical word representation method and found that Word2Vec can be successfully used for Bio-NER. Our results show that the new prototypical representation method improved the performance of the two machine learning models with different datasets. Also, the new prototypical representation method performed better than the traditional feature representation method and CBOW model for both datasets. Finally, our experiment proved that the CRF classifier with the new prototypical representation method achieved the best results when 90{\%} data was used as training data, yielding overall F-measure values of 0.79{\%} and 0.85{\%} for the JNLPBA corpus and GENETAG corpus, respectively. In comparison, the results achieved using the ME classifier yielded overall F-measure values of 0.76{\%} and 0.78{\%} for the JNLPBA corpus and GENETAG corpus, respectively.",
keywords = "Biomedical named entity, Data representation methods, Prototypical representation, Word2Vec",
author = "Abd, {Maan Tareq} and Masnizah Mohd",
year = "2018",
month = "1",
day = "1",
doi = "10.22452/mjcs.sp2018no1.2",
language = "English",
volume = "31",
pages = "15--30",
journal = "Malaysian Journal of Computer Science",
issn = "0127-9084",
publisher = "Faculty of Computer Science and Information Technology",
number = "5",

}

TY - JOUR

T1 - A comparative study of word representation methods with conditional random fields and maximum entropy markov for bio-named entity recognition

AU - Abd, Maan Tareq

AU - Mohd, Masnizah

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Bio-Named Entity Recognition (Bio-NER) is the process of identifying and semantically classifying biomedical technical terms and named entities in Biomedicine literature. Therefore, it is a major task in biomedical knowledge acquisition. Meanwhile, Natural Language Processing (NLP) plays an important role in Bio-NER in the biomedical domain. The first and most essential biomedical literature mining task incorporates biomedical entity recognition such as protein, gene, and chemicals. The most recent Bio-NER methods rely on predefined traditional features, which attempt to capture the specific surface properties of entity types. However, these empirically predefined feature sets differ between entity types and are manually constructed and complicated, which means developing them is costly. In this paper, we systematically present a comparative evaluation study of three methods, which are: the traditional feature representation method, the continuous bag-of-words (CBOW) model, and a new prototypical representation method with two popular sequence-labeling approaches (Conditional Random Fields (CRFs) and Maximum Entropy Markov Models (MEMM)). We evaluated these models with two major Bio-NER tasks, which involve the JNLPBA and GENETAG corpora. This paper examined the prototypical word representation method and found that Word2Vec can be successfully used for Bio-NER. Our results show that the new prototypical representation method improved the performance of the two machine learning models with different datasets. Also, the new prototypical representation method performed better than the traditional feature representation method and CBOW model for both datasets. Finally, our experiment proved that the CRF classifier with the new prototypical representation method achieved the best results when 90% data was used as training data, yielding overall F-measure values of 0.79% and 0.85% for the JNLPBA corpus and GENETAG corpus, respectively. In comparison, the results achieved using the ME classifier yielded overall F-measure values of 0.76% and 0.78% for the JNLPBA corpus and GENETAG corpus, respectively.

AB - Bio-Named Entity Recognition (Bio-NER) is the process of identifying and semantically classifying biomedical technical terms and named entities in Biomedicine literature. Therefore, it is a major task in biomedical knowledge acquisition. Meanwhile, Natural Language Processing (NLP) plays an important role in Bio-NER in the biomedical domain. The first and most essential biomedical literature mining task incorporates biomedical entity recognition such as protein, gene, and chemicals. The most recent Bio-NER methods rely on predefined traditional features, which attempt to capture the specific surface properties of entity types. However, these empirically predefined feature sets differ between entity types and are manually constructed and complicated, which means developing them is costly. In this paper, we systematically present a comparative evaluation study of three methods, which are: the traditional feature representation method, the continuous bag-of-words (CBOW) model, and a new prototypical representation method with two popular sequence-labeling approaches (Conditional Random Fields (CRFs) and Maximum Entropy Markov Models (MEMM)). We evaluated these models with two major Bio-NER tasks, which involve the JNLPBA and GENETAG corpora. This paper examined the prototypical word representation method and found that Word2Vec can be successfully used for Bio-NER. Our results show that the new prototypical representation method improved the performance of the two machine learning models with different datasets. Also, the new prototypical representation method performed better than the traditional feature representation method and CBOW model for both datasets. Finally, our experiment proved that the CRF classifier with the new prototypical representation method achieved the best results when 90% data was used as training data, yielding overall F-measure values of 0.79% and 0.85% for the JNLPBA corpus and GENETAG corpus, respectively. In comparison, the results achieved using the ME classifier yielded overall F-measure values of 0.76% and 0.78% for the JNLPBA corpus and GENETAG corpus, respectively.

KW - Biomedical named entity

KW - Data representation methods

KW - Prototypical representation

KW - Word2Vec

UR - http://www.scopus.com/inward/record.url?scp=85065496307&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85065496307&partnerID=8YFLogxK

U2 - 10.22452/mjcs.sp2018no1.2

DO - 10.22452/mjcs.sp2018no1.2

M3 - Article

AN - SCOPUS:85065496307

VL - 31

SP - 15

EP - 30

JO - Malaysian Journal of Computer Science

JF - Malaysian Journal of Computer Science

SN - 0127-9084

IS - 5

ER -