Comparative analysis of different data representations for the task of chemical compound extraction

Basel Alshaikhdeeb, Kamsuriah Ahmad

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Chemical Compound Extraction refers to the task of recognizing chemical instances such as oxygen, nitrogen, and others. The majority of studies that addressed the task of chemical compound extraction used machine-learning techniques. The key challenge behind using machine-learning techniques lies in employing a robust set of features. The literature shows that there are numerous types of features used in the task of chemical compound extraction. Such dimensionality of features can be determined via data representation. Some researchers have used N-gram representation for biomedical named entity recognition, where the most significant terms are represented as features. Meanwhile, others have used detailed-attribute representation, in which the features are generalized. As a result, identifying the best combination of features to yield high-accuracy classification becomes challenging. This paper aims to apply the Wrapper Subset Selection approach using two data representations: N-gram and detailed-attributes. Since each data representation would suit a specific classification algorithm, two classifiers were utilized: Naïve Bayes (for detailed-attributes) and Support Vector Machine (for N-gram). The results show that the application of feature selection using detailed-attributes outperformed that of N-gram representation by achieving a 0.722 F-measure. Beyond the higher classification accuracy, the selected features using detailed-attribute representation have more meaning and can be applied to further datasets.
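
The wrapper idea described in the abstract can be illustrated with a minimal sketch: candidate feature subsets are scored by training and evaluating the classifier itself, and a feature is kept only if it raises the F-measure. Everything below is an assumption made for illustration and is not taken from the paper: the four attribute names, the toy tokens and labels, the greedy forward search, the leave-one-out evaluation, and the hand-written Bernoulli Naïve Bayes are all hypothetical stand-ins for whatever tooling and feature list the authors actually used.

```python
import math
from collections import Counter, defaultdict


def detailed_attributes(token):
    """Hypothetical generalized attributes for one token (not the paper's list)."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "has_upper": any(c.isupper() for c in token),
        "is_long": len(token) > 8,
    }


def train_nb(rows, labels, feats):
    """Count class frequencies and per-class True counts for each chosen feature."""
    class_counts = Counter(labels)
    true_counts = defaultdict(int)  # (label, feature) -> number of True values
    for row, y in zip(rows, labels):
        for f in feats:
            true_counts[(y, f)] += row[f]
    return class_counts, true_counts


def predict_nb(model, row, feats):
    """Bernoulli Naive Bayes prediction with add-one smoothing."""
    class_counts, true_counts = model
    total = sum(class_counts.values())
    best_label, best_lp = None, -math.inf
    for y, n in sorted(class_counts.items()):
        lp = math.log(n / total)
        for f in feats:
            p = (true_counts[(y, f)] + 1) / (n + 2)
            lp += math.log(p if row[f] else 1 - p)
        if lp > best_lp:
            best_label, best_lp = y, lp
    return best_label


def loo_f1(rows, labels, feats, positive=1):
    """Leave-one-out F-measure of the positive class for a given feature subset."""
    tp = fp = fn = 0
    for i in range(len(rows)):
        model = train_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:], feats)
        pred = predict_nb(model, rows[i], feats)
        if pred == positive and labels[i] == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif labels[i] == positive:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def wrapper_forward_selection(rows, labels, candidates):
    """Greedy forward search: add the feature that most improves the F-measure."""
    selected, best = [], loo_f1(rows, labels, [])
    improved = True
    while improved:
        improved, choice = False, None
        for f in candidates:
            if f in selected:
                continue
            score = loo_f1(rows, labels, selected + [f])
            if score > best:
                best, choice, improved = score, f, True
        if improved:
            selected.append(choice)
    return selected, best


# Toy data (invented): 1 = chemical mention, 0 = other token.
tokens = ["2-chloroethanol", "NaCl", "H2O", "benzene-1,2-diol",
          "the", "patient", "was", "given", "treatment", "daily"]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
rows = [detailed_attributes(t) for t in tokens]
candidates = list(rows[0].keys())

selected, best = wrapper_forward_selection(rows, labels, candidates)
print("selected attributes:", selected, "F-measure:", round(best, 3))
```

Because the search is greedy, it may stop at a subset that is not globally optimal; an exhaustive or best-first search over subsets would be a drop-in replacement for `wrapper_forward_selection` at higher cost.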

Original language: English
Pages (from-to): 2189-2195
Number of pages: 7
Journal: International Journal on Advanced Science, Engineering and Information Technology
Volume: 8
Issue number: 5
DOIs: 10.18517/ijaseit.8.5.6432
Publication status: Published - 1 Jan 2018

Fingerprint

Chemical compounds
Data analysis
Artificial intelligence
Learning systems
Support vector machines
Feature extraction
Classifiers
Nitrogen
Researchers
Research Personnel
Oxygen
Methodology
Machine Learning

Keywords

  • Attribute selection
  • Chemical compounds extraction
  • Data representation
  • Detailed-attributes
  • N-gram
  • Naïve Bayes
  • Support vector machine

ASJC Scopus subject areas

  • Computer Science(all)
  • Agricultural and Biological Sciences(all)
  • Engineering(all)

Cite this

@article{ad433f18b9354c5383b2518ef5635112,
title = "Comparative analysis of different data representations for the task of chemical compound extraction",
abstract = "Chemical Compound Extraction refers to the task of recognizing chemical instances such as oxygen, nitrogen, and others. The majority of studies that addressed the task of chemical compound extraction used machine-learning techniques. The key challenge behind using machine-learning techniques lies in employing a robust set of features. The literature shows that there are numerous types of features used in the task of chemical compound extraction. Such dimensionality of features can be determined via data representation. Some researchers have used N-gram representation for biomedical named entity recognition, where the most significant terms are represented as features. Meanwhile, others have used detailed-attribute representation, in which the features are generalized. As a result, identifying the best combination of features to yield high-accuracy classification becomes challenging. This paper aims to apply the Wrapper Subset Selection approach using two data representations: N-gram and detailed-attributes. Since each data representation would suit a specific classification algorithm, two classifiers were utilized: Na{\"i}ve Bayes (for detailed-attributes) and Support Vector Machine (for N-gram). The results show that the application of feature selection using detailed-attributes outperformed that of N-gram representation by achieving a 0.722 F-measure. Beyond the higher classification accuracy, the selected features using detailed-attribute representation have more meaning and can be applied to further datasets.",
keywords = "Attribute selection, Chemical compounds extraction, Data representation, Detailed-attributes, N-gram, Na{\"i}ve Bayes, Support vector machine",
author = "Basel Alshaikhdeeb and Kamsuriah Ahmad",
year = "2018",
month = "1",
day = "1",
doi = "10.18517/ijaseit.8.5.6432",
language = "English",
volume = "8",
pages = "2189--2195",
journal = "International Journal on Advanced Science, Engineering and Information Technology",
issn = "2088-5334",
publisher = "INSIGHT - Indonesian Society for Knowledge and Human Development",
number = "5",

}

TY - JOUR

T1 - Comparative analysis of different data representations for the task of chemical compound extraction

AU - Alshaikhdeeb, Basel

AU - Ahmad, Kamsuriah

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Chemical Compound Extraction refers to the task of recognizing chemical instances such as oxygen, nitrogen, and others. The majority of studies that addressed the task of chemical compound extraction used machine-learning techniques. The key challenge behind using machine-learning techniques lies in employing a robust set of features. The literature shows that there are numerous types of features used in the task of chemical compound extraction. Such dimensionality of features can be determined via data representation. Some researchers have used N-gram representation for biomedical named entity recognition, where the most significant terms are represented as features. Meanwhile, others have used detailed-attribute representation, in which the features are generalized. As a result, identifying the best combination of features to yield high-accuracy classification becomes challenging. This paper aims to apply the Wrapper Subset Selection approach using two data representations: N-gram and detailed-attributes. Since each data representation would suit a specific classification algorithm, two classifiers were utilized: Naïve Bayes (for detailed-attributes) and Support Vector Machine (for N-gram). The results show that the application of feature selection using detailed-attributes outperformed that of N-gram representation by achieving a 0.722 F-measure. Beyond the higher classification accuracy, the selected features using detailed-attribute representation have more meaning and can be applied to further datasets.

KW - Attribute selection

KW - Chemical compounds extraction

KW - Data representation

KW - Detailed-attributes

KW - N-gram

KW - Naïve Bayes

KW - Support vector machine

UR - http://www.scopus.com/inward/record.url?scp=85056265042&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85056265042&partnerID=8YFLogxK

U2 - 10.18517/ijaseit.8.5.6432

DO - 10.18517/ijaseit.8.5.6432

M3 - Article

VL - 8

SP - 2189

EP - 2195

JO - International Journal on Advanced Science, Engineering and Information Technology

JF - International Journal on Advanced Science, Engineering and Information Technology

SN - 2088-5334

IS - 5

ER -