An optimized feature set based on genetic algorithm for business web pages named entity recognition

Mohammed Moath Abdulghani, Sabrina Tiun

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Named Entity Recognition (NER) is the task of identifying proper nouns such as the names of people, corporations and places, as well as dates. Recently, extracting information from web pages has attracted researchers’ attention because of the valuable information such pages contain, most commonly named entities (NEs). However, web pages also contain further entity types such as addresses, URLs, and telephone and fax numbers. Given the variety of features that have been used for NER, identifying the best feature set for extracting NEs from web pages is a challenging task. This paper proposes an optimized feature set, selected with a genetic algorithm (GA), for extracting NEs from web pages. The dataset was collected from business web pages. The feature set consists of text features such as n-grams and web features such as block position and font type. Finally, an SVM classifier was used to classify the NEs. The results show that the genetic algorithm is able to identify the most accurate features.
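The feature-selection idea in the abstract can be illustrated with a minimal sketch, given here under stated assumptions rather than as the authors' implementation. The abstract does not give the encoding, GA operators, or parameters, so the bit-mask encoding, tournament selection, one-point crossover, elitism, and all default values below are assumptions. The name `toy_fitness` and the set `INFORMATIVE` are hypothetical stand-ins: in the paper the quality of a feature subset would come from the SVM classifier, whereas this sketch swaps in a simple scoring function so that it runs with the standard library only.

```python
import random
from typing import Callable, List

Mask = List[int]  # 1 = feature kept, 0 = feature dropped


def ga_select(n_features: int,
              fitness: Callable[[Mask], float],
              pop_size: int = 20,
              generations: int = 30,
              crossover_rate: float = 0.8,
              mutation_rate: float = 0.05,
              seed: int = 0) -> Mask:
    """Search for a high-fitness feature mask with a simple genetic algorithm."""
    rng = random.Random(seed)
    # Start from random bit masks, one bit per candidate feature.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        scored = [(fitness(mask), mask) for mask in population]

        def tournament() -> Mask:
            # Pick two distinct individuals and keep the fitter one.
            a, b = rng.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        # Elitism: carry the best mask found so far into the next generation.
        next_population = [best[:]]
        while len(next_population) < pop_size:
            parent1, parent2 = tournament(), tournament()
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # one-point crossover
                child = parent1[:cut] + parent2[cut:]
            else:
                child = parent1[:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)

        population = next_population
        best = max(population, key=fitness)
    return best


# Hypothetical stand-in for classifier accuracy: features 0, 2 and 5 are
# assumed to be useful, every other selected feature costs a small penalty.
INFORMATIVE = {0, 2, 5}


def toy_fitness(mask: Mask) -> float:
    return sum(1.0 if i in INFORMATIVE else -0.5
               for i, bit in enumerate(mask) if bit)


selected = ga_select(n_features=8, fitness=toy_fitness, seed=1)
print("selected feature mask:", selected)
```

In a real pipeline, the fitness function would instead train and validate the classifier on only the feature columns whose mask bit is 1 and return the validation score, which is considerably more expensive per evaluation than the toy scorer used here.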

Original language: English
Pages (from-to): 298-303
Number of pages: 6
Journal: International Review of Automatic Control
Volume: 9
Issue number: 5
DOI: 10.15866/ireaco.v9i5.9698
Publication status: Published - 2016

Fingerprint

Websites
Genetic algorithms
Industry
Telephone sets
Facsimile
Classifiers

Keywords

  • Feature selection
  • Genetic algorithm
  • Named entity recognition
  • Support vector machine
  • Web pages

ASJC Scopus subject areas

  • Control and Systems Engineering

Cite this

An optimized feature set based on genetic algorithm for business web pages named entity recognition. / Abdulghani, Mohammed Moath; Tiun, Sabrina.

In: International Review of Automatic Control, Vol. 9, No. 5, 2016, p. 298-303.

@article{d4b49da42aa24e598d047f78ad46a19b,
title = "An optimized feature set based on genetic algorithm for business web pages named entity recognition",
abstract = "Named Entity Recognition (NER) is the task of identifying proper nouns such as the names of people, corporations and places, as well as dates. Recently, extracting information from web pages has attracted researchers’ attention because of the valuable information such pages contain, most commonly named entities (NEs). However, web pages also contain further entity types such as addresses, URLs, and telephone and fax numbers. Given the variety of features that have been used for NER, identifying the best feature set for extracting NEs from web pages is a challenging task. This paper proposes an optimized feature set, selected with a genetic algorithm (GA), for extracting NEs from web pages. The dataset was collected from business web pages. The feature set consists of text features such as n-grams and web features such as block position and font type. Finally, an SVM classifier was used to classify the NEs. The results show that the genetic algorithm is able to identify the most accurate features.",
keywords = "Feature selection, Genetic algorithm, Named entity recognition, Support vector machine, Web pages",
author = "Abdulghani, {Mohammed Moath} and Tiun, Sabrina",
year = "2016",
doi = "10.15866/ireaco.v9i5.9698",
language = "English",
volume = "9",
pages = "298--303",
journal = "International Review of Automatic Control",
issn = "1974-6059",
publisher = "Praise Worthy Prize",
number = "5",

}

TY - JOUR

T1 - An optimized feature set based on genetic algorithm for business web pages named entity recognition

AU - Abdulghani, Mohammed Moath

AU - Tiun, Sabrina

PY - 2016

Y1 - 2016

N2 - Named Entity Recognition (NER) is the task of identifying proper nouns such as the names of people, corporations and places, as well as dates. Recently, extracting information from web pages has attracted researchers’ attention because of the valuable information such pages contain, most commonly named entities (NEs). However, web pages also contain further entity types such as addresses, URLs, and telephone and fax numbers. Given the variety of features that have been used for NER, identifying the best feature set for extracting NEs from web pages is a challenging task. This paper proposes an optimized feature set, selected with a genetic algorithm (GA), for extracting NEs from web pages. The dataset was collected from business web pages. The feature set consists of text features such as n-grams and web features such as block position and font type. Finally, an SVM classifier was used to classify the NEs. The results show that the genetic algorithm is able to identify the most accurate features.

KW - Feature selection

KW - Genetic algorithm

KW - Named entity recognition

KW - Support vector machine

KW - Web pages

UR - http://www.scopus.com/inward/record.url?scp=85006340122&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85006340122&partnerID=8YFLogxK

U2 - 10.15866/ireaco.v9i5.9698

DO - 10.15866/ireaco.v9i5.9698

M3 - Article

VL - 9

SP - 298

EP - 303

JO - International Review of Automatic Control

JF - International Review of Automatic Control

SN - 1974-6059

IS - 5

ER -