Wrapper approaches for web data extraction: A review

Mohd Amir Bin Mohd Azir, Kamsuriah Ahmad

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Relational databases are collections of structured data, normally arranged in rows and columns. However, most business data exist in unstructured form. Data extraction is the process of extracting unstructured, semi-structured, and structured data from web pages according to user requirements, at any level of automation. Web pages contain data regions that are laid out in a structured format. Manipulating and analyzing data with tools has always required massive server computing resources. This paper reviews existing data extraction techniques for heterogeneous data in the Big Data environment. The review discusses different data extraction approaches together with the basic tools and algorithms for extracting the desired data from various web sources. The approaches examined are Information Extraction Approaches, Automatic Wrapper Generation, Semi-Automatic Wrapper Generation, Wrapper Induction, and Wrapper Maintenance. Although many techniques for extracting data from web sources have been developed and tested, reviews of these techniques are still lacking. This paper reviews data extraction using wrapper approaches and compares them to identify the best approach for extracting data from online sites.
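The abstract does not describe any single algorithm, so the following is only an illustrative sketch of what a hand-written wrapper does: it locates a structured data region in a web page (here, an HTML table) and maps it to rows and columns. The class name `TableWrapper`, the function `extract_records`, and the sample page are hypothetical and are not taken from the paper; the sketch assumes the data region is a single table whose first row is a header.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hypothetical hand-written wrapper: collects each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # extracted rows, one list of strings per <tr>
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (headings, whitespace) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Return the data region as a list of dicts keyed by the header row."""
    wrapper = TableWrapper()
    wrapper.feed(html)
    wrapper.close()
    if not wrapper.rows:
        return []
    header, *body = wrapper.rows
    return [dict(zip(header, row)) for row in body]


# Assumed sample page; a real page would be fetched from a web source.
SAMPLE_PAGE = """
<html><body>
<h1>Catalogue</h1>
<table>
  <tr><th>title</th><th>price</th></tr>
  <tr><td>Widget</td><td>9.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
</body></html>
"""

records = extract_records(SAMPLE_PAGE)
for record in records:
    print(record)
```

A wrapper written by hand like this is tied to one page layout. Automatic and semi-automatic wrapper generation and wrapper induction aim to produce such extraction rules with less manual effort, and wrapper maintenance deals with keeping them working when the page layout changes.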

Original language: English
Title of host publication: Proceedings of the 2017 6th International Conference on Electrical Engineering and Informatics
Subtitle of host publication: Sustainable Society Through Digital Innovation, ICEEI 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-6
Number of pages: 6
Volume: 2017-November
ISBN (Electronic): 9781538604755
DOI: https://doi.org/10.1109/ICEEI.2017.8312458
Publication status: Published - 9 Mar 2018
Event: 6th International Conference on Electrical Engineering and Informatics, ICEEI 2017 - Langkawi, Malaysia
Duration: 25 Nov 2017 to 27 Nov 2017

Fingerprint

Wrapper
Information Storage and Retrieval
Automation
Internet
Maintenance
Databases
Websites
Servers
Review
Industry
Information Extraction
Relational Database
Proof by induction

Keywords

  • Extracted Data
  • Semantic Data
  • Unstructured Data
  • Web Data Extraction
  • Wrapper Algorithm

ASJC Scopus subject areas

  • Artificial Intelligence
  • Control and Optimization
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Information Systems
  • Software
  • Electrical and Electronic Engineering
  • Health Informatics

Cite this

Bin Mohd Azir, M. A., & Ahmad, K. (2018). Wrapper approaches for web data extraction: A review. In Proceedings of the 2017 6th International Conference on Electrical Engineering and Informatics: Sustainable Society Through Digital Innovation, ICEEI 2017 (Vol. 2017-November, pp. 1-6). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICEEI.2017.8312458

@inproceedings{3bffacb431a84c2d9c54ca4e7135214f,
title = "Wrapper approaches for web data extraction: A review",
keywords = "Extracted Data, Semantic Data, Unstructured Data, Web Data Extraction, Wrapper Algorithm",
author = "{Bin Mohd Azir}, {Mohd Amir} and Kamsuriah Ahmad",
year = "2018",
month = "3",
day = "9",
doi = "10.1109/ICEEI.2017.8312458",
language = "English",
volume = "2017-November",
pages = "1--6",
booktitle = "Proceedings of the 2017 6th International Conference on Electrical Engineering and Informatics",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - GEN
T1 - Wrapper approaches for web data extraction
T2 - A review
AU - Bin Mohd Azir, Mohd Amir
AU - Ahmad, Kamsuriah
PY - 2018/3/9
Y1 - 2018/3/9
KW - Extracted Data
KW - Semantic Data
KW - Unstructured Data
KW - Web Data Extraction
KW - Wrapper Algorithm
UR - http://www.scopus.com/inward/record.url?scp=85050820798&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85050820798&partnerID=8YFLogxK
U2 - 10.1109/ICEEI.2017.8312458
DO - 10.1109/ICEEI.2017.8312458
M3 - Conference contribution
AN - SCOPUS:85050820798
VL - 2017-November
SP - 1
EP - 6
BT - Proceedings of the 2017 6th International Conference on Electrical Engineering and Informatics
PB - Institute of Electrical and Electronics Engineers Inc.
ER -