Data extraction for search engine using safe matching

Jer Lang Hong, Ee Xion Tan, Wan Fariza Paizi@Fauzi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Our study shows that the algorithms used to check the similarity of data records affect the efficiency of a wrapper. A closer examination indicates that the accuracy of a wrapper can be improved if the DOM tree and the visual properties of data records are fully utilized. In this paper, we develop algorithms that check the similarity of data records based on the distinct tags and visual cues of their tree structure. We also develop a voting algorithm that can detect the similarity of data records in a relevant data region even when that region contains irrelevant information such as search identifiers. Together, these allow the wrapper to distinguish potential data regions more accurately and to eliminate a data region only when necessary. Experimental results show that our wrapper performs better than the state-of-the-art wrapper WISH and is highly effective in data extraction. The wrapper will be useful for meta search engine applications, which need an accurate tool to locate their sources of information.
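The abstract does not give the algorithms themselves, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that record similarity can be approximated by the Jaccard overlap of the distinct HTML tags in each record, and that "voting" means each record votes for the records it resembles. The function names, the thresholds, and the Jaccard measure are all hypothetical choices made for illustration; visual cues are not modelled at all.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the set of distinct tag names seen in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags via the default handle_startendtag.
        self.tags.add(tag)


def distinct_tags(fragment):
    """Return the distinct tag names of one candidate data record."""
    collector = TagCollector()
    collector.feed(fragment)
    collector.close()
    return frozenset(collector.tags)


def tag_similarity(a, b):
    """Jaccard overlap of two tag sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def vote_on_region(records, sim_threshold=0.6, vote_threshold=0.5):
    """Let every record vote for the records it resembles.

    Returns one boolean per record: True if the share of other records
    that resemble it reaches vote_threshold. Both thresholds are
    hypothetical values, not taken from the paper.
    """
    tag_sets = [distinct_tags(r) for r in records]
    n = len(tag_sets)
    kept = []
    for i in range(n):
        votes = sum(
            1
            for j in range(n)
            if j != i and tag_similarity(tag_sets[i], tag_sets[j]) >= sim_threshold
        )
        share = votes / (n - 1) if n > 1 else 0.0
        kept.append(share >= vote_threshold)
    return kept
```

Under these assumptions, a region holding three result-like records and one line such as "Results 1-10 of 200" would keep the three records and flag the odd line, so the region as a whole is retained rather than discarded because of one irrelevant entry.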

Original language: English
Title of host publication: AI 2011
Subtitle of host publication: Advances in Artificial Intelligence - 24th Australasian Joint Conference, Proceedings
Pages: 759-768
Number of pages: 10
DOIs: https://doi.org/10.1007/978-3-642-25832-9_77
Publication status: Published - 23 Dec 2011
Externally published: Yes
Event: 24th Australasian Joint Conference on Artificial Intelligence, AI 2011 - Perth, WA
Duration: 5 Dec 2011 → 8 Dec 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7106 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 24th Australasian Joint Conference on Artificial Intelligence, AI 2011
City: Perth, WA
Period: 5/12/11 → 8/12/11

Fingerprint

Search engines
Search Engine
Wrapper
Tree Structure
Voting
Eliminate
Distinct
Necessary
Experimental Results

Keywords

  • Automatic Wrapper
  • Information Extraction
  • Search Engines

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Hong, J. L., Tan, E. X., & Paizi@Fauzi, W. F. (2011). Data extraction for search engine using safe matching. In AI 2011: Advances in Artificial Intelligence - 24th Australasian Joint Conference, Proceedings (pp. 759-768). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7106 LNAI). https://doi.org/10.1007/978-3-642-25832-9_77

Data extraction for search engine using safe matching. / Hong, Jer Lang; Tan, Ee Xion; Paizi@Fauzi, Wan Fariza.

AI 2011: Advances in Artificial Intelligence - 24th Australasian Joint Conference, Proceedings. 2011. p. 759-768 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7106 LNAI).

Hong, JL, Tan, EX & Paizi@Fauzi, WF 2011, Data extraction for search engine using safe matching. in AI 2011: Advances in Artificial Intelligence - 24th Australasian Joint Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7106 LNAI, pp. 759-768, 24th Australasian Joint Conference on Artificial Intelligence, AI 2011, Perth, WA, 5/12/11. https://doi.org/10.1007/978-3-642-25832-9_77
Hong JL, Tan EX, Paizi@Fauzi WF. Data extraction for search engine using safe matching. In AI 2011: Advances in Artificial Intelligence - 24th Australasian Joint Conference, Proceedings. 2011. p. 759-768. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-642-25832-9_77
Hong, Jer Lang ; Tan, Ee Xion ; Paizi@Fauzi, Wan Fariza. / Data extraction for search engine using safe matching. AI 2011: Advances in Artificial Intelligence - 24th Australasian Joint Conference, Proceedings. 2011. pp. 759-768 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{d1bd85692b554a0fa007ad4eda6472c5,
title = "Data extraction for search engine using safe matching",
abstract = "Our study shows that algorithms used to check the similarity of data records affect the efficiency of a wrapper. A closer examination indicates that the accuracy of a wrapper can be improved if the DOM Tree and visual properties of data records can be fully utilized. In this paper, we develop algorithms to check the similarity of data records based on the distinct tags and visual cue of the tree structure of data records and the voting algorithm which can detect the similarity of data records of a relevant data region which may contain irrelevant information such as search identifiers to distinguish the potential data regions more correctly and eliminate data region only when necessary. Experimental results show that our wrapper performs better than state of the art wrapper WISH and it is highly effective in data extraction. This wrapper will be useful for meta search engine application, which needs an accurate tool to locate its source of information.",
keywords = "Automatic Wrapper, Information Extraction, Search Engines",
author = "Hong, {Jer Lang} and Tan, {Ee Xion} and Paizi@Fauzi, {Wan Fariza}",
year = "2011",
month = "12",
day = "23",
doi = "10.1007/978-3-642-25832-9_77",
language = "English",
isbn = "9783642258312",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "759--768",
booktitle = "AI 2011",

}

TY - GEN

T1 - Data extraction for search engine using safe matching

AU - Hong, Jer Lang

AU - Tan, Ee Xion

AU - Paizi@Fauzi, Wan Fariza

PY - 2011/12/23

Y1 - 2011/12/23

N2 - Our study shows that algorithms used to check the similarity of data records affect the efficiency of a wrapper. A closer examination indicates that the accuracy of a wrapper can be improved if the DOM Tree and visual properties of data records can be fully utilized. In this paper, we develop algorithms to check the similarity of data records based on the distinct tags and visual cue of the tree structure of data records and the voting algorithm which can detect the similarity of data records of a relevant data region which may contain irrelevant information such as search identifiers to distinguish the potential data regions more correctly and eliminate data region only when necessary. Experimental results show that our wrapper performs better than state of the art wrapper WISH and it is highly effective in data extraction. This wrapper will be useful for meta search engine application, which needs an accurate tool to locate its source of information.

AB - Our study shows that algorithms used to check the similarity of data records affect the efficiency of a wrapper. A closer examination indicates that the accuracy of a wrapper can be improved if the DOM Tree and visual properties of data records can be fully utilized. In this paper, we develop algorithms to check the similarity of data records based on the distinct tags and visual cue of the tree structure of data records and the voting algorithm which can detect the similarity of data records of a relevant data region which may contain irrelevant information such as search identifiers to distinguish the potential data regions more correctly and eliminate data region only when necessary. Experimental results show that our wrapper performs better than state of the art wrapper WISH and it is highly effective in data extraction. This wrapper will be useful for meta search engine application, which needs an accurate tool to locate its source of information.

KW - Automatic Wrapper

KW - Information Extraction

KW - Search Engines

UR - http://www.scopus.com/inward/record.url?scp=83755184383&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=83755184383&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-25832-9_77

DO - 10.1007/978-3-642-25832-9_77

M3 - Conference contribution

SN - 9783642258312

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 759

EP - 768

BT - AI 2011

ER -