Empirical evaluation of the link and content-based focused Treasure-Crawler

Ali Seyfi, Ahmed Patel, Joaquim Celestino Júnior

Research output: Contribution to journal (Article)

12 Citations (Scopus)

Abstract

Indexing the Web is becoming a laborious task for search engines as the Web grows exponentially in size and distribution. Presently, the most effective known approach to overcoming this problem is the use of focused crawlers. A focused crawler employs a dedicated algorithm to detect the pages on the Web that relate to its topic of interest. For this purpose, we propose a custom method that uses specific HTML elements of a page to predict the topical focus of every page reached by an unvisited link on the current page. The pages predicted to be on-topic are then ranked by their relevance to the crawler's main topic before they are actually downloaded. In the Treasure-Crawler, we use a hierarchical structure called T-Graph, which serves as an exemplary guide for assigning an appropriate priority score to each unvisited link. The URLs are later downloaded in order of this priority. This paper presents the implementation, test results and performance evaluation of the Treasure-Crawler system. The Treasure-Crawler is evaluated in terms of specific information retrieval criteria such as recall and precision, both with values close to 50%. Achieving such an outcome supports the significance of the proposed approach.
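
The priority-ordered download and the precision/recall evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the Treasure-Crawler implementation: the `Frontier` class, the `keyword_score` function (a simple term-overlap stand-in, not the T-Graph) and the example URLs are all hypothetical names introduced here for illustration only.

```python
import heapq
from typing import Iterable


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self) -> None:
        self._heap = []      # entries are (-score, insertion index, url)
        self._seen = set()   # URLs already queued, to avoid duplicates
        self._counter = 0    # tie-breaker: equal scores pop in insertion order

    def push(self, url: str, score: float) -> None:
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def keyword_score(link_text: str, topic_terms: set) -> float:
    """Assumed scoring rule: fraction of topic terms found in the link text."""
    if not topic_terms:
        return 0.0
    words = set(link_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> tuple:
    """Return (precision, recall) for a set of retrieved pages."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

In this sketch a crawler loop would score each unvisited link, `push` it onto the frontier, and repeatedly `pop` the next URL to download. With four retrieved pages of which two are among four relevant ones, `precision_recall` gives 0.5 for both measures, the same order of magnitude as the values reported in the abstract.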

Original language: English
Pages (from-to): 54-62
Number of pages: 9
Journal: Computer Standards and Interfaces
Volume: 44
DOI: 10.1016/j.csi.2015.09.007
Publication status: Published - 1 Feb 2016
Externally published: Yes

Keywords

  • Focused Web crawler
  • HTML data
  • Information retrieval
  • Search engine
  • T-Graph

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Law

Cite this

Empirical evaluation of the link and content-based focused Treasure-Crawler. / Seyfi, Ali; Patel, Ahmed; Celestino Júnior, Joaquim.

In: Computer Standards and Interfaces, Vol. 44, 01.02.2016, p. 54-62.

Seyfi, Ali ; Patel, Ahmed ; Celestino Júnior, Joaquim. / Empirical evaluation of the link and content-based focused Treasure-Crawler. In: Computer Standards and Interfaces. 2016 ; Vol. 44. pp. 54-62.
@article{eabd6ae563b444d19d59f3b5891d03ed,
title = "Empirical evaluation of the link and content-based focused Treasure-Crawler",
abstract = "Indexing the Web is becoming a laborious task for search engines as the Web exponentially grows in size and distribution. Presently, the most effective known approach to overcome this problem is the use of focused crawlers. A focused crawler employs a significant and unique algorithm in order to detect the pages on the Web that relate to its topic of interest. For this purpose we proposed a custom method that uses specific HTML elements of a page to predict the topical focus of all the pages that have an unvisited link within the current page. These recognized on-topic pages have to be sorted later based on their relevance to the main topic of the crawler for further actual downloads. In the Treasure-Crawler, we use a hierarchical structure called T-Graph which is an exemplary guide to assign appropriate priority score to each unvisited link. These URLs will later be downloaded based on this priority. This paper embodies the implementation, test results and performance evaluation of the Treasure-Crawler system. The Treasure-Crawler is evaluated in terms of specific information retrieval criteria such as recall and precision, both with values close to 50{\%}. Gaining such outcome asserts the significance of the proposed approach.",
keywords = "Focused Web crawler, HTML data, Information retrieval, Search engine, T-Graph",
author = "Ali Seyfi and Ahmed Patel and {Celestino J{\'u}nior}, Joaquim",
year = "2016",
month = "2",
day = "1",
doi = "10.1016/j.csi.2015.09.007",
language = "English",
volume = "44",
pages = "54--62",
journal = "Computer Standards and Interfaces",
issn = "0920-5489",
publisher = "Elsevier",

}

TY - JOUR
T1 - Empirical evaluation of the link and content-based focused Treasure-Crawler
AU - Seyfi, Ali
AU - Patel, Ahmed
AU - Celestino Júnior, Joaquim
PY - 2016/2/1
Y1 - 2016/2/1
N2 - Indexing the Web is becoming a laborious task for search engines as the Web exponentially grows in size and distribution. Presently, the most effective known approach to overcome this problem is the use of focused crawlers. A focused crawler employs a significant and unique algorithm in order to detect the pages on the Web that relate to its topic of interest. For this purpose we proposed a custom method that uses specific HTML elements of a page to predict the topical focus of all the pages that have an unvisited link within the current page. These recognized on-topic pages have to be sorted later based on their relevance to the main topic of the crawler for further actual downloads. In the Treasure-Crawler, we use a hierarchical structure called T-Graph which is an exemplary guide to assign appropriate priority score to each unvisited link. These URLs will later be downloaded based on this priority. This paper embodies the implementation, test results and performance evaluation of the Treasure-Crawler system. The Treasure-Crawler is evaluated in terms of specific information retrieval criteria such as recall and precision, both with values close to 50%. Gaining such outcome asserts the significance of the proposed approach.

KW - Focused Web crawler
KW - HTML data
KW - Information retrieval
KW - Search engine
KW - T-Graph
UR - http://www.scopus.com/inward/record.url?scp=84947809720&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84947809720&partnerID=8YFLogxK
U2 - 10.1016/j.csi.2015.09.007
DO - 10.1016/j.csi.2015.09.007
M3 - Article
AN - SCOPUS:84947809720
VL - 44
SP - 54
EP - 62
JO - Computer Standards and Interfaces
JF - Computer Standards and Interfaces
SN - 0920-5489
ER -