A focused crawler combinatory link and content model based on T-Graph principles

Ali Seyfi, Ahmed Patel

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

The two significant tasks of a focused Web crawler are finding relevant documents and prioritizing them for effective download. For the first task, we propose an algorithm to fetch and analyze the most effective HTML elements of the page to predict and elicit the topical focus of each unvisited page with high accuracy. For the second task, we propose a scoring function of the relevant URLs through the use of T-Graph to prioritize each unvisited link. Thus, our novel method uniquely combines these approaches, giving precision and recall values close to 50%, which indicate the significance of the proposed architecture.
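The two tasks named in the abstract (estimating topical relevance from selected HTML elements, then ordering unvisited links by a score) can be illustrated with a minimal sketch. This is not the authors' algorithm and does not implement T-Graph: the element weights, the term-overlap score, and the names ELEMENT_WEIGHTS, content_score and Frontier are all hypothetical, chosen only to show the general shape of a best-first focused crawler.

```python
import heapq
from html.parser import HTMLParser

# Hypothetical weights per HTML element; not taken from the paper.
ELEMENT_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.0}


class ElementTextParser(HTMLParser):
    """Collects lower-cased text found inside the weighted elements."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements of interest
        self.texts = []   # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in ELEMENT_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip().lower()
        if self.stack and text:
            self.texts.append((self.stack[-1], text))


def content_score(html, topic_terms):
    """Weighted count of topic terms in the selected elements (assumed measure)."""
    parser = ElementTextParser()
    parser.feed(html)
    score = 0.0
    for tag, text in parser.texts:
        overlap = set(text.split()) & topic_terms
        score += ELEMENT_WEIGHTS[tag] * len(overlap)
    return score


class Frontier:
    """Best-first queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def push(self, url, score):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a real crawler the score pushed for each link would come from whatever link and content model is in use; here content_score stands in for it, and Frontier.pop returns the next URL to download.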

Original language: English
Pages (from-to): 1-11
Number of pages: 11
Journal: Computer Standards and Interfaces
Volume: 43
DOIs: 10.1016/j.csi.2015.07.001
Publication status: Published - 1 Jan 2016
Externally published: Yes

Fingerprint

HTML
Websites
Values
Web crawler

Keywords

  • Focused Web crawler
  • HTML data
  • Information retrieval
  • Search engine
  • T-Graph

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Law

Cite this

A focused crawler combinatory link and content model based on T-Graph principles. / Seyfi, Ali; Patel, Ahmed.

In: Computer Standards and Interfaces, Vol. 43, 01.01.2016, p. 1-11.


@article{04978cfe887f4ef297788d78b3c9e252,
title = "A focused crawler combinatory link and content model based on T-Graph principles",
abstract = "The two significant tasks of a focused Web crawler are finding relevant documents and prioritizing them for effective download. For the first task, we propose an algorithm to fetch and analyze the most effective HTML elements of the page to predict and elicit the topical focus of each unvisited page with high accuracy. For the second task, we propose a scoring function of the relevant URLs through the use of T-Graph to prioritize each unvisited link. Thus, our novel method uniquely combines these approaches, giving precision and recall values close to 50{\%}, which indicate the significance of the proposed architecture.",
keywords = "Focused Web crawler, HTML data, Information retrieval, Search engine, T-Graph",
author = "Ali Seyfi and Ahmed Patel",
year = "2016",
month = "1",
day = "1",
doi = "10.1016/j.csi.2015.07.001",
language = "English",
volume = "43",
pages = "1--11",
journal = "Computer Standards and Interfaces",
issn = "0920-5489",
publisher = "Elsevier",
}

TY  - JOUR
T1  - A focused crawler combinatory link and content model based on T-Graph principles
AU  - Seyfi, Ali
AU  - Patel, Ahmed
PY  - 2016/1/1
Y1  - 2016/1/1
N2  - The two significant tasks of a focused Web crawler are finding relevant documents and prioritizing them for effective download. For the first task, we propose an algorithm to fetch and analyze the most effective HTML elements of the page to predict and elicit the topical focus of each unvisited page with high accuracy. For the second task, we propose a scoring function of the relevant URLs through the use of T-Graph to prioritize each unvisited link. Thus, our novel method uniquely combines these approaches, giving precision and recall values close to 50%, which indicate the significance of the proposed architecture.
AB  - The two significant tasks of a focused Web crawler are finding relevant documents and prioritizing them for effective download. For the first task, we propose an algorithm to fetch and analyze the most effective HTML elements of the page to predict and elicit the topical focus of each unvisited page with high accuracy. For the second task, we propose a scoring function of the relevant URLs through the use of T-Graph to prioritize each unvisited link. Thus, our novel method uniquely combines these approaches, giving precision and recall values close to 50%, which indicate the significance of the proposed architecture.
KW  - Focused Web crawler
KW  - HTML data
KW  - Information retrieval
KW  - Search engine
KW  - T-Graph
UR  - http://www.scopus.com/inward/record.url?scp=84943339191&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84943339191&partnerID=8YFLogxK
U2  - 10.1016/j.csi.2015.07.001
DO  - 10.1016/j.csi.2015.07.001
M3  - Article
AN  - SCOPUS:84943339191
VL  - 43
SP  - 1
EP  - 11
JO  - Computer Standards and Interfaces
JF  - Computer Standards and Interfaces
SN  - 0920-5489
ER  - 