Focused crawling of online business Web pages using latent semantic indexing approach

Thamer Salah, Sabrina Tiun

Research output: Contribution to journal (article)

1 Citation (Scopus)

Abstract

With the exponential growth of textual information available on the Internet, there is an emerging need to find relevant, timely and in-depth knowledge about business topics. The sheer volume of such data makes manually retrieving, analyzing and using the valuable information in these texts very difficult. In this paper, we address a challenging task: crawling business-specific knowledge on the Web. Our main goal is to describe a new method of focused crawling with latent semantic indexing for online business Web pages. We describe a new model for online business text crawling that seeks, acquires, maintains and filters business pages. The model consists of two main modules: a crawling system and a text filtering system. The crawler collects as many Web pages as possible from news websites. This focused crawler is guided by a latent semantic index and information from WordNet (the business filter), which learns to recognize the relevance of a Web page to the business topic, and it also uses a set of domain-specific keywords. Results on real-world online data show that the focused crawler is very effective for building high-quality collections of business Web documents.
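As an illustration of the text filtering module, the following is a minimal sketch of how a latent semantic index could score a crawled page against the business topic. It is not the authors' implementation: the tiny corpus, the example URLs, the class name `LsiRelevanceFilter`, the rank `k=2`, the 0.5 threshold and the whitespace tokenizer are all assumptions made for the example, and the WordNet and keyword components mentioned in the abstract are left out. The sketch obtains the latent space from the document Gram matrix by power iteration, which is equivalent to a truncated SVD of the term-document matrix for the leading dimensions.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer; an assumption, the paper's preprocessing is not given.
    return text.lower().split()


def dot_counts(a, b):
    # Dot product of two sparse term-count vectors.
    if len(a) > len(b):
        a, b = b, a
    return sum(count * b[term] for term, count in a.items())


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def top_eigenpairs(matrix, k, iters=300, seed=0):
    # Top-k eigenpairs of a symmetric PSD matrix: power iteration with deflation.
    n = len(matrix)
    work = [[float(x) for x in row] for row in matrix]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        wv = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * wv[i] for i in range(n))
        if eigval <= 1e-12:
            break
        pairs.append((eigval, v))
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigval * v[i] * v[j]
    return pairs


class LsiRelevanceFilter:
    """Hypothetical helper: scores pages by closeness to seed documents in LSI space."""

    def __init__(self, docs, seed_indices, k=2):
        self.docs = [Counter(tokenize(d)) for d in docs]
        # Gram matrix A^T A; its eigenvalues are the squared singular values of A.
        gram = [[dot_counts(a, b) for b in self.docs] for a in self.docs]
        self.pairs = top_eigenpairs(gram, k)
        # Training document i has latent coordinates (v_1[i], ..., v_k[i]).
        latent = [[v[i] for _, v in self.pairs] for i in seed_indices]
        self.centroid = [sum(col) / len(latent) for col in zip(*latent)]

    def project(self, text):
        # Fold a new page into the latent space: v_j . (A^T d) / sigma_j^2.
        counts = Counter(tokenize(text))
        overlap = [dot_counts(counts, d) for d in self.docs]
        return [sum(x * y for x, y in zip(v, overlap)) / eigval
                for eigval, v in self.pairs]

    def score(self, text):
        return cosine(self.project(text), self.centroid)


# Toy corpus (invented): documents 0-2 are business seeds, 3-4 are off-topic.
corpus = [
    "stock market shares investors profit",
    "company profit revenue market growth",
    "bank investors revenue shares company",
    "football match goal team player",
    "team player coach match season",
]
lsi = LsiRelevanceFilter(corpus, seed_indices=[0, 1, 2], k=2)

# Hypothetical crawled pages keyed by placeholder URLs.
pages = {
    "https://example.com/markets": "investors expect company profit growth",
    "https://example.com/sports": "team coach praised player after match",
    "https://example.com/weather": "weather forecast rain",
}
scores = {url: lsi.score(text) for url, text in pages.items()}
kept = [url for url, s in scores.items() if s >= 0.5]  # threshold is an assumption
```

In a crawler, a score like this could also be used to prioritise which outgoing links to follow first; the threshold and the rank `k` would need tuning on real data.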

Original language: English
Pages (from-to): 9229-9234
Number of pages: 6
Journal: ARPN Journal of Engineering and Applied Sciences
Volume: 11
Issue number: 15
Publication status: Published - 2016

Fingerprint

Websites
Semantics
Industry
Internet

Keywords

  • Business data mining
  • Classification
  • Focused crawling
  • Web mining

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Focused crawling of online business Web pages using latent semantic indexing approach. / Salah, Thamer; Tiun, Sabrina.

In: ARPN Journal of Engineering and Applied Sciences, Vol. 11, No. 15, 2016, p. 9229-9234.

@article{bf3be56e0a6c41179ea34c5dbf1325ad,
title = "Focused crawling of online business Web pages using latent semantic indexing approach",
abstract = "With the exponential growth of textual information available on the Internet, there is an emerging need to find relevant, timely and in-depth knowledge about business topics. The sheer volume of such data makes manually retrieving, analyzing and using the valuable information in these texts very difficult. In this paper, we address a challenging task: crawling business-specific knowledge on the Web. Our main goal is to describe a new method of focused crawling with latent semantic indexing for online business Web pages. We describe a new model for online business text crawling that seeks, acquires, maintains and filters business pages. The model consists of two main modules: a crawling system and a text filtering system. The crawler collects as many Web pages as possible from news websites. This focused crawler is guided by a latent semantic index and information from WordNet (the business filter), which learns to recognize the relevance of a Web page to the business topic, and it also uses a set of domain-specific keywords. Results on real-world online data show that the focused crawler is very effective for building high-quality collections of business Web documents.",
keywords = "Business data mining, Classification, Focused crawling, Web mining",
author = "Thamer Salah and Sabrina Tiun",
year = "2016",
language = "English",
volume = "11",
pages = "9229--9234",
journal = "ARPN Journal of Engineering and Applied Sciences",
issn = "1819-6608",
publisher = "Asian Research Publishing Network (ARPN)",
number = "15",

}

TY - JOUR

T1 - Focused crawling of online business Web pages using latent semantic indexing approach

AU - Salah, Thamer

AU - Tiun, Sabrina

PY - 2016

Y1 - 2016

AB - With the exponential growth of textual information available on the Internet, there is an emerging need to find relevant, timely and in-depth knowledge about business topics. The sheer volume of such data makes manually retrieving, analyzing and using the valuable information in these texts very difficult. In this paper, we address a challenging task: crawling business-specific knowledge on the Web. Our main goal is to describe a new method of focused crawling with latent semantic indexing for online business Web pages. We describe a new model for online business text crawling that seeks, acquires, maintains and filters business pages. The model consists of two main modules: a crawling system and a text filtering system. The crawler collects as many Web pages as possible from news websites. This focused crawler is guided by a latent semantic index and information from WordNet (the business filter), which learns to recognize the relevance of a Web page to the business topic, and it also uses a set of domain-specific keywords. Results on real-world online data show that the focused crawler is very effective for building high-quality collections of business Web documents.

KW - Business data mining

KW - Classification

KW - Focused crawling

KW - Web mining

UR - http://www.scopus.com/inward/record.url?scp=84983372410&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84983372410&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84983372410

VL - 11

SP - 9229

EP - 9234

JO - ARPN Journal of Engineering and Applied Sciences

JF - ARPN Journal of Engineering and Applied Sciences

SN - 1819-6608

IS - 15

ER -