Web document indexing and retrieval

Byurhan Hyusein, Ahmed Patel

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Web document indexing is an important part of every search engine (SE), and indexing quality has an overwhelming effect on retrieval effectiveness. A document index is a set of terms that indicates the content (topic) of a document and helps distinguish it from the other documents in the collection. An index that is too small can lead to poor results and may miss relevant items. An index that is too large retrieves many useful documents along with a significant number of irrelevant ones, and reduces search speed and the effectiveness of the search. Although the problem has been studied for many years, there is still no algorithm for finding the optimal index size and set of index terms. This paper shows how different attributes of a web document (namely Title, Anchor and Emphasize) contribute to average precision in the search process. The experiments are run on WT10g, a corpus of 1.69 million web pages.
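As an illustration of the two ideas the abstract relies on, the sketch below builds a small field-weighted term index and computes non-interpolated average precision for one ranked result list. It is not the authors' method: the field weights, the whitespace tokenizer, and the function names `index_document` and `average_precision` are assumptions made for this example, and the abstract does not state which weighting or evaluation variant the paper uses.

```python
from collections import Counter

# Hypothetical field weights; the abstract gives no values for these.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "emphasized": 1.5, "body": 1.0}


def index_document(fields):
    """Map each term to a weight summed over the fields it occurs in.

    `fields` maps a field name (e.g. "title") to that field's text.
    Terms are lowercased and split on whitespace (a simplification).
    """
    scores = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in text.lower().split():
            scores[term] += weight
    return dict(scores)


def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / len(relevant)


if __name__ == "__main__":
    doc = {"title": "web indexing", "body": "indexing terms"}
    print(index_document(doc))
    print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

With weights like these, a term that appears in the title counts more toward the index than the same term in the body, which is one simple way an attribute could influence ranking and hence average precision.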

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 573-579
Number of pages: 7
Volume: 2588
ISBN (Print): 3540005323
DOIs: https://doi.org/10.1007/3-540-36456-0_62
Publication status: Published - 2003
Externally published: Yes
Event: 4th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2003 - Mexico City, Mexico
Duration: 16 Feb 2003 to 22 Feb 2003

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2588
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2003
Country: Mexico
City: Mexico City
Period: 16/2/03 to 22/2/03

Fingerprint

Search engines
Anchors
Indexing
Retrieval
Experiments
Term
Search Engine
Attribute
Decrease
Experiment

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Hyusein, B., & Patel, A. (2003). Web document indexing and retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2588, pp. 573-579). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2588). Springer Verlag. https://doi.org/10.1007/3-540-36456-0_62

@inproceedings{0be0a7a958af4c5c8055356b4a082aa0,
title = "Web document indexing and retrieval",
abstract = "Web Document Indexing is an important part of every Search Engine (SE). Indexing quality has an overwhelming effect on retrieval effectiveness. A document index is a set of terms which show the contents (topic) of the document and helps in distinguishing a given document from other documents in the collection of documents. Small index size can lead to poor results and may miss some relevant items. Large index size allows retrieval of many useful documents along with a significant number of irrelevant ones and decreases the search speed and effectiveness of the searched item. Though the problem has been studied for many years there is still no algorithm to find the optimal index size and sets of index terms. This paper shows how different attributes of the web document (namely Title, Anchor and Emphasize) contribute to the average precision in the process of search. The experiments are done on the WT10g collection of a 1.69-million page corpus.",
author = "Byurhan Hyusein and Ahmed Patel",
year = "2003",
doi = "10.1007/3-540-36456-0_62",
language = "English",
isbn = "3540005323",
volume = "2588",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "573--579",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Web document indexing and retrieval

AU - Hyusein, Byurhan

AU - Patel, Ahmed

PY - 2003

Y1 - 2003

N2 - Web Document Indexing is an important part of every Search Engine (SE). Indexing quality has an overwhelming effect on retrieval effectiveness. A document index is a set of terms which show the contents (topic) of the document and helps in distinguishing a given document from other documents in the collection of documents. Small index size can lead to poor results and may miss some relevant items. Large index size allows retrieval of many useful documents along with a significant number of irrelevant ones and decreases the search speed and effectiveness of the searched item. Though the problem has been studied for many years there is still no algorithm to find the optimal index size and sets of index terms. This paper shows how different attributes of the web document (namely Title, Anchor and Emphasize) contribute to the average precision in the process of search. The experiments are done on the WT10g collection of a 1.69-million page corpus.

UR - http://www.scopus.com/inward/record.url?scp=84958074561&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84958074561&partnerID=8YFLogxK

U2 - 10.1007/3-540-36456-0_62

DO - 10.1007/3-540-36456-0_62

M3 - Conference contribution

AN - SCOPUS:84958074561

SN - 3540005323

VL - 2588

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 573

EP - 579

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

PB - Springer Verlag

ER -