Application of structured document parsing to focused web crawling

Ahmed Patel, Nikita Schmidt

    Research output: Contribution to journal (Article)

    14 Citations (Scopus)

    Abstract

    The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.
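
The idea in the abstract, a structure-aware parser feeding a separate download scheduler, can be illustrated with a minimal Python sketch. This is not the authors' robot or their interface: the element weights, the class names `StructureAwareParser` and `DownloadScheduler`, the scoring rule, and the `example.org` URLs are all assumptions made for illustration. The parser counts topic words in the text of selected HTML elements and hands (URL, score) pairs to a priority queue.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical weights: how much topic words in each element count.
ELEMENT_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.0}


class StructureAwareParser(HTMLParser):
    """Scores outgoing links using topic words found in weighted elements."""

    def __init__(self, base_url, topic_words):
        super().__init__()
        self.base_url = base_url
        self.topic_words = {w.lower() for w in topic_words}
        self.open_elements = []   # stack of currently open weighted tags
        self.page_score = 0.0     # score from title and heading text
        self.links = {}           # absolute URL -> score from anchor text
        self.current_href = None

    def handle_starttag(self, tag, attrs):
        if tag in ELEMENT_WEIGHTS:
            self.open_elements.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_href = urljoin(self.base_url, href)
                self.links.setdefault(self.current_href, 0.0)

    def handle_endtag(self, tag):
        if tag in self.open_elements:
            # Pop until the matching open tag has been removed.
            while self.open_elements and self.open_elements.pop() != tag:
                pass
        if tag == "a":
            self.current_href = None

    def handle_data(self, data):
        if not self.open_elements:
            return
        words = (w.strip(".,;:!?") for w in data.lower().split())
        hits = sum(1 for w in words if w in self.topic_words)
        innermost = self.open_elements[-1]
        score = hits * ELEMENT_WEIGHTS[innermost]
        if innermost == "a":
            if self.current_href:
                self.links[self.current_href] += score
        else:
            self.page_score += score

    def scored_links(self):
        # Each link inherits the page score plus its own anchor-text score.
        return [(url, self.page_score + s) for url, s in self.links.items()]


class DownloadScheduler:
    """Priority queue of URLs; knows nothing about HTML."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def offer(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # max-heap via negation

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None


page = """<html><head><title>Focused crawler design</title></head><body>
<a href="/crawler">crawler scheduling notes</a>
<a href="/cooking">pasta recipes</a></body></html>"""

parser = StructureAwareParser("http://example.org/", ["crawler"])
parser.feed(page)
scheduler = DownloadScheduler()
for url, score in parser.scored_links():
    scheduler.offer(url, score)
first = scheduler.next_url()
```

In this sketch the only contact point between the two components is `offer(url, score)`, which loosely mirrors the separation the abstract describes; a real interface would need to carry more than a single number.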

    Original language: English
    Pages (from-to): 325-331
    Number of pages: 7
    Journal: Computer Standards and Interfaces
    Volume: 33
    Issue number: 3
    DOI: 10.1016/j.csi.2010.08.002
    Publication status: Published - Mar 2011

    Fingerprint

    HTML
    Software agents
    Interfaces (computer)
    Robots
    robot
    Network protocols
    flexibility
    performance

    Keywords

    • Attribute
    • Focused web crawler
    • Information structure
    • Robot
    • Spider
    • Structural element
    • Topic-specific

    ASJC Scopus subject areas

    • Software
    • Hardware and Architecture
    • Law

    Cite this

    Application of structured document parsing to focused web crawling. / Patel, Ahmed; Schmidt, Nikita.

    In: Computer Standards and Interfaces, Vol. 33, No. 3, 03.2011, p. 325-331.

    Patel, Ahmed ; Schmidt, Nikita. / Application of structured document parsing to focused web crawling. In: Computer Standards and Interfaces. 2011 ; Vol. 33, No. 3. pp. 325-331.
    @article{9cfe116931604a4280ba6fe37308d0b3,
    title = "Application of structured document parsing to focused web crawling",
    abstract = "The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.",
    keywords = "Attribute, Focused web crawler, Information structure, Robot, Spider, Structural element, Topic-specific",
    author = "Ahmed Patel and Nikita Schmidt",
    year = "2011",
    month = "3",
    doi = "10.1016/j.csi.2010.08.002",
    language = "English",
    volume = "33",
    pages = "325--331",
    journal = "Computer Standards and Interfaces",
    issn = "0920-5489",
    publisher = "Elsevier",
    number = "3",
    }

    TY - JOUR
    T1 - Application of structured document parsing to focused web crawling
    AU - Patel, Ahmed
    AU - Schmidt, Nikita
    PY - 2011/3
    Y1 - 2011/3

    AB - The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.

    KW - Attribute
    KW - Focused web crawler
    KW - Information structure
    KW - Robot
    KW - Spider
    KW - Structural element
    KW - Topic-specific
    UR - http://www.scopus.com/inward/record.url?scp=78650239125&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=78650239125&partnerID=8YFLogxK
    U2 - 10.1016/j.csi.2010.08.002
    DO - 10.1016/j.csi.2010.08.002
    M3 - Article
    VL - 33
    SP - 325
    EP - 331
    JO - Computer Standards and Interfaces
    JF - Computer Standards and Interfaces
    SN - 0920-5489
    IS - 3
    ER -