Automatic recognition of document structure from PDF files

Rosmayati Mohemad, Abdul Razak Hamdan, Zulaiha Ali Othman, Noor Maizura Mohamad Noor

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Web-based technology disseminates and stores abundant information electronically. Today, people are more comfortable using documents in Portable Document Format (PDF) because the format is operating-system independent. However, information in read-only PDF documents is usable only by human readers. In addition, PDF has an untagged internal structure, which makes the extraction task difficult. Automatically analyzing and recognizing PDF document structures in detail, especially paragraph and tabular areas, is vital for extracting relevant information precisely for use in other domain applications. A combination of heuristic and rule-based approaches is proposed to automatically identify and recognize the structure of PDF documents. An experimental study was conducted using a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall and F-measure scores show significant results when detecting tabular and paragraph structures.
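
The abstract reports precision, recall and F-measure for detecting tabular and paragraph structures. As a minimal sketch of how such scores are conventionally computed from detection counts, the snippet below uses the standard definitions. The function name, the class name and the example counts are hypothetical and are not taken from the paper, which does not state its counts or its implementation here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionScores:
    """Standard retrieval-style scores for one structure type (e.g. tables)."""
    precision: float
    recall: float
    f_measure: float


def score_detection(true_positives: int,
                    false_positives: int,
                    false_negatives: int) -> DetectionScores:
    """Compute precision, recall and the balanced F-measure from raw counts.

    true_positives:  structures detected that are really present
    false_positives: structures detected that are not really present
    false_negatives: structures present in the document but missed
    """
    detected = true_positives + false_positives
    relevant = true_positives + false_negatives
    # Guard against division by zero when nothing was detected or nothing exists.
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / relevant if relevant else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return DetectionScores(precision, recall, f_measure)


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not results from the paper.
    print(score_detection(true_positives=8, false_positives=2, false_negatives=2))
```

Scoring tables and paragraphs separately, with one call per structure type, keeps a strong result on one type from masking a weak result on the other.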

Original language: English
Title of host publication: Communications in Computer and Information Science
Pages: 274-282
Number of pages: 9
Volume: 181 CCIS
Edition: PART 3
DOIs: https://doi.org/10.1007/978-3-642-22203-0_24
Publication status: Published - 2011
Event: 2nd International Conference on Software Engineering and Computer Systems, ICSECS 2011 - Kuantan
Duration: 27 Jun 2011 to 29 Jun 2011

Publication series

Name: Communications in Computer and Information Science
Number: PART 3
Volume: 181 CCIS
ISSN (Print): 1865-0929

Other

Other: 2nd International Conference on Software Engineering and Computer Systems, ICSECS 2011
City: Kuantan
Period: 27/6/11 to 29/6/11

Keywords

  • Document Analysis
  • Information Extraction
  • Portable Document Format

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Mohemad, R., Hamdan, A. R., Ali Othman, Z., & Mohamad Noor, N. M. (2011). Automatic recognition of document structure from PDF files. In Communications in Computer and Information Science (PART 3 ed., Vol. 181 CCIS, pp. 274-282). (Communications in Computer and Information Science; Vol. 181 CCIS, No. PART 3). https://doi.org/10.1007/978-3-642-22203-0_24

Automatic recognition of document structure from PDF files. / Mohemad, Rosmayati; Hamdan, Abdul Razak; Ali Othman, Zulaiha; Mohamad Noor, Noor Maizura.

Communications in Computer and Information Science. Vol. 181 CCIS PART 3. ed. 2011. p. 274-282 (Communications in Computer and Information Science; Vol. 181 CCIS, No. PART 3).

Mohemad, R, Hamdan, AR, Ali Othman, Z & Mohamad Noor, NM 2011, Automatic recognition of document structure from PDF files. in Communications in Computer and Information Science. PART 3 edn, vol. 181 CCIS, Communications in Computer and Information Science, no. PART 3, vol. 181 CCIS, pp. 274-282, 2nd International Conference on Software Engineering and Computer Systems, ICSECS 2011, Kuantan, 27/6/11. https://doi.org/10.1007/978-3-642-22203-0_24
Mohemad R, Hamdan AR, Ali Othman Z, Mohamad Noor NM. Automatic recognition of document structure from PDF files. In Communications in Computer and Information Science. PART 3 ed. Vol. 181 CCIS. 2011. p. 274-282. (Communications in Computer and Information Science; PART 3). https://doi.org/10.1007/978-3-642-22203-0_24
Mohemad, Rosmayati ; Hamdan, Abdul Razak ; Ali Othman, Zulaiha ; Mohamad Noor, Noor Maizura. / Automatic recognition of document structure from PDF files. Communications in Computer and Information Science. Vol. 181 CCIS PART 3. ed. 2011. pp. 274-282 (Communications in Computer and Information Science; PART 3).
@inproceedings{72942637c2a3484b839c21044c303f73,
title = "Automatic recognition of document structure from PDF files",
abstract = "Web-based technology disseminates and stores abundant information electronically. Today, people are more comfortable using documents in Portable Document Format (PDF) because the format is operating-system independent. However, information in read-only PDF documents is usable only by human readers. In addition, PDF has an untagged internal structure, which makes the extraction task difficult. Automatically analyzing and recognizing PDF document structures in detail, especially paragraph and tabular areas, is vital for extracting relevant information precisely for use in other domain applications. A combination of heuristic and rule-based approaches is proposed to automatically identify and recognize the structure of PDF documents. An experimental study was conducted using a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall and F-measure scores show significant results when detecting tabular and paragraph structures.",
keywords = "Document Analysis, Information Extraction, Portable Document Format",
author = "Mohemad, Rosmayati and Hamdan, {Abdul Razak} and {Ali Othman}, Zulaiha and {Mohamad Noor}, {Noor Maizura}",
year = "2011",
doi = "10.1007/978-3-642-22203-0_24",
language = "English",
isbn = "9783642222023",
volume = "181 CCIS",
series = "Communications in Computer and Information Science",
number = "PART 3",
pages = "274--282",
booktitle = "Communications in Computer and Information Science",
edition = "PART 3",

}

TY - GEN

T1 - Automatic recognition of document structure from PDF files

AU - Mohemad, Rosmayati

AU - Hamdan, Abdul Razak

AU - Ali Othman, Zulaiha

AU - Mohamad Noor, Noor Maizura

PY - 2011

Y1 - 2011

N2 - Web-based technology disseminates and stores abundant information electronically. Today, people are more comfortable using documents in Portable Document Format (PDF) because the format is operating-system independent. However, information in read-only PDF documents is usable only by human readers. In addition, PDF has an untagged internal structure, which makes the extraction task difficult. Automatically analyzing and recognizing PDF document structures in detail, especially paragraph and tabular areas, is vital for extracting relevant information precisely for use in other domain applications. A combination of heuristic and rule-based approaches is proposed to automatically identify and recognize the structure of PDF documents. An experimental study was conducted using a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall and F-measure scores show significant results when detecting tabular and paragraph structures.

KW - Document Analysis

KW - Information Extraction

KW - Portable Document Format

UR - http://www.scopus.com/inward/record.url?scp=79960365491&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79960365491&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-22203-0_24

DO - 10.1007/978-3-642-22203-0_24

M3 - Conference contribution

AN - SCOPUS:79960365491

SN - 9783642222023

VL - 181 CCIS

T3 - Communications in Computer and Information Science

SP - 274

EP - 282

BT - Communications in Computer and Information Science

ER -