Automatic multi-document Arabic text summarization using clustering and keyphrase extraction

Hamzah Noori Fejer, Nazlia Omar

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

Automatic text summarization has become important because the rapid growth of textual information makes it very difficult for humans to summarize large document collections manually. Forming an ideal summary requires a full understanding of the document, but such understanding is difficult or impossible for computers. Therefore, the most common technique in automated text summarization is to select important sentences from the original text and present them as the summary. Arabic natural language processing lacks the tools and resources needed to advance research in Arabic text summarization, and the field has received little research attention. Arabic text summarization systems still suffer from low accuracy because they rely on simple summarization techniques. The aim of this research is to improve Arabic text summarization by using clustering and keyphrase extraction. This study proposes a combined clustering method to group Arabic documents into several clusters. A keyphrase extraction module is then applied to extract important keyphrases from each cluster, which helps to identify the most important sentences and to find similar sentences using several similarity algorithms. These algorithms are used to extract one sentence from each group of similar sentences while ignoring the others. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics were used for the evaluation, and the DUC2002 corpus was used as the summarization dataset. The model achieved an accuracy of 43.4%. The experiments show that the proposed model performs better than other work.
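
The redundancy-removal and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the similarity algorithms, so bag-of-words cosine similarity is assumed here, and the function names, the whitespace tokenizer, and the 0.7 threshold are all hypothetical. The ROUGE function shows only unigram recall (ROUGE-1), one member of the ROUGE family.

```python
from collections import Counter
import math


def tokens(sentence):
    # Naive whitespace tokenization (an assumption); real Arabic text
    # would need normalization and stemming, which this sketch omits.
    return sentence.lower().split()


def cosine(a, b):
    # Cosine similarity between two bag-of-words count vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant(sentences, threshold=0.7):
    # Keep a sentence only if it is not too similar to one already kept,
    # so each group of similar sentences contributes a single sentence.
    kept = []
    for s in sentences:
        if all(cosine(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


def rouge1_recall(candidate, reference):
    # Share of reference unigrams (with clipped counts) that also
    # appear in the candidate summary.
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

For example, given the sentences "the cat sat on the mat", "the cat sat on a mat", and "stocks fell sharply today", `drop_redundant` keeps the first and third, because the second is a near-duplicate of the first under the assumed threshold.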

Original language: English
Pages (from-to): 1-9
Number of pages: 9
Journal: Journal of Artificial Intelligence
Volume: 8
Issue number: 1
DOI: 10.3923/jai.2015.1.9
Publication status: Published - 2015

Keywords

  • Automatic text summarization
  • Clustering
  • Keyphrase extraction
  • Multi-document
  • ROUGE metrics
  • Similarity

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software

Cite this

Automatic multi-document Arabic text summarization using clustering and keyphrase extraction. / Fejer, Hamzah Noori; Omar, Nazlia.

In: Journal of Artificial Intelligence, Vol. 8, No. 1, 2015, p. 1-9.

@article{0fbb7574aaca4422b5e9ab087942f1cd,
title = "Automatic multi-document Arabic text summarization using clustering and keyphrase extraction",
abstract = "Automatic text summarization has become important because the rapid growth of textual information makes it very difficult for humans to summarize large document collections manually. Forming an ideal summary requires a full understanding of the document, but such understanding is difficult or impossible for computers. Therefore, the most common technique in automated text summarization is to select important sentences from the original text and present them as the summary. Arabic natural language processing lacks the tools and resources needed to advance research in Arabic text summarization, and the field has received little research attention. Arabic text summarization systems still suffer from low accuracy because they rely on simple summarization techniques. The aim of this research is to improve Arabic text summarization by using clustering and keyphrase extraction. This study proposes a combined clustering method to group Arabic documents into several clusters. A keyphrase extraction module is then applied to extract important keyphrases from each cluster, which helps to identify the most important sentences and to find similar sentences using several similarity algorithms. These algorithms are used to extract one sentence from each group of similar sentences while ignoring the others. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics were used for the evaluation, and the DUC2002 corpus was used as the summarization dataset. The model achieved an accuracy of 43.4{\%}. The experiments show that the proposed model performs better than other work.",
keywords = "Automatic text summarization, Clustering, Keyphrase extraction, Multi-document, ROUGE metrics, Similarity",
author = "Fejer, {Hamzah Noori} and Omar, Nazlia",
year = "2015",
doi = "10.3923/jai.2015.1.9",
language = "English",
volume = "8",
pages = "1--9",
journal = "Journal of Artificial Intelligence",
issn = "1994-5450",
publisher = "Asian Network for Scientific Information",
number = "1",

}

TY - JOUR

T1 - Automatic multi-document Arabic text summarization using clustering and keyphrase extraction

AU - Fejer, Hamzah Noori

AU - Omar, Nazlia

PY - 2015

Y1 - 2015

N2 - Automatic text summarization has become important because the rapid growth of textual information makes it very difficult for humans to summarize large document collections manually. Forming an ideal summary requires a full understanding of the document, but such understanding is difficult or impossible for computers. Therefore, the most common technique in automated text summarization is to select important sentences from the original text and present them as the summary. Arabic natural language processing lacks the tools and resources needed to advance research in Arabic text summarization, and the field has received little research attention. Arabic text summarization systems still suffer from low accuracy because they rely on simple summarization techniques. The aim of this research is to improve Arabic text summarization by using clustering and keyphrase extraction. This study proposes a combined clustering method to group Arabic documents into several clusters. A keyphrase extraction module is then applied to extract important keyphrases from each cluster, which helps to identify the most important sentences and to find similar sentences using several similarity algorithms. These algorithms are used to extract one sentence from each group of similar sentences while ignoring the others. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics were used for the evaluation, and the DUC2002 corpus was used as the summarization dataset. The model achieved an accuracy of 43.4%. The experiments show that the proposed model performs better than other work.

KW - Automatic text summarization

KW - Clustering

KW - Keyphrase extraction

KW - Multi-document

KW - ROUGE metrics

KW - Similarity

UR - http://www.scopus.com/inward/record.url?scp=84947919285&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84947919285&partnerID=8YFLogxK

U2 - 10.3923/jai.2015.1.9

DO - 10.3923/jai.2015.1.9

M3 - Article

AN - SCOPUS:84947919285

VL - 8

SP - 1

EP - 9

JO - Journal of Artificial Intelligence

JF - Journal of Artificial Intelligence

SN - 1994-5450

IS - 1

ER -