Malay text features for automatic news headline generation

Research output: Contribution to journal › Article

Abstract

The diversity of ways in which natural language can represent meaning in documents is one of the causes of information overload in information retrieval. Headline generation is an automatic text summarization technique that can mitigate this problem. This paper reports an experimental study to determine characteristics of the Malay language in news documents. A Malay news corpus comprising 140 news documents was drawn from the BERNAMA news archive. The selection was limited to hard news items of 50 to 250 words, published between 2007 and 2012, in the economy, crime, education, or sports genres only. Three Malay linguistic experts were selected to manually produce a reference headline for each news document. The experimental results identify three characteristics. First, the first two sentences of a news document are suitable candidates for the most important sentences; second, sentences that contain an acronym definition also have the potential to become the most important sentences; and third, the ideal length of a headline is six words. Taking these characteristics into account can help generate intelligent headlines for Malay news.

Original language: English
Pages (from-to): 36-41
Number of pages: 6
Journal: Journal of Theoretical and Applied Information Technology
Volume: 76
Issue number: 1
Publication status: Published - 2015

Keywords

  • Headline generation
  • Malay news
  • Text summarization

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Malay text features for automatic news headline generation. / Hasan, Mohd Sabri; Mohd Noah, Shahrul Azman; Mohamad Ali, Nazlena.

In: Journal of Theoretical and Applied Information Technology, Vol. 76, No. 1, 2015, p. 36-41.

@article{8797e851556d4b8383383419f3905d18,
title = "Malay text features for automatic news headline generation",
abstract = "The diversity of natural language for meaning representation in documents is one of the causes of information overload in information retrieval. Headline generation is an automatic text summarization technique that can reduce or address such a problem. This research discusses an experimental study on the determination of Malay language characteristics from news genre documents. A Malay news corpus comprising 140 news documents was chosen from the BERNAMA news archive. The selection criteria were limited to hard news, a word count of 50 to 250 words, published between 2007 and 2012, and with news genres of economy, crime, education, or sports only. Three Malay linguistic experts were selected to produce a reference headline for each news document manually. Experiment results identify three characteristics. First, the first two sentences of a news document are suitable candidates for the most important sentences; second, sentences that contain an acronym definition also have the potential to become the most important sentences; and third, the ideal length of a headline is six words. Considering these characteristics will generate intelligent headlines for Malay news.",
keywords = "Headline generation, Malay news, Text summarization",
author = "Hasan, {Mohd Sabri} and {Mohd Noah}, {Shahrul Azman} and {Mohamad Ali}, Nazlena",
year = "2015",
language = "English",
volume = "76",
pages = "36--41",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "1",
}

TY - JOUR

T1 - Malay text features for automatic news headline generation

AU - Hasan, Mohd Sabri

AU - Mohd Noah, Shahrul Azman

AU - Mohamad Ali, Nazlena

PY - 2015

Y1 - 2015

N2 - The diversity of natural language for meaning representation in documents is one of the causes of information overload in information retrieval. Headline generation is an automatic text summarization technique that can reduce or address such a problem. This research discusses an experimental study on the determination of Malay language characteristics from news genre documents. A Malay news corpus comprising 140 news documents was chosen from the BERNAMA news archive. The selection criteria were limited to hard news, a word count of 50 to 250 words, published between 2007 and 2012, and with news genres of economy, crime, education, or sports only. Three Malay linguistic experts were selected to produce a reference headline for each news document manually. Experiment results identify three characteristics. First, the first two sentences of a news document are suitable candidates for the most important sentences; second, sentences that contain an acronym definition also have the potential to become the most important sentences; and third, the ideal length of a headline is six words. Considering these characteristics will generate intelligent headlines for Malay news.

KW - Headline generation

KW - Malay news

KW - Text summarization

UR - http://www.scopus.com/inward/record.url?scp=84930697495&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84930697495&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84930697495

VL - 76

SP - 36

EP - 41

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 1

ER -