Penjanaan ringkasan isi utama berita bahasa melayu berdasarkan ciri kata

Translated title of the contribution: Generation of news headlines for the Malay language based on term features

Research output: Contribution to journal - Article

Abstract

Headline generation is an information extraction process that produces a single sentence representing the content of a text. For the Malay language, research in this area is limited to machine translation approaches. This study is divided into three phases: analysis of news discourse, development of a headline generation technique, and evaluation of the quality of the generated headlines. The study aims to generate headlines using statistical and linguistic methods. The statistical method is used to identify significant words and sentences based on a term weighting approach. The linguistic method is used to increase its precision. A collection of 140 news articles and their corresponding model headlines was constructed. Analysis of the news collection shows that the main idea of a written text can be identified based on four characteristics: word location in sentences, sentence location in the text, acronym word types, and words that represent person names. Significant words carrying the main idea of a written text are determined from their weighted values, which combine word frequency and word location in sentences. The first two sentences are suitable candidates for recognising the important sentences in a text. Results showed a mean of 82.9% for important sentence recognition; the mean quality scores of the generated headlines were 0.3194 (precision), 0.5656 (recall), 0.4012 (F-measure), 0.5656 (ROUGE–N), 0.3392 (ROUGE–L), 0.1186 (ROUGE–W) and 0.1232 (ROUGE–S). In conclusion, taking language factors into account in the headline generation technique can produce quality headlines with a higher degree of fidelity than the benchmarks it was compared against.
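
The abstract does not give the exact weighting formula, so the following is only a minimal sketch of the general idea it describes: a word's weight combines its frequency with its location in sentences, and candidate words are drawn from the first two sentences. The function names, the `1 / (1 + index)` position score, and the multiplication of the two factors are all assumptions made for illustration, not the authors' method. Stop-word removal, acronym handling and person-name detection are omitted.

```python
import re
from collections import Counter

# The abstract reports the first two sentences as suitable candidates.
LEAD_SENTENCES = 2


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def word_weights(sentences):
    """Hypothetical weight: frequency in the text times a position score.

    The position score is 1 / (1 + index), so words nearer the start of
    a sentence score higher; a word keeps its best score across sentences.
    """
    tokens_per_sentence = [tokenize(s) for s in sentences]
    freq = Counter(t for toks in tokens_per_sentence for t in toks)
    position = {}
    for toks in tokens_per_sentence:
        for i, t in enumerate(toks):
            position[t] = max(position.get(t, 0.0), 1.0 / (1 + i))
    return {t: freq[t] * position[t] for t in freq}


def top_lead_words(sentences, k=5):
    """Return the k highest-weighted words from the lead sentences,
    listed in the order in which they first appear in the text."""
    weights = word_weights(sentences)
    lead_tokens = []
    for s in sentences[:LEAD_SENTENCES]:
        for t in tokenize(s):
            if t not in lead_tokens:
                lead_tokens.append(t)
    ranked = sorted(lead_tokens, key=lambda t: weights[t], reverse=True)[:k]
    keep = set(ranked)
    return [t for t in lead_tokens if t in keep]
```

Calling `top_lead_words(sentences, k=5)` on a list of sentence strings returns up to five candidate headline words. A real system would still need the linguistic step the abstract mentions to turn such words into a well-formed headline.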

Original language: Malay
Pages (from-to): 42-59
Number of pages: 18
Journal: GEMA Online Journal of Language Studies
Volume: 18
Issue number: 4
DOI: 10.17576/gema-2018-1804-04
Publication status: Published - 1 Nov 2018

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Literature and Literary Theory

Cite this

Penjanaan ringkasan isi utama berita bahasa melayu berdasarkan ciri kata. / Mohd Noah, Shahrul Azman; Mohamad Ali, Nazlena; Hasan, Mohd Sabri.

In: GEMA Online Journal of Language Studies, Vol. 18, No. 4, 01.11.2018, p. 42-59.

@article{1b43e1366dd84de2a745b1108f353fb9,
title = "Penjanaan ringkasan isi utama berita bahasa melayu berdasarkan ciri kata",
abstract = "Headline generation is an information extraction process that produces a single sentence representing the content of a text. For the Malay language, research in this area is limited to machine translation approaches. This study is divided into three phases: analysis of news discourse, development of a headline generation technique, and evaluation of the quality of the generated headlines. The study aims to generate headlines using statistical and linguistic methods. The statistical method is used to identify significant words and sentences based on a term weighting approach. The linguistic method is used to increase its precision. A collection of 140 news articles and their corresponding model headlines was constructed. Analysis of the news collection shows that the main idea of a written text can be identified based on four characteristics: word location in sentences, sentence location in the text, acronym word types, and words that represent person names. Significant words carrying the main idea of a written text are determined from their weighted values, which combine word frequency and word location in sentences. The first two sentences are suitable candidates for recognising the important sentences in a text. Results showed a mean of 82.9{\%} for important sentence recognition; the mean quality scores of the generated headlines were 0.3194 (precision), 0.5656 (recall), 0.4012 (F-measure), 0.5656 (ROUGE–N), 0.3392 (ROUGE–L), 0.1186 (ROUGE–W) and 0.1232 (ROUGE–S). In conclusion, taking language factors into account in the headline generation technique can produce quality headlines with a higher degree of fidelity than the benchmarks it was compared against.",
keywords = "Malay news article, Term features, Text summarisation, Unsupervised approach",
author = "{Mohd Noah}, {Shahrul Azman} and {Mohamad Ali}, Nazlena and Hasan, {Mohd Sabri}",
year = "2018",
month = "11",
day = "1",
doi = "10.17576/gema-2018-1804-04",
language = "Malay",
volume = "18",
pages = "42--59",
journal = "GEMA Online Journal of Language Studies",
issn = "1675-8021",
publisher = "Universiti Kebangsaan Malaysia",
number = "4",

}

TY  - JOUR
T1  - Penjanaan ringkasan isi utama berita bahasa melayu berdasarkan ciri kata
AU  - Mohd Noah, Shahrul Azman
AU  - Mohamad Ali, Nazlena
AU  - Hasan, Mohd Sabri
PY  - 2018/11/1
Y1  - 2018/11/1
AB  - Headline generation is an information extraction process that produces a single sentence representing the content of a text. For the Malay language, research in this area is limited to machine translation approaches. This study is divided into three phases: analysis of news discourse, development of a headline generation technique, and evaluation of the quality of the generated headlines. The study aims to generate headlines using statistical and linguistic methods. The statistical method is used to identify significant words and sentences based on a term weighting approach. The linguistic method is used to increase its precision. A collection of 140 news articles and their corresponding model headlines was constructed. Analysis of the news collection shows that the main idea of a written text can be identified based on four characteristics: word location in sentences, sentence location in the text, acronym word types, and words that represent person names. Significant words carrying the main idea of a written text are determined from their weighted values, which combine word frequency and word location in sentences. The first two sentences are suitable candidates for recognising the important sentences in a text. Results showed a mean of 82.9% for important sentence recognition; the mean quality scores of the generated headlines were 0.3194 (precision), 0.5656 (recall), 0.4012 (F-measure), 0.5656 (ROUGE–N), 0.3392 (ROUGE–L), 0.1186 (ROUGE–W) and 0.1232 (ROUGE–S). In conclusion, taking language factors into account in the headline generation technique can produce quality headlines with a higher degree of fidelity than the benchmarks it was compared against.
KW  - Malay news article
KW  - Term features
KW  - Text summarisation
KW  - Unsupervised approach
UR  - http://www.scopus.com/inward/record.url?scp=85057768205&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85057768205&partnerID=8YFLogxK
U2  - 10.17576/gema-2018-1804-04
DO  - 10.17576/gema-2018-1804-04
M3  - Article
VL  - 18
SP  - 42
EP  - 59
JO  - GEMA Online Journal of Language Studies
JF  - GEMA Online Journal of Language Studies
SN  - 1675-8021
IS  - 4
ER  -