Abstract
Headline generation is an information extraction process that produces a single sentence representing the content of a text. For the Malay language, research in this area has been limited to machine translation approaches. This study is divided into three phases: analysis of news discourse, development of a headline generation technique, and evaluation of the quality of the generated headlines. The study aims to generate headlines using statistical and linguistic methods. The statistical method identifies significant words and sentences using a term weighting approach, and the linguistic method is used to make the result more precise. A collection of 140 news articles and their corresponding model headlines was constructed. Analysis of the news collection shows that the main idea of a written text can be identified from four characteristics: word location in sentences, sentence location in texts, acronyms, and words that represent person names. Significant words carrying the main idea of a text are determined from their weighted values, which combine word frequency and word location in sentences. The first two sentences of a text are suitable candidates for recognising its important sentences. Results showed a mean of 82.9% for important sentence recognition, and a mean quality of generated headlines of 0.3194 (precision), 0.5656 (recall), 0.4012 (F-measure), 0.5656 (ROUGE-N), 0.3392 (ROUGE-L), 0.1186 (ROUGE-W) and 0.1232 (ROUGE-S). In conclusion, taking language factors into account in the headline generation technique produces quality headlines with a higher degree of fidelity than the benchmarks used for comparison.
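The abstract describes the weighting only in words, so the following is a minimal sketch of the general idea rather than the authors' actual formula. It assumes a word's weight is its frequency plus a position bonus that is highest at the start of a sentence, and it then picks the better of the first two sentences by summed weight. The function names, the position formula, and the toy tokens are all hypothetical.

```python
from collections import Counter


def term_weights(sentences):
    """Weight each word by corpus frequency plus a position bonus.

    `sentences` is a list of token lists. The position bonus
    (1.0 for the first word, decreasing towards 0) is an assumed
    formula, not the one used in the study.
    """
    freq = Counter(word for sent in sentences for word in sent)
    weights = {}
    for sent in sentences:
        n = len(sent)
        for i, word in enumerate(sent):
            position_bonus = 1.0 - i / n
            score = freq[word] + position_bonus
            # Keep the best score a word reaches anywhere in the text.
            weights[word] = max(weights.get(word, 0.0), score)
    return weights


def pick_lead_sentence(sentences, weights, k=2):
    """Return the index of the highest-scoring sentence among the first k."""
    candidates = range(min(k, len(sentences)))
    return max(candidates, key=lambda i: sum(weights[w] for w in sentences[i]))


# Toy, already-tokenised text (hypothetical data).
text = [
    ["menteri", "umum", "bajet"],
    ["bajet", "baharu", "bantu", "rakyat"],
    ["rakyat", "sambut", "baik"],
]
weights = term_weights(text)
best = pick_lead_sentence(text, weights)
```

Summing weights favours longer sentences; dividing by sentence length would be one way to offset that. Stop-word removal and the acronym and person-name features mentioned in the abstract are left out of this sketch.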
Original language | Malay |
---|---|
Pages (from-to) | 42-59 |
Number of pages | 18 |
Journal | GEMA Online Journal of Language Studies |
Volume | 18 |
Issue number | 4 |
DOIs | 10.17576/gema-2018-1804-04 |
Publication status | Published - 1 Nov 2018 |
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language
- Literature and Literary Theory
Cite this
Penjanaan ringkasan isi utama berita bahasa melayu berdasarkan ciri kata. / Mohd Noah, Shahrul Azman; Mohamad Ali, Nazlena; Hasan, Mohd Sabri.
In: GEMA Online Journal of Language Studies, Vol. 18, No. 4, 01.11.2018, p. 42-59.
Research output: Contribution to journal › Article
TY - JOUR
T1 - Penjanaan ringkasan isi utama berita bahasa melayu berdasarkan ciri kata
AU - Mohd Noah, Shahrul Azman
AU - Mohamad Ali, Nazlena
AU - Hasan, Mohd Sabri
PY - 2018/11/1
Y1 - 2018/11/1
KW - Malay news article
KW - Term features
KW - Text summarisation
KW - Unsupervised approach
UR - http://www.scopus.com/inward/record.url?scp=85057768205&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057768205&partnerID=8YFLogxK
U2 - 10.17576/gema-2018-1804-04
DO - 10.17576/gema-2018-1804-04
M3 - Article
AN - SCOPUS:85057768205
VL - 18
SP - 42
EP - 59
JO - GEMA Online Journal of Language Studies
JF - GEMA Online Journal of Language Studies
SN - 1675-8021
IS - 4
ER -