Pendekatan teknik pengecaman entiti nama bagi capaian berita jenayah bahasa melayu

Translated title of the contribution: Named entity recognition approach for Malay crime news retrieval

Saidah Saad, Mohamed Kamil Mansor

Research output: Contribution to journal (Article)

Abstract

Information extraction is the process of identifying the key concepts that represent the textual content of unstructured documents. Unstructured documents are now abundant: news, articles, blogs, forums, tweets and social-network micro-blogs. Such documents are very difficult for a computer to interpret, so research on information extraction is important for overcoming this problem. One widely used extraction technique is named entity recognition. This research applies named entity recognition to crime news documents in the Malay language. The main objective of the study is to develop a prototype system for extracting information from Malay-language crime news using named entity recognition with a rule-based approach. For the evaluation, a corpus of Malay-language crime news was built from an archival source, BERNAMA news. The corpus was examined manually by linguists to identify entities such as name, organization, location, date, time, financial, percentage, crime and weapons. In parallel, a prototype system was developed and tested on the same corpus, and its results were compared with the expert annotations. Overall, the tests showed good results, with recall of 78.67%, precision of 71.11% and an F-measure of 74.7%. The results are expected to contribute knowledge about the effectiveness of named entity recognition techniques for Malay-language crime news. This could in turn help investigators, police, lawyers and other authorities working in the field of crime to solve crimes more quickly and effectively.
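The three scores reported in the abstract can be cross-checked against each other. The sketch below assumes the F-measure is the balanced one (F1, the harmonic mean of precision and recall); the abstract does not state which weighting was used, and the function name is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: equal weighting of precision and recall; the abstract
    does not say which F-measure variant the authors computed.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, in percent.
precision = 71.11
recall = 78.67

f1 = f_measure(precision, recall)
print(f"F-measure: {f1:.2f}%")  # prints "F-measure: 74.70%"
```

Under that assumption, the computed value agrees with the reported F-measure of 74.7%.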

Original language: Malay
Pages (from-to): 216-235
Number of pages: 20
Journal: GEMA Online Journal of Language Studies
Volume: 18
Issue number: 4
DOI: 10.17576/gema-2018-1804-14
Publication status: Published - 1 Nov 2018

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Literature and Literary Theory

Cite this

Pendekatan teknik pengecaman entiti nama bagi capaian berita jenayah bahasa melayu. / Saad, Saidah; Mansor, Mohamed Kamil.

In: GEMA Online Journal of Language Studies, Vol. 18, No. 4, 01.11.2018, p. 216-235.

@article{b4171a7f5868423dacea7d9117b04fd1,
title = "Pendekatan teknik pengecaman entiti nama bagi capaian berita jenayah bahasa melayu",
abstract = "Information extraction is the process of identifying the key concepts that represent the textual content of unstructured documents. Unstructured documents are now abundant: news, articles, blogs, forums, tweets and social-network micro-blogs. Such documents are very difficult for a computer to interpret, so research on information extraction is important for overcoming this problem. One widely used extraction technique is named entity recognition. This research applies named entity recognition to crime news documents in the Malay language. The main objective of the study is to develop a prototype system for extracting information from Malay-language crime news using named entity recognition with a rule-based approach. For the evaluation, a corpus of Malay-language crime news was built from an archival source, BERNAMA news. The corpus was examined manually by linguists to identify entities such as name, organization, location, date, time, financial, percentage, crime and weapons. In parallel, a prototype system was developed and tested on the same corpus, and its results were compared with the expert annotations. Overall, the tests showed good results, with recall of 78.67{\%}, precision of 71.11{\%} and an F-measure of 74.7{\%}. The results are expected to contribute knowledge about the effectiveness of named entity recognition techniques for Malay-language crime news. This could in turn help investigators, police, lawyers and other authorities working in the field of crime to solve crimes more quickly and effectively.",
keywords = "Crime news, Information extraction, Malay language, Named entity recognition, Rule-based approach",
author = "Saad, Saidah and Mansor, {Mohamed Kamil}",
year = "2018",
month = "11",
day = "1",
doi = "10.17576/gema-2018-1804-14",
language = "Malay",
volume = "18",
pages = "216--235",
journal = "GEMA Online Journal of Language Studies",
issn = "1675-8021",
publisher = "Universiti Kebangsaan Malaysia",
number = "4"
}

TY - JOUR

T1 - Pendekatan teknik pengecaman entiti nama bagi capaian berita jenayah bahasa melayu

AU - Saad, Saidah

AU - Mansor, Mohamed Kamil

PY - 2018/11/1

Y1 - 2018/11/1

AB - Information extraction is the process of identifying the key concepts that represent the textual content of unstructured documents. Unstructured documents are now abundant: news, articles, blogs, forums, tweets and social-network micro-blogs. Such documents are very difficult for a computer to interpret, so research on information extraction is important for overcoming this problem. One widely used extraction technique is named entity recognition. This research applies named entity recognition to crime news documents in the Malay language. The main objective of the study is to develop a prototype system for extracting information from Malay-language crime news using named entity recognition with a rule-based approach. For the evaluation, a corpus of Malay-language crime news was built from an archival source, BERNAMA news. The corpus was examined manually by linguists to identify entities such as name, organization, location, date, time, financial, percentage, crime and weapons. In parallel, a prototype system was developed and tested on the same corpus, and its results were compared with the expert annotations. Overall, the tests showed good results, with recall of 78.67%, precision of 71.11% and an F-measure of 74.7%. The results are expected to contribute knowledge about the effectiveness of named entity recognition techniques for Malay-language crime news. This could in turn help investigators, police, lawyers and other authorities working in the field of crime to solve crimes more quickly and effectively.

KW - Crime news

KW - Information extraction

KW - Malay language

KW - Named entity recognition

KW - Rule-based approach

UR - http://www.scopus.com/inward/record.url?scp=85057718700&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85057718700&partnerID=8YFLogxK

U2 - 10.17576/gema-2018-1804-14

DO - 10.17576/gema-2018-1804-14

M3 - Article

AN - SCOPUS:85057718700

VL - 18

SP - 216

EP - 235

JO - GEMA Online Journal of Language Studies

JF - GEMA Online Journal of Language Studies

SN - 1675-8021

IS - 4

ER -