Proper nouns recognition in Arabic crime text using machine learning approach

Suhad Al-Shoukry, Nazlia Omar

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Named Entity Recognition (NER) identifies proper nouns in a text and assigns each one to a named-entity category. This enables the extraction of people's names, locations, organizations, and currencies. A considerable body of research exists on Arabic NER. However, recognizing Arabic named entities is challenging because of the complexity of the Arabic language. One such complexity is the absence of capitalization, a feature that facilitates NER in other languages. Furthermore, there is a lack of lexical corpora that cover all Arabic NEs. In addition, most of the approaches proposed for Arabic NER have been based on handcrafted rules, which can be laborious and time-consuming to build. Therefore, this paper presents our attempt at recognizing and extracting the most important named entities, such as names of persons, locations, organizations, crime types, dates, and times, in Arabic crime documents using a Decision Tree classifier and feature extraction on a crime dataset. The dataset, prepared in varying sizes, was collected from online resources and underwent multiple pre-processing tasks. Feature extraction, which includes POS tagging, keyword triggers, definite articles, and affixes, was then performed, and the classifier uses these features to classify the named entities. The results show that the Decision Tree (DT) yields good results on small datasets but fails on large ones in the crime domain. This is attributed to the difficulty of identifying relevant keywords and features in such domains. The best result in this experiment is an F-measure of 81.35%.
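The feature types named in the abstract (POS tags, keyword triggers, definite articles, and affixes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the trigger list, the function name `token_features`, the two-character affix length, and the hand-assigned POS tags are all assumptions made for illustration.

```python
# Hypothetical trigger words; the paper's actual keyword lists are not given.
TRIGGERS = {
    "السيد": "PERSON",        # "Mr."
    "مدينة": "LOCATION",      # "city"
    "شركة": "ORGANIZATION",   # "company"
}
DEFINITE_ARTICLE = "ال"       # Arabic definite article "al-"


def token_features(tokens, pos_tags, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    prev_token = tokens[i - 1] if i > 0 else ""
    return {
        # POS tag of the token and of its left neighbour
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        # True when the token begins with the definite article
        "has_definite_article": (
            token.startswith(DEFINITE_ARTICLE)
            and len(token) > len(DEFINITE_ARTICLE)
        ),
        # Simple affix features: first and last two characters (assumed length)
        "prefix2": token[:2],
        "suffix2": token[-2:],
        # Entity type suggested by a trigger word directly before the token
        "prev_trigger": TRIGGERS.get(prev_token, "NONE"),
    }


# "Mr. Ahmed arrived in the city of Baghdad" (POS tags assigned by hand)
tokens = ["وصل", "السيد", "أحمد", "إلى", "مدينة", "بغداد"]
pos_tags = ["VERB", "NOUN", "PROPN", "ADP", "NOUN", "PROPN"]
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]

for token, feats in zip(tokens, features):
    print(token, feats["pos"], feats["prev_trigger"], feats["has_definite_article"])
```

Feature dictionaries of this kind could then be vectorized and passed to a decision tree learner, for example scikit-learn's `DictVectorizer` and `DecisionTreeClassifier`; whether the authors used such tooling is not stated in the abstract.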

Original language: English
Pages (from-to): 506-513
Number of pages: 8
Journal: Journal of Theoretical and Applied Information Technology
Volume: 79
Issue number: 3
Publication status: Published - 30 Oct 2015

Keywords

  • Arabic crime document
  • Decision tree
  • Machine learning
  • Named entity recognition
  • Natural language processing

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Proper nouns recognition in Arabic crime text using machine learning approach. / Al-Shoukry, Suhad; Omar, Nazlia.

In: Journal of Theoretical and Applied Information Technology, Vol. 79, No. 3, 30.10.2015, p. 506-513.


@article{b589681c7ebe4577a1e7b1a5ddb9e5d7,
title = "Proper nouns recognition in Arabic crime text using machine learning approach",
abstract = "Named Entity Recognition (NER) identifies proper nouns in a text and assigns each one to a named-entity category. This enables the extraction of people's names, locations, organizations, and currencies. A considerable body of research exists on Arabic NER. However, recognizing Arabic named entities is challenging because of the complexity of the Arabic language. One such complexity is the absence of capitalization, a feature that facilitates NER in other languages. Furthermore, there is a lack of lexical corpora that cover all Arabic NEs. In addition, most of the approaches proposed for Arabic NER have been based on handcrafted rules, which can be laborious and time-consuming to build. Therefore, this paper presents our attempt at recognizing and extracting the most important named entities, such as names of persons, locations, organizations, crime types, dates, and times, in Arabic crime documents using a Decision Tree classifier and feature extraction on a crime dataset. The dataset, prepared in varying sizes, was collected from online resources and underwent multiple pre-processing tasks. Feature extraction, which includes POS tagging, keyword triggers, definite articles, and affixes, was then performed, and the classifier uses these features to classify the named entities. The results show that the Decision Tree (DT) yields good results on small datasets but fails on large ones in the crime domain. This is attributed to the difficulty of identifying relevant keywords and features in such domains. The best result in this experiment is an F-measure of 81.35{\%}.",
keywords = "Arabic crime document, Decision tree, Machine learning, Named entity recognition, Natural language processing",
author = "Suhad Al-Shoukry and Nazlia Omar",
year = "2015",
month = "10",
day = "30",
language = "English",
volume = "79",
pages = "506--513",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "3",

}

TY - JOUR

T1 - Proper nouns recognition in Arabic crime text using machine learning approach

AU - Al-Shoukry, Suhad

AU - Omar, Nazlia

PY - 2015/10/30

Y1 - 2015/10/30

N2 - Named Entity Recognition (NER) identifies proper nouns in a text and assigns each one to a named-entity category. This enables the extraction of people's names, locations, organizations, and currencies. A considerable body of research exists on Arabic NER. However, recognizing Arabic named entities is challenging because of the complexity of the Arabic language. One such complexity is the absence of capitalization, a feature that facilitates NER in other languages. Furthermore, there is a lack of lexical corpora that cover all Arabic NEs. In addition, most of the approaches proposed for Arabic NER have been based on handcrafted rules, which can be laborious and time-consuming to build. Therefore, this paper presents our attempt at recognizing and extracting the most important named entities, such as names of persons, locations, organizations, crime types, dates, and times, in Arabic crime documents using a Decision Tree classifier and feature extraction on a crime dataset. The dataset, prepared in varying sizes, was collected from online resources and underwent multiple pre-processing tasks. Feature extraction, which includes POS tagging, keyword triggers, definite articles, and affixes, was then performed, and the classifier uses these features to classify the named entities. The results show that the Decision Tree (DT) yields good results on small datasets but fails on large ones in the crime domain. This is attributed to the difficulty of identifying relevant keywords and features in such domains. The best result in this experiment is an F-measure of 81.35%.


KW - Arabic crime document

KW - Decision tree

KW - Machine learning

KW - Named entity recognition

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=84943149593&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84943149593&partnerID=8YFLogxK

M3 - Article

VL - 79

SP - 506

EP - 513

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 3

ER -