Building the classical arabic named entity recognition corpus (Canercorpus)

Ramzi Esmail Salah, Lailatul Qadri Zakaria

Research output: Contribution to journal (Article)

Abstract

The past decade has witnessed the construction of background information resources to overcome several challenges in text mining tasks. For non-English languages with poor knowledge sources, such as Arabic, these challenges have become more salient, especially for natural language processing applications that require human annotation. In the Named Entity Recognition (NER) task, several studies have addressed the complexity of Arabic in terms of morphological and syntactic variation. However, only a small number of studies deal with Classical Arabic (CA), which is the official language of the Quran and Hadith. CA was also used for archiving Islamic topics that contain a lot of useful information, which could be of great value if extracted. Therefore, in this paper, we introduce the Classical Arabic Named Entity Recognition corpus, a new corpus of tagged data that can be useful for handling issues in the recognition of Arabic named entities. It is freely available, manually annotated by human experts, and contains more than 7,000 Hadiths. Based on Islamic topics, we classify named entities into 20 types, which include domain-specific entities that have not been handled before, such as Allah, Prophet, Paradise, Hell, and Religion. The differences between standard and classical Arabic are described in detail in this work. Moreover, a comprehensive statistical analysis is introduced to measure the factors that play an important role in manual human annotation.

Original language: English
Pages (from-to): 8340-8351
Number of pages: 12
Journal: Journal of Theoretical and Applied Information Technology
Volume: 96
Issue number: 24
DOIs: 10.1109/INFRKM.2018.8464820
Publication status: Published - 31 Dec 2018

Keywords

  • Arabic corpus
  • Classical arabic
  • Corpus generation
  • Named entity

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Building the classical arabic named entity recognition corpus (Canercorpus). / Salah, Ramzi Esmail; Zakaria, Lailatul Qadri.

In: Journal of Theoretical and Applied Information Technology, Vol. 96, No. 24, 31.12.2018, p. 8340-8351.

@article{b884456086a24f77b0307e6b732dac8a,
title = "Building the classical arabic named entity recognition corpus (Canercorpus)",
keywords = "Arabic corpus, Classical arabic, Corpus generation, Named entity",
author = "Salah, {Ramzi Esmail} and Zakaria, {Lailatul Qadri}",
year = "2018",
month = "12",
day = "31",
doi = "10.1109/INFRKM.2018.8464820",
language = "English",
volume = "96",
pages = "8340--8351",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "24",

}

TY - JOUR

T1 - Building the classical arabic named entity recognition corpus (Canercorpus)

AU - Salah, Ramzi Esmail

AU - Zakaria, Lailatul Qadri

PY - 2018/12/31

Y1 - 2018/12/31

KW - Arabic corpus

KW - Classical arabic

KW - Corpus generation

KW - Named entity

UR - http://www.scopus.com/inward/record.url?scp=85059474436&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059474436&partnerID=8YFLogxK

U2 - 10.1109/INFRKM.2018.8464820

DO - 10.1109/INFRKM.2018.8464820

M3 - Article

AN - SCOPUS:85059474436

VL - 96

SP - 8340

EP - 8351

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 24

ER -