Building the Classical Arabic Named Entity Recognition Corpus (CANERCorpus)

Ramzi Esmail Salah, Lailatul Qadri Zakaria

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The past decade has witnessed the construction of background information resources to overcome several challenges in text mining tasks. For non-English languages with poor knowledge sources, such as Arabic, these challenges have become more salient, especially for natural language processing applications that require human annotation. In the Named Entity Recognition (NER) task, several studies have addressed the complexity of Arabic in terms of morphological and syntactic variation. However, only a small number of studies deal with Classical Arabic (CA), the official language of the Quran and Hadith. CA was also used for archiving Islamic topics that contain a lot of useful information, which could be of great value if extracted. Therefore, in this paper, we introduce the Classical Arabic Named Entity Recognition corpus (CANERCorpus), a new corpus of tagged data that can be useful for handling issues in the recognition of Arabic named entities. It is freely available, manually annotated by human experts, and contains more than 7,000 Hadiths. Based on Islamic topics, we classify named entities into 20 types, including domain-specific entities that have not been handled before, such as Allah, Prophet, Paradise, Hell, and Religion. The differences between Standard and Classical Arabic are described in detail in this work. Moreover, a comprehensive statistical analysis is introduced to measure the factors that play an important role in manual human annotation.

Original language: English
Title of host publication: Proceedings - 2018 4th International Conference on Information Retrieval and Knowledge Management
Subtitle of host publication: Diving into Data Sciences, CAMP 2018
Editors: Shyamala Doraisamy, Azreen Azman, Dayang Nurfatimah Awg Iskandar, Muthukkaruppan Annamalai, Stefan Ruger, Fakhrul Hazman Yusoff, Nurazzah Abd. Rahman, Alistair Moffat, Shahrul Azman Mohd Noah
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 150-157
Number of pages: 8
ISBN (Print): 9781538638125
DOIs: https://doi.org/10.1109/INFRKM.2018.8464820
Publication status: Published - 13 Sep 2018
Event: 4th International Conference on Information Retrieval and Knowledge Management: Diving into Data Sciences, CAMP 2018 - Kota Kinabalu, Sabah, Malaysia
Duration: 26 Mar 2018 to 28 Mar 2018

Other

Other: 4th International Conference on Information Retrieval and Knowledge Management: Diving into Data Sciences, CAMP 2018
Country: Malaysia
City: Kota Kinabalu, Sabah
Period: 26/3/18 to 28/3/18

Keywords

  • Arabic corpus
  • classical Arabic
  • corpus generation
  • named entity

ASJC Scopus subject areas

  • Library and Information Sciences
  • Artificial Intelligence
  • Information Systems
  • Decision Sciences (miscellaneous)
  • Information Systems and Management

Cite this

Salah, R. E., & Zakaria, L. Q. (2018). Building the Classical Arabic Named Entity Recognition Corpus (CANERCorpus). In S. Doraisamy, A. Azman, D. N. A. Iskandar, M. Annamalai, S. Ruger, F. H. Yusoff, N. Abd. Rahman, A. Moffat, ... S. A. M. Noah (Eds.), Proceedings - 2018 4th International Conference on Information Retrieval and Knowledge Management: Diving into Data Sciences, CAMP 2018 (pp. 150-157). [8464820] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/INFRKM.2018.8464820
