Tokenizer for the Malay language using pattern matching

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Tokenization is a fundamental task in text processing. It segments text into information units such as sentences and words. In this paper, we discuss the Natural Language Toolkit (NLTK) tokenizer as a means of matching patterns within text. The purpose of this work is to build a Natural Language Processing (NLP) base for a Jawi corpus. A series of experiments was performed to validate the corpus and to meet the requirements of a Jawi script tokenizer, with promising results. Based on these results, the tokens will be used in a subsequent tagging process.
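The abstract does not give the patterns themselves, so the following is only a minimal sketch of regular-expression tokenization for Jawi (Arabic-script) text. The Unicode ranges, the name `TOKEN_PATTERN`, and the function `tokenize` are assumptions made for illustration and are not taken from the paper. The sketch uses Python's standard `re` module; a pattern of this kind could also be passed to NLTK's `RegexpTokenizer`.

```python
import re
from typing import List

# Assumed pattern (not from the paper). Alternatives are tried left to right:
#   1. a run of Arabic-script letters and diacritics, covering the extended
#      letters used in Jawi (U+0620-U+065F, U+066E-U+06D3, U+0750-U+077F);
#   2. a run of Latin letters;
#   3. a run of digits (any Unicode decimal digits);
#   4. any other single non-whitespace character, e.g. punctuation.
TOKEN_PATTERN = re.compile(
    r"[\u0620-\u065F\u066E-\u06D3\u0750-\u077F]+"
    r"|[A-Za-z]+"
    r"|\d+"
    r"|\S"
)


def tokenize(text: str) -> List[str]:
    """Split text into word, number, and punctuation tokens."""
    # The pattern has no capture groups, so findall returns whole matches.
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical sample sentence, written with escapes to keep the
    # code points explicit: three Jawi words followed by a full stop.
    sample = "\u0633\u0627\u064A \u0645\u0627\u06A9\u0646 \u0646\u0627\u0633\u064A."
    for token in tokenize(sample):
        print(token)
```

Arabic punctuation such as the comma (U+060C) and the full stop (U+06D4) falls outside the letter ranges, so it is emitted as a separate token rather than being attached to the preceding word.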

Original language: English
Title of host publication: International Conference on Intelligent Systems Design and Applications, ISDA
Publisher: IEEE Computer Society
Pages: 140-144
Number of pages: 5
Volume: 2015-January
ISBN (Print): 9781479979387
DOIs: https://doi.org/10.1109/ISDA.2014.7066258
Publication status: Published - 23 Mar 2015
Event: 2014 14th International Conference on Intelligent Systems Design and Applications, ISDA 2014 - Okinawa, Japan
Duration: 28 Nov 2014 to 30 Nov 2014

Keywords

  • Jawi corpus
  • Pattern matching
  • Regular expression
  • Tokenization

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Signal Processing
  • Control and Systems Engineering

Cite this

Bakar, J. A., Omar, K., Nasrudin, M. F., & Murah, M. Z. (2015). Tokenizer for the Malay language using pattern matching. In International Conference on Intelligent Systems Design and Applications, ISDA (Vol. 2015-January, pp. 140-144). [7066258] IEEE Computer Society. https://doi.org/10.1109/ISDA.2014.7066258

@inproceedings{0534f377e3dc468f8661aad7ccb61c20,
title = "Tokenizer for the Malay language using pattern matching",
abstract = "Tokenization is a fundamental task in text processing. It segments text into information units such as sentences and words. In this paper, we discuss the Natural Language Toolkit (NLTK) tokenizer as a means of matching patterns within text. The purpose of this work is to build a Natural Language Processing (NLP) base for a Jawi corpus. A series of experiments was performed to validate the corpus and to meet the requirements of a Jawi script tokenizer, with promising results. Based on these results, the tokens will be used in a subsequent tagging process.",
keywords = "Jawi corpus, Pattern matching, Regular expression, Tokenization",
author = "Bakar, {Juhaida Abu} and Khairuddin Omar and Nasrudin, {Mohammad Faidzul} and Murah, {Mohd. Zamri}",
year = "2015",
month = "3",
day = "23",
doi = "10.1109/ISDA.2014.7066258",
language = "English",
isbn = "9781479979387",
volume = "2015-January",
pages = "140--144",
booktitle = "International Conference on Intelligent Systems Design and Applications, ISDA",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - Tokenizer for the Malay language using pattern matching

AU - Bakar, Juhaida Abu

AU - Omar, Khairuddin

AU - Nasrudin, Mohammad Faidzul

AU - Murah, Mohd. Zamri

PY - 2015/3/23

Y1 - 2015/3/23

AB - Tokenization is a fundamental task in text processing. It segments text into information units such as sentences and words. In this paper, we discuss the Natural Language Toolkit (NLTK) tokenizer as a means of matching patterns within text. The purpose of this work is to build a Natural Language Processing (NLP) base for a Jawi corpus. A series of experiments was performed to validate the corpus and to meet the requirements of a Jawi script tokenizer, with promising results. Based on these results, the tokens will be used in a subsequent tagging process.

KW - Jawi corpus

KW - Pattern matching

KW - Regular expression

KW - Tokenization

UR - http://www.scopus.com/inward/record.url?scp=84940532653&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84940532653&partnerID=8YFLogxK

U2 - 10.1109/ISDA.2014.7066258

DO - 10.1109/ISDA.2014.7066258

M3 - Conference contribution

SN - 9781479979387

VL - 2015-January

SP - 140

EP - 144

BT - International Conference on Intelligent Systems Design and Applications, ISDA

PB - IEEE Computer Society

ER -