Word Embedding for Small and Domain-specific Malay Corpus

Sabrina Tiun, Nor Fariza Mohd Nor, Azhar Jalaludin, Anis Nadiah Che Abdul Rahman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. In this study, the Hansard corpus of the Malaysian Parliament for specific years was used to train a Word2vec model. However, careful hyperparameter settings are required to obtain an accurate WE model, because changing any one hyperparameter affects the model's performance. We therefore trained a series of WE models on the corpus, with each model differing from the others in the value of a single hyperparameter. The models were intrinsically evaluated using three semantic word relations, namely word similarity, dissimilarity, and analogy. The evaluation was performed on the model output and analysed by experts (corpus linguists). The experts' evaluation showed that, for a small, domain-specific corpus, the suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100, and the Skip-gram architecture.
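As a rough illustration of the training setup the abstract describes, the sketch below trains Word2vec models over the reported hyperparameter grid using the gensim library in Python. It is a minimal sketch, not the authors' code: the corpus file name, the preprocessing, and the Malay query words are illustrative assumptions, and the parameter names follow gensim 4.x (earlier versions use "size" instead of "vector_size").

# Minimal sketch of the hyperparameter sweep described in the abstract
# (gensim 4.x). File name, preprocessing, and query words are illustrative.
from gensim.models import Word2Vec

# Assumption: a plain-text Malay Hansard corpus, one sentence per line.
with open("hansard_malay.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f if line.strip()]

models = {}
for window in (5, 10):              # window sizes the experts found suitable
    for dim in (50, 100):           # vector sizes the experts found suitable
        models[(window, dim)] = Word2Vec(
            sentences,
            vector_size=dim,        # embedding dimensionality
            window=window,          # context window size
            sg=1,                   # 1 = Skip-gram architecture (0 = CBOW)
            min_count=2,            # drop very rare tokens in a small corpus
            workers=4,
        )

# Intrinsic probes of the three semantic relations evaluated in the paper;
# the word choices here are hypothetical examples, not the paper's test items.
wv = models[(5, 100)].wv
print(wv.most_similar("kerajaan", topn=5))                      # similarity
print(wv.doesnt_match(["menteri", "parlimen", "durian"]))       # dissimilarity
print(wv.most_similar(positive=["menteri", "wanita"],
                      negative=["lelaki"], topn=5))             # analogy

Nearest-neighbour lists such as these are the kind of model output the corpus linguists would inspect; intrinsic evaluation of this sort needs no held-out labelled data, which suits a small corpus.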

Original language: English
Title of host publication: Computational Science and Technology - 6th ICCST 2019
Editors: Rayner Alfred, Yuto Lim, Haviluddin Haviluddin, Chin Kim On
Publisher: Springer Verlag
Pages: 435-443
Number of pages: 9
ISBN (Print): 9789811500572
DOI: 10.1007/978-981-15-0058-9_42
Publication status: Published - 1 Jan 2020
Event: 6th International Conference on Computational Science and Technology, ICCST 2019 - Kota Kinabalu, Malaysia
Duration: 29 Aug 2019 - 30 Aug 2019

Publication series

Name: Lecture Notes in Electrical Engineering
Volume: 603
ISSN (Print): 1876-1100
ISSN (Electronic): 1876-1119

Conference

Conference: 6th International Conference on Computational Science and Technology, ICCST 2019
Country: Malaysia
City: Kota Kinabalu
Period: 29/8/19 - 30/8/19

Keywords

  • Corpus linguistics
  • Malay corpus
  • word embedding
  • Word2vec

ASJC Scopus subject areas

  • Industrial and Manufacturing Engineering

Cite this

Tiun, S., Nor, N. F. M., Jalaludin, A., & Rahman, A. N. C. A. (2020). Word Embedding for Small and Domain-specific Malay Corpus. In R. Alfred, Y. Lim, H. Haviluddin, & C. K. On (Eds.), Computational Science and Technology - 6th ICCST 2019 (pp. 435-443). (Lecture Notes in Electrical Engineering; Vol. 603). Springer Verlag. https://doi.org/10.1007/978-981-15-0058-9_42
